  1. Dec 01, 2009
    • David Howells's avatar
      SLOW_WORK: Move slow_work's proc file to debugfs · f13a48bd
      David Howells authored
      
      
      Move slow_work's debugging proc file to debugfs.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      Requested-and-acked-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f13a48bd
    • David Howells's avatar
      SLOW_WORK: Fix the CONFIG_MODULES=n case · fa1dae49
      David Howells authored
      
      
      Commit 3d7a641e ("SLOW_WORK: Wait for outstanding work items belonging to a
      module to clear") introduced some code to make sure that all of a module's
      slow-work items were complete before that module was removed, and commit
      3bde31a4 ("SLOW_WORK: Allow a requeueable work item to sleep till the thread is
      needed") further extended that, breaking it in the process if CONFIG_MODULES=n:
      
          CC      kernel/slow-work.o
        kernel/slow-work.c: In function 'slow_work_execute':
        kernel/slow-work.c:313: error: 'slow_work_thread_processing' undeclared (first use in this function)
        kernel/slow-work.c:313: error: (Each undeclared identifier is reported only once
        kernel/slow-work.c:313: error: for each function it appears in.)
        kernel/slow-work.c: In function 'slow_work_wait_for_items':
        kernel/slow-work.c:950: error: 'slow_work_unreg_sync_lock' undeclared (first use in this function)
        kernel/slow-work.c:951: error: 'slow_work_unreg_wq' undeclared (first use in this function)
        kernel/slow-work.c:961: error: 'slow_work_unreg_work_item' undeclared (first use in this function)
        kernel/slow-work.c:974: error: 'slow_work_unreg_module' undeclared (first use in this function)
        kernel/slow-work.c:977: error: 'slow_work_thread_processing' undeclared (first use in this function)
        make[1]: *** [kernel/slow-work.o] Error 1
      
      Fix this by:
      
       (1) Extracting the bits of slow_work_execute() that are contingent on
           CONFIG_MODULES (and the bits that should be) into inline functions,
           placing them in the #ifdef'd section that defines the relevant
           variables, and adding stubs for moduleless kernels.  This allows the
           removal of some #ifdefs.
      
       (2) #ifdef'ing out the contents of slow_work_wait_for_items() in moduleless
           kernels.
      
      The four functions related to handling module unloading synchronisation (and
      their associated variables) could be offloaded into a separate .c file, but
      each function is only used once and three of them are tiny, so doing so would
      prevent them from being inlined.
      
      Reported-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fa1dae49
  2. Nov 19, 2009
    • David Howells's avatar
      SLOW_WORK: Allow a requeueable work item to sleep till the thread is needed · 3bde31a4
      David Howells authored
      
      
      Add a function to allow a requeueable work item to sleep till the thread
      processing it is needed by the slow-work facility to perform other work.
      
      Sometimes a work item can't progress immediately, but must wait for the
      completion of another work item that's currently being processed by another
      slow-work thread.
      
      In some circumstances, the waiting item could instead - theoretically - put
      itself back on the queue and yield its thread back to the slow-work facility,
      thus waiting till it gets processing time again before attempting to progress.
      This would allow other work items processing time on that thread.
      
      However, this only works if there is something on the queue for it to queue
      behind - otherwise it will just get a thread again immediately, and will end
      up cycling between the queue and the thread, eating up valuable CPU time.
      
      So, slow_work_sleep_till_thread_needed() is provided such that an item can put
      itself on a wait queue that will wake it up when the event it is actually
      interested in occurs, then call this function in lieu of calling schedule().
      
      This function will then sleep until either the item's event occurs or another
      work item appears on the queue.  If another work item is queued, but the
      item's event hasn't occurred, then the work item should requeue itself and
      yield the thread back to the slow-work facility by returning.
      
      This can be used by CacheFiles for an object that is being created on one
      thread to wait for an object being deleted on another thread where there is
      nothing on the queue for the creation to go and wait behind.  As soon as an
      item appears on the queue that could be given thread time instead, CacheFiles
      can stick the creating object back on the queue and return to the slow-work
      facility - assuming the object deletion didn't also complete.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      3bde31a4
    • David Howells's avatar
      SLOW_WORK: Allow the work items to be viewed through a /proc file · 8fba10a4
      David Howells authored
      
      
      Allow the executing and queued work items to be viewed through a /proc file
      for debugging purposes.  The contents look something like the following:
      
          THR PID   ITEM ADDR        FL MARK  DESC
          === ===== ================ == ===== ==========
            0  3005 ffff880023f52348  a 952ms FSC: OBJ17d3: LOOK
            1  3006 ffff880024e33668  2 160ms FSC: OBJ17e5 OP60d3b: Write1/Store fl=2
            2  3165 ffff8800296dd180  a 424ms FSC: OBJ17e4: LOOK
            3  4089 ffff8800262c8d78  a 212ms FSC: OBJ17ea: CRTN
            4  4090 ffff88002792bed8  2 388ms FSC: OBJ17e8 OP60d36: Write1/Store fl=2
            5  4092 ffff88002a0ef308  2 388ms FSC: OBJ17e7 OP60d2e: Write1/Store fl=2
            6  4094 ffff88002abaf4b8  2 132ms FSC: OBJ17e2 OP60d4e: Write1/Store fl=2
            7  4095 ffff88002bb188e0  a 388ms FSC: OBJ17e9: CRTN
          vsq     - ffff880023d99668  1 308ms FSC: OBJ17e0 OP60f91: Write1/EnQ fl=2
          vsq     - ffff8800295d1740  1 212ms FSC: OBJ16be OP4d4b6: Write1/EnQ fl=2
          vsq     - ffff880025ba3308  1 160ms FSC: OBJ179a OP58dec: Write1/EnQ fl=2
          vsq     - ffff880024ec83e0  1 160ms FSC: OBJ17ae OP599f2: Write1/EnQ fl=2
          vsq     - ffff880026618e00  1 160ms FSC: OBJ17e6 OP60d33: Write1/EnQ fl=2
          vsq     - ffff880025a2a4b8  1 132ms FSC: OBJ16a2 OP4d583: Write1/EnQ fl=2
          vsq     - ffff880023cbe6d8  9 212ms FSC: OBJ17eb: LOOK
          vsq     - ffff880024d37590  9 212ms FSC: OBJ17ec: LOOK
          vsq     - ffff880027746cb0  9 212ms FSC: OBJ17ed: LOOK
          vsq     - ffff880024d37ae8  9 212ms FSC: OBJ17ee: LOOK
          vsq     - ffff880024d37cb0  9 212ms FSC: OBJ17ef: LOOK
          vsq     - ffff880025036550  9 212ms FSC: OBJ17f0: LOOK
          vsq     - ffff8800250368e0  9 212ms FSC: OBJ17f1: LOOK
          vsq     - ffff880025036aa8  9 212ms FSC: OBJ17f2: LOOK
      
      In the 'THR' column, executing items show the thread they're occupying and
      queued items indicate which queue they're on.  'PID' shows the process ID of
      a slow-work thread that's executing something.  'FL' shows the work item flags.
      'MARK' indicates how long since an item was queued or began executing.  Lastly,
      the 'DESC' column permits the owner of an item to give some information.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      8fba10a4
    • Jens Axboe's avatar
      SLOW_WORK: Add delayed_slow_work support · 6b8268b1
      Jens Axboe authored
      
      
      This adds support for starting slow work with a delay, similar
      to the functionality we have for workqueues.
      
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      6b8268b1
    • Jens Axboe's avatar
      SLOW_WORK: Add support for cancellation of slow work · 01609502
      Jens Axboe authored
      
      
      Add support for cancellation of queued slow work and delayed slow work items.
      The cancellation functions will wait for items that are pending or undergoing
      execution to be discarded by the slow work facility.
      
      Attempting to enqueue work that is in the process of being cancelled will
      result in ECANCELED.
      
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      01609502
    • Jens Axboe's avatar
      SLOW_WORK: Make slow_work_ops ->get_ref/->put_ref optional · 4d8bb2cb
      Jens Axboe authored
      
      
      Make the ability for the slow-work facility to take references on a work item
      optional as not everyone requires this.
      
      Even the slow-work internals only stub them out, so those stubs can be got rid of too.
      
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      4d8bb2cb
    • David Howells's avatar
      SLOW_WORK: Wait for outstanding work items belonging to a module to clear · 3d7a641e
      David Howells authored
      
      
      Wait for outstanding slow work items belonging to a module to clear when
      unregistering that module as a user of the facility.  This prevents the put_ref
      code of a work item from being taken away before it returns.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      3d7a641e
  3. Nov 18, 2009
  4. Nov 12, 2009
  5. Nov 08, 2009
  6. Nov 07, 2009
    • Yong Zhang's avatar
      genirq: try_one_irq() must be called with irq disabled · e7e7e0c0
      Yong Zhang authored
      
      
      Prarit reported:
      =================================
      [ INFO: inconsistent lock state ]
      2.6.32-rc5 #1
      ---------------------------------
      inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
      swapper/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
       (&irq_desc_lock_class){?.-...}, at: [<ffffffff810c264e>] try_one_irq+0x32/0x138
      {IN-HARDIRQ-W} state was registered at:
       [<ffffffff81095160>] __lock_acquire+0x2fc/0xd5d
       [<ffffffff81095cb4>] lock_acquire+0xf3/0x12d
       [<ffffffff814cdadd>] _spin_lock+0x40/0x89
       [<ffffffff810c3389>] handle_level_irq+0x30/0x105
       [<ffffffff81014e0e>] handle_irq+0x95/0xb7
       [<ffffffff810141bd>] do_IRQ+0x6a/0xe0
       [<ffffffff81012813>] ret_from_intr+0x0/0x16
      irq event stamp: 195096
      hardirqs last  enabled at (195096): [<ffffffff814cd7f7>] _spin_unlock_irq+0x3a/0x5c
      hardirqs last disabled at (195095): [<ffffffff814cdbdd>] _spin_lock_irq+0x29/0x95
      softirqs last  enabled at (195088): [<ffffffff81068c92>] __do_softirq+0x1c1/0x1ef
      softirqs last disabled at (195093): [<ffffffff8101304c>] call_softirq+0x1c/0x30
      
      other info that might help us debug this:
      1 lock held by swapper/0:
       #0:  (kernel/irq/spurious.c:21){+.-...}, at: [<ffffffff81070cf2>]
      run_timer_softirq+0x1a9/0x315
      
      stack backtrace:
      Pid: 0, comm: swapper Not tainted 2.6.32-rc5 #1
      Call Trace:
       <IRQ>  [<ffffffff81093e94>] valid_state+0x187/0x1ae
       [<ffffffff81093fe4>] mark_lock+0x129/0x253
       [<ffffffff810951d4>] __lock_acquire+0x370/0xd5d
       [<ffffffff81095cb4>] lock_acquire+0xf3/0x12d
       [<ffffffff814cdadd>] _spin_lock+0x40/0x89
       [<ffffffff810c264e>] try_one_irq+0x32/0x138
       [<ffffffff810c2795>] poll_all_shared_irqs+0x41/0x6d
       [<ffffffff810c27dd>] poll_spurious_irqs+0x1c/0x49
       [<ffffffff81070d82>] run_timer_softirq+0x239/0x315
       [<ffffffff81068bd3>] __do_softirq+0x102/0x1ef
       [<ffffffff8101304c>] call_softirq+0x1c/0x30
       [<ffffffff81014b65>] do_softirq+0x59/0xca
       [<ffffffff810686ad>] irq_exit+0x58/0xae
       [<ffffffff81029b84>] smp_apic_timer_interrupt+0x94/0xba
       [<ffffffff81012a33>] apic_timer_interrupt+0x13/0x20
      
      The reason is that try_one_irq() is called from hardirq context with
      interrupts disabled and from softirq context (poll_all_shared_irqs())
      with interrupts enabled.
      
      Disable interrupts before calling it from poll_all_shared_irqs().
      
      Reported-and-tested-by: Prarit Bhargava <prarit@redhat.com>
      Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
      LKML-Reference: <1257563773-4620-1-git-send-email-yong.zhang0@gmail.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      e7e7e0c0
  7. Nov 04, 2009
  8. Nov 03, 2009
    • Ian Campbell's avatar
      Correct nr_processes() when CPUs have been unplugged · 1d510750
      Ian Campbell authored
      nr_processes() returns the sum of the per cpu counter process_counts for
      all online CPUs. This counter is incremented for the current CPU on
      fork() and decremented for the current CPU on exit(). Since a process
      does not necessarily fork and exit on the same CPU the process_count for
      an individual CPU can be either positive or negative and effectively has
      no meaning in isolation.
      
      Therefore calculating the sum of process_counts over only the online
      CPUs omits the processes which were started or stopped on any CPU which
      has since been unplugged. Only the sum of process_counts across all
      possible CPUs has meaning.
      
      The only caller of nr_processes() is proc_root_getattr() which
      calculates the number of links to /proc as
              stat->nlink = proc_root.nlink + nr_processes();
      
      You don't have to be all that unlucky for nr_processes() to return a
      negative value leading to a negative number of links (or rather, an
      apparently enormous number of links). If this happens then you can get
      failures where things like "ls /proc" start to fail because they got an
      -EOVERFLOW from some stat() call.
      
      Example with some debugging inserted to show what goes on:
              # ps haux|wc -l
              nr_processes: CPU0:     90
              nr_processes: CPU1:     1030
              nr_processes: CPU2:     -900
              nr_processes: CPU3:     -136
              nr_processes: TOTAL:    84
              proc_root_getattr. nlink 12 + nr_processes() 84 = 96
              84
              # echo 0 >/sys/devices/system/cpu/cpu1/online
              # ps haux|wc -l
              nr_processes: CPU0:     85
              nr_processes: CPU2:     -901
              nr_processes: CPU3:     -137
              nr_processes: TOTAL:    -953
              proc_root_getattr. nlink 12 + nr_processes() -953 = -941
              75
              # stat /proc/
              nr_processes: CPU0:     84
              nr_processes: CPU2:     -901
              nr_processes: CPU3:     -137
              nr_processes: TOTAL:    -954
              proc_root_getattr. nlink 12 + nr_processes() -954 = -942
                File: `/proc/'
                Size: 0               Blocks: 0          IO Block: 1024   directory
              Device: 3h/3d   Inode: 1           Links: 4294966354
              Access: (0555/dr-xr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
              Access: 2009-11-03 09:06:55.000000000 +0000
              Modify: 2009-11-03 09:06:55.000000000 +0000
              Change: 2009-11-03 09:06:55.000000000 +0000
      
      I'm not 100% convinced that the per_cpu regions remain valid for offline
      CPUs, although my testing suggests that they do. If not then I think the
      correct solution would be to aggregate the process_count for a given CPU
      into a global base value in cpu_down().
      
      This bug appears to pre-date the transition to git and it looks like it
      may even have been present in linux-2.6.0-test7-bk3, since it looks like
      the code Rusty patched in http://lwn.net/Articles/64773/ was already
      wrong.
      
      Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1d510750
    • Jiri Slaby's avatar
      PM / Hibernate: Add newline to load_image() fail path · bf9fd67a
      Jiri Slaby authored
      
      
      Finish the line with a \n when load_image() fails in the middle of loading.
      
      Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
      Acked-by: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      bf9fd67a
    • Jiri Slaby's avatar
      PM / Hibernate: Fix error handling in save_image() · 4ff277f9
      Jiri Slaby authored
      
      
      There are too many retval variables in save_image(). Thus the error return
      value from snapshot_read_next() may be ignored and only part of the
      snapshot (successfully) written.
      
      Remove 'error' variable, invert the condition in the do-while loop
      and convert the loop to use only 'ret' variable.
      
      Switch the rest of the function to consider only 'ret'.
      
      Also make sure we end the printed line with a \n if an error occurs.
      
      Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
      Acked-by: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      4ff277f9
    • Jiri Slaby's avatar
      PM / Hibernate: Fix blkdev refleaks · 76b57e61
      Jiri Slaby authored
      
      
      While cruising through the swsusp code I found a few blkdev reference
      leaks of resume_bdev.
      
      swsusp_read: remove blkdev_put altogether. Some fail paths do
                   not do that.
      swsusp_check: make sure we always put a reference on fail paths
      software_resume: all fail paths between swsusp_check and swsusp_read
                       omit swsusp_close. Add it in those cases. And since
                       swsusp_read doesn't drop the reference anymore, do
                       it here unconditionally.
      
      [rjw: Fixed a small coding style issue.]
      
      Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      76b57e61
    • Mike Galbraith's avatar
      sched: Fix kthread_bind() by moving the body of kthread_bind() to sched.c · b84ff7d6
      Mike Galbraith authored
      
      
      Eric Paris reported that commit
      f685ceac causes boot time
      PREEMPT_DEBUG complaints.
      
       [    4.590699] BUG: using smp_processor_id() in preemptible [00000000] code: rmmod/1314
       [    4.593043] caller is task_hot+0x86/0xd0
      
      Since kthread_bind() messes with scheduler internals, move the
      body to sched.c, and lock the runqueue.
      
      Reported-by: Eric Paris <eparis@redhat.com>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Tested-by: Eric Paris <eparis@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1256813310.7574.3.camel@marge.simson.net>
      [ v2: fix !SMP build and clean up ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      b84ff7d6
  9. Nov 02, 2009
    • Paul E. McKenney's avatar
      rcu: Fix long-grace-period race between forcing and initialization · 83f5b01f
      Paul E. McKenney authored
      
      
      Very long RCU read-side critical sections (50 milliseconds or
      so) can cause a race between force_quiescent_state() and
      rcu_start_gp() as follows on kernel builds with multi-level
      rcu_node hierarchies:
      
      1.	CPU 0 calls force_quiescent_state(), sees that there is a
      	grace period in progress, and acquires ->fsqlock.
      
      2.	CPU 1 detects the end of the grace period, and so
      	cpu_quiet_msk_finish() sets rsp->completed to rsp->gpnum.
      	This operation is carried out under the root rnp->lock,
      	but CPU 0 has not yet acquired that lock.  Note that
      	rsp->signaled is still RCU_SAVE_DYNTICK from the last
      	grace period.
      
      3.	CPU 1 calls rcu_start_gp(), but no one wants a new grace
      	period, so it drops the root rnp->lock and returns.
      
      4.	CPU 0 acquires the root rnp->lock and picks up rsp->completed
      	and rsp->signaled, then drops rnp->lock.  It then enters the
      	RCU_SAVE_DYNTICK leg of the switch statement.
      
      5.	CPU 2 invokes call_rcu(), and now needs a new grace period.
      	It calls rcu_start_gp(), which acquires the root rnp->lock, sets
      	rsp->signaled to RCU_GP_INIT (too bad that CPU 0 is already in
      	the RCU_SAVE_DYNTICK leg of the switch statement!)  and starts
      	initializing the rcu_node hierarchy.  If there are multiple
      	levels to the hierarchy, it will drop the root rnp->lock and
      	initialize the lower levels of the hierarchy.
      
      6.	CPU 0 notes that rsp->completed has not changed, which permits
              both CPU 2 and CPU 0 to try updating it concurrently.  If CPU 0's
      	update prevails, later calls to force_quiescent_state() can
      	count old quiescent states against the new grace period, which
      	can in turn result in premature ending of grace periods.
      
      	Not good.
      
      This patch adds an RCU_GP_IDLE state for rsp->signaled that is
      set initially at boot time and any time a grace period ends.
      This prevents CPU 0 from getting into the workings of
      force_quiescent_state() in step 4.  Additional locking and
      checks prevent the concurrent update of rsp->signaled in step 6.
      
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: josh@joshtriplett.org
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      Cc: Valdis.Kletnieks@vt.edu
      Cc: dhowells@redhat.com
      LKML-Reference: <1256742889199-git-send-email->
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      83f5b01f
    • Thomas Gleixner's avatar
      uids: Prevent tear down race · b00bc0b2
      Thomas Gleixner authored
      
      
      Ingo triggered the following warning:
      
      WARNING: at lib/debugobjects.c:255 debug_print_object+0x42/0x50()
      Hardware name: System Product Name
      ODEBUG: init active object type: timer_list
      Modules linked in:
      Pid: 2619, comm: dmesg Tainted: G        W  2.6.32-rc5-tip+ #5298
      Call Trace:
       [<81035443>] warn_slowpath_common+0x6a/0x81
       [<8120e483>] ? debug_print_object+0x42/0x50
       [<81035498>] warn_slowpath_fmt+0x29/0x2c
       [<8120e483>] debug_print_object+0x42/0x50
       [<8120ec2a>] __debug_object_init+0x279/0x2d7
       [<8120ecb3>] debug_object_init+0x13/0x18
       [<810409d2>] init_timer_key+0x17/0x6f
       [<81041526>] free_uid+0x50/0x6c
       [<8104ed2d>] put_cred_rcu+0x61/0x72
       [<81067fac>] rcu_do_batch+0x70/0x121
      
      debugobjects warns about an enqueued timer being initialized. If
      CONFIG_USER_SCHED=y the user management code uses delayed work to
      remove the user from the hash table and tear down the sysfs objects.
      
      free_uid is called from RCU and initializes/schedules delayed work if
      the usage count of the user_struct is 0. The init/schedule happens
      outside of the uidhash_lock protected region which allows a concurrent
      caller of find_user() to reference the about to be destroyed
      user_struct w/o preventing the work from being scheduled. If the next
      free_uid call happens before the work timer expired then the active
      timer is initialized and the work scheduled again.
      
      The race was introduced in commit 5cb350ba (sched: group scheduling,
      sysfs tunables) and made more prominent by commit 3959214f (sched:
      delayed cleanup of user_struct).
      
      Move the init/schedule_delayed_work inside of the uidhash_lock
      protected region to prevent the race.
      
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@us.ibm.com>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Cc: stable@kernel.org
      b00bc0b2
    • Rusty Russell's avatar
      sched: Fix boot crash by zalloc()ing most of the cpu masks · 49557e62
      Rusty Russell authored
      
      
      I got a boot crash when forcing cpumasks offstack on 32 bit,
      because find_new_ilb() returned 3 on my UP system (nohz.cpu_mask
      wasn't zeroed).
      
      AFAICT the others need to be zeroed too: only
      nohz.ilb_grp_nohz_mask is initialized before use.
      
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      LKML-Reference: <200911022037.21282.rusty@rustcorp.com.au>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      49557e62
  10. Oct 29, 2009
  11. Oct 28, 2009
    • Rusty Russell's avatar
      param: fix setting arrays of bool · 3c7d76e3
      Rusty Russell authored
      
      
      We create a dummy struct kernel_param on the stack for parsing each
      array element, but we didn't initialize the flags word.  This matters
      for arrays of type "bool", where the flag indicates if it really is
      an array of bools or unsigned int (old-style).
      
      Reported-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: stable@kernel.org
      3c7d76e3
    • Rusty Russell's avatar
      param: fix NULL comparison on oom · d553ad86
      Rusty Russell authored
      
      
      kp->arg is always true: it's the contents of that pointer we care about.
      
      Reported-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: stable@kernel.org
      d553ad86
    • Rusty Russell's avatar
      param: fix lots of bugs with writing charp params from sysfs, by leaking mem. · 65afac7d
      Rusty Russell authored
      
      
      e180a6b7 "param: fix charp parameters set via sysfs" fixed the case
      where charp parameters written via sysfs were freed, leaving drivers
      accessing random memory.
      
      Unfortunately, storing a flag in the kparam struct was a bad idea: it's
      rodata so setting it causes an oops on some archs.  But that's not all:
      
      1) module_param_array() on charp doesn't work reliably, since we use an
         uninitialized temporary struct kernel_param.
      2) there's a fundamental race if a module uses this parameter and then
         it's changed: they will still access the old, freed, memory.
      
      The simplest fix (ie. for 2.6.32) is to never free the memory.  This
      prevents all these problems, at cost of a memory leak.  In practice, there
      are only 18 places where a charp is writable via sysfs, and all are
      root-only writable.
      
      Reported-by: Takashi Iwai <tiwai@suse.de>
      Cc: Sitsofe Wheeler <sitsofe@yahoo.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Christof Schmitt <christof.schmitt@de.ibm.com>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: stable@kernel.org
      65afac7d
    • Thomas Gleixner's avatar
      futex: Fix spurious wakeup for requeue_pi really · 11df6ddd
      Thomas Gleixner authored
      
      
      The requeue_pi path doesn't use unqueue_me() (and the racy lock_ptr ==
      NULL test), nor does it use the wake_list of futex_wake(), which were
      the reasons for commit 41890f2 (futex: Handle spurious wake up).
      
      See the debugging discussion on LKML, Message-ID: <4AD4080C.20703@us.ibm.com>
      
      The changes in this fix to the wait_requeue_pi path were considered
      likely unnecessary, but a harmless safety net. But it turns out that
      because, for unknown $@#!*( reasons, EWOULDBLOCK is defined as EAGAIN,
      we built an endless loop into the code path which correctly returns
      EWOULDBLOCK.
      
      Spurious wakeups in wait_requeue_pi code path are unlikely so we do
      the easy solution and return EWOULDBLOCK^WEAGAIN to user space and let
      it deal with the spurious wakeup.
      
      Cc: Darren Hart <dvhltc@us.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: John Stultz <johnstul@linux.vnet.ibm.com>
      Cc: Dinakar Guniguntala <dino@in.ibm.com>
      LKML-Reference: <4AE23C74.1090502@us.ibm.com>
      Cc: stable@kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      11df6ddd
    • Jiri Kosina's avatar
      sched: move rq_weight data array out of .percpu · 4a6cc4bd
      Jiri Kosina authored
      
      
      Commit 34d76c41 introduced the percpu array update_shares_data, whose size
      is proportional to NR_CPUS. Unfortunately this blows up ia64 for large
      NR_CPUS configuration, as ia64 allows only 64k for .percpu section.
      
      Fix this by allocating the array dynamically and keeping only a pointer
      to it percpu.
      
      The per-cpu handling doesn't impose a significant performance penalty on
      the potentially contended path in tg_shares_up().
      
      Before:

      ...
      ffffffff8104337c:       65 48 8b 14 25 20 cd    mov    %gs:0xcd20,%rdx
      ffffffff81043383:       00 00
      ffffffff81043385:       48 c7 c0 00 e1 00 00    mov    $0xe100,%rax
      ffffffff8104338c:       48 c7 45 a0 00 00 00    movq   $0x0,-0x60(%rbp)
      ffffffff81043393:       00
      ffffffff81043394:       48 c7 45 a8 00 00 00    movq   $0x0,-0x58(%rbp)
      ffffffff8104339b:       00
      ffffffff8104339c:       48 01 d0                add    %rdx,%rax
      ffffffff8104339f:       49 8d 94 24 08 01 00    lea    0x108(%r12),%rdx
      ffffffff810433a6:       00
      ffffffff810433a7:       b9 ff ff ff ff          mov    $0xffffffff,%ecx
      ffffffff810433ac:       48 89 45 b0             mov    %rax,-0x50(%rbp)
      ffffffff810433b0:       bb 00 04 00 00          mov    $0x400,%ebx
      ffffffff810433b5:       48 89 55 c0             mov    %rdx,-0x40(%rbp)
      ...
      
      After:
      
      ...
      ffffffff8104337c:       65 8b 04 25 28 cd 00    mov    %gs:0xcd28,%eax
      ffffffff81043383:       00
      ffffffff81043384:       48 98                   cltq
      ffffffff81043386:       49 8d bc 24 08 01 00    lea    0x108(%r12),%rdi
      ffffffff8104338d:       00
      ffffffff8104338e:       48 8b 15 d3 7f 76 00    mov    0x767fd3(%rip),%rdx        # ffffffff817ab368 <update_shares_data>
      ffffffff81043395:       48 8b 34 c5 00 ee 6d    mov    -0x7e921200(,%rax,8),%rsi
      ffffffff8104339c:       81
      ffffffff8104339d:       48 c7 45 a0 00 00 00    movq   $0x0,-0x60(%rbp)
      ffffffff810433a4:       00
      ffffffff810433a5:       b9 ff ff ff ff          mov    $0xffffffff,%ecx
      ffffffff810433aa:       48 89 7d c0             mov    %rdi,-0x40(%rbp)
      ffffffff810433ae:       48 c7 45 a8 00 00 00    movq   $0x0,-0x58(%rbp)
      ffffffff810433b5:       00
      ffffffff810433b6:       bb 00 04 00 00          mov    $0x400,%ebx
      ffffffff810433bb:       48 01 f2                add    %rsi,%rdx
      ffffffff810433be:       48 89 55 b0             mov    %rdx,-0x50(%rbp)
      ...
      
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      4a6cc4bd
  12. Oct 24, 2009
  13. Oct 23, 2009
    • sched: Strengthen buddies and mitigate buddy induced latencies · f685ceac
      Mike Galbraith authored
      
      
      This patch restores the effectiveness of LAST_BUDDY in preventing
      pgsql+oltp from collapsing due to wakeup preemption. It also
      switches LAST_BUDDY to exclusively do what it does best, namely
      mitigate the effects of aggressive wakeup preemption, which
      improves vmark throughput markedly, and restores mysql+oltp
      scalability.
      
      Since buddies are about scalability, enable them beginning at the
      point where we begin expanding sched_latency, namely
      sched_nr_latency. Previously, buddies were cleared aggressively,
      which seriously reduced their effectiveness. Not clearing
      aggressively however, produces a small drop in mysql+oltp
      throughput immediately after peak, indicating that LAST_BUDDY is
      actually doing some harm. This is right at the point where X on the
      desktop in competition with another load wants low latency service.
      Ergo, do not enable until we need to scale.
      
      To mitigate latency induced by buddies, or by a task just missing
      wakeup preemption, check latency at tick time.
      
      Last hunk prevents buddies from stymieing BALANCE_NEWIDLE via
      CACHE_HOT_BUDDY.
      
      Supporting performance tests:
      
       tip   = v2.6.32-rc5-1497-ga525b32
       tipx  = NO_GENTLE_FAIR_SLEEPERS NEXT_BUDDY granularity knobs = 31 knobs + 31 buddies
       tip+x = NO_GENTLE_FAIR_SLEEPERS granularity knobs = 31 knobs
      
      (Three run averages except where noted.)
      
       vmark:
       ------
       tip           108466 messages per second
       tip+          125307 messages per second
       tip+x         125335 messages per second
       tipx          117781 messages per second
       2.6.31.3      122729 messages per second
      
       mysql+oltp:
       -----------
       clients          1        2        4        8       16       32       64        128    256
       ..........................................................................................
       tip        9949.89 18690.20 34801.24 34460.04 32682.88 30765.97 28305.27 25059.64 19548.08
       tip+      10013.90 18526.84 34900.38 34420.14 33069.83 32083.40 30578.30 28010.71 25605.47
       tipx       9698.71 18002.70 34477.56 33420.01 32634.30 31657.27 29932.67 26827.52 21487.18
       2.6.31.3   8243.11 18784.20 34404.83 33148.38 31900.32 31161.90 29663.81 25995.94 18058.86
      
       pgsql+oltp:
       -----------
       clients          1        2        4        8       16       32       64      128      256
       ..........................................................................................
       tip       13686.37 26609.25 51934.28 51347.81 49479.51 45312.65 36691.91 26851.57 24145.35
       tip+ (1x) 13907.85 27135.87 52951.98 52514.04 51742.52 50705.43 49947.97 48374.19 46227.94
       tip+x     13906.78 27065.81 52951.19 52542.59 52176.11 51815.94 50838.90 49439.46 46891.00
       tipx      13742.46 26769.81 52351.99 51891.73 51320.79 50938.98 50248.65 48908.70 46553.84
       2.6.31.3  13815.35 26906.46 52683.34 52061.31 51937.10 51376.80 50474.28 49394.47 47003.25
      
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      f685ceac
    • perf events: Don't generate events for the idle task when exclude_idle is set · 54f44076
      Soeren Sandmann authored
      
      
      Getting samples for the idle task is often not interesting, so
      don't generate them when exclude_idle is set for the event in
      question.
      
      Signed-off-by: Søren Sandmann Pedersen <sandmann@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      LKML-Reference: <ye8pr8fmlq7.fsf@camel16.daimi.au.dk>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      54f44076
    • perf events: Fix swevent hrtimer sampling by keeping track of remaining time... · 721a669b
      Soeren Sandmann authored
      
      perf events: Fix swevent hrtimer sampling by keeping track of remaining time when enabling/disabling swevent hrtimers
      
      Make the hrtimer based events work for sysprof.
      
      Whenever a swevent is scheduled out, the hrtimer is canceled.
      When it is scheduled back in, the timer is restarted. This
      happens every scheduler tick, which means the timer never
      expired because it was getting repeatedly restarted over and
      over with the same period.
      
      To fix that, save the remaining time when disabling; when
      reenabling, use that saved time as the period instead of the
      user-specified sampling period.
      
      Also, move the starting and stopping of the hrtimers to helper
      functions instead of duplicating the code.
      
      Signed-off-by: Søren Sandmann Pedersen <sandmann@redhat.com>
      LKML-Reference: <ye8vdi7mluz.fsf@camel16.daimi.au.dk>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      721a669b
  14. Oct 21, 2009
  15. Oct 19, 2009
    • HWPOISON: Allow schedule_on_each_cpu() from keventd · 65a64464
      Andi Kleen authored
      
      
      Right now when calling schedule_on_each_cpu() from keventd there
      is a deadlock because it tries to schedule a work item on the current CPU
      too. This happens via lru_add_drain_all() in hwpoison.
      
      Just call the function for the current CPU in this case. This is actually
      faster too.
      
      Debugging with Fengguang Wu & Max Asbock
      
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      65a64464