  1. Jun 28, 2007
  2. Jun 24, 2007
  3. Jun 21, 2007
    • posix-timers: Prevent softirq starvation by small intervals and SIG_IGN · 58229a18
      Thomas Gleixner authored
      
      
      Posix timers which deliver an ignored signal are currently rearmed in
      the timer softirq. This is necessary because the timer needs to be
      delivered again when SIG_IGN is removed. This is not a problem when
      the interval is reasonable.
      
      With high resolution timers enabled one might arm a posix timer with a
      very small interval and ignore the signal. This might lead to a
      softirq starvation when the interval is so small that the timer is
      requeued onto the softirq pending list right away.
      
      This problem was pointed out by Jan Kiszka. Thanks Jan !
      
      The correct solution would be to stop the timer, when the signal is
      ignored and rearm it when SIG_IGN is removed. Unfortunately this
      requires modification in sigaction and involves non trivial sighand
      locking. It's too late in the release cycle for such a change.
      
      For now we just keep the timer running and enforce that the timer
      fires at most once per jiffy. This does not break anything as we keep
      the overrun counter correct. It adds a little inaccuracy to the
      timer_gettime() interface, but...
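
      A minimal user-space sketch of the idea, not the kernel code itself:
      when the interval is shorter than a jiffy, "now" is pushed one jiffy
      ahead before the timer is forwarded, so the next expiry lands at least
      a jiffy away and the skipped periods are added to the overrun count.
      JIFFY_NS, forward_timer() and rearm_ignored() are hypothetical names
      chosen for illustration, and HZ=1000 is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: one jiffy at HZ=1000, expressed in nanoseconds. */
#define JIFFY_NS 1000000LL

/*
 * Advance *expires past "now" in whole multiples of "interval" and
 * return how many periods were added.
 */
static uint64_t forward_timer(int64_t *expires, int64_t now, int64_t interval)
{
	assert(interval > 0);
	if (now < *expires)
		return 0;

	uint64_t periods = (uint64_t)((now - *expires) / interval) + 1;
	*expires += (int64_t)periods * interval;
	return periods;
}

/*
 * Rearm a timer whose signal is ignored. For sub-jiffy intervals,
 * pretend "now" is one jiffy later so the timer cannot be requeued
 * right away; the skipped periods are accounted in *overrun.
 */
static void rearm_ignored(int64_t *expires, int64_t now, int64_t interval,
			  uint64_t *overrun)
{
	if (interval < JIFFY_NS)
		now += JIFFY_NS;
	*overrun += forward_timer(expires, now, interval);
}
```

      With a 1 us interval the timer is pushed just over one jiffy ahead and
      the periods in between show up in the overrun count instead of as
      individual expiries.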
      
      The more complex change is necessary anyway to fix another
      shortcoming of the current implementation, which I discovered while
      looking at this problem: A pending signal is discarded when SIG_IGN is
      set. If a posix timer signal is pending, it is discarded as well,
      but when SIG_IGN is removed later nothing rearms the timer. This is
      not new, it's been that way since posix timers were merged. So nothing
      to worry about right now.
      
      I have a working solution to fix all of this, but the impact is too
      large for both stable and 2.6.22. I'm going to send it out for review
      in the next days.
      
      This should go into 2.6.21.stable as well.
      
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: Jan Kiszka <jan.kiszka@web.de>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Cc: Stable Team <stable@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. Jun 18, 2007
    • Fix possible runqueue lock starvation in wait_task_inactive() · fa490cfd
      Linus Torvalds authored
      
      
      Miklos Szeredi reported very long pauses (several seconds, sometimes
      more) on his T60 (with a Core2Duo) which he managed to track down to
      wait_task_inactive()'s open-coded busy-loop.
      
      He observed that an interrupt on one core tries to acquire the
      runqueue-lock but does not succeed in doing so for a very long time -
      while wait_task_inactive() on the other core loops waiting for the first
      core to deschedule a task (which it won't do while spinning in an
      interrupt handler).
      
      This rewrites wait_task_inactive() to do all its waiting optimistically
      without any locks taken at all, and then just double-check the end
      result with the proper runqueue lock held over just a very short
      section.  If there were races in the optimistic wait, or a preemption
      event scheduled the process away, we simply re-synchronize, and start
      over.
      
      So the code now looks like this:
      
      	repeat:
      		/* Unlocked, optimistic looping! */
      		rq = task_rq(p);
      		while (task_running(rq, p))
      			cpu_relax();
      
      		/* Get the *real* values */
      		rq = task_rq_lock(p, &flags);
      		running = task_running(rq, p);
      		array = p->array;
      		task_rq_unlock(rq, &flags);
      
      		/* Check them.. */
      		if (unlikely(running)) {
      			cpu_relax();
      			goto repeat;
      		}
      
      		/* Preempted away? Yield if so.. */
      		if (unlikely(array)) {
      			yield();
      			goto repeat;
      		}
      
      Basically, that first "while()" loop is done entirely without any
      locking at all (and doesn't check for the case where the target process
      might have been preempted away), and so it's possibly "incorrect", but
      we don't really care.  Both the runqueue used, and the "task_running()"
      check might be the wrong tests, but they won't oops - they just mean
      that we could possibly get the wrong results due to lack of locking and
      exit the loop early in the case of a race condition.
      
      So once we've exited the loop, we then get the proper (and careful) rq
      lock, and check the running/runnable state _safely_.  And if it turns
      out that our quick-and-dirty and unsafe loop was wrong after all, we
      just go back and try it all again.
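
      The same pattern can be sketched in user space, under loud assumptions:
      a pthread mutex stands in for the runqueue lock, C11 atomics stand in
      for the task state, and sched_yield() replaces cpu_relax(). struct task,
      task_init() and wait_inactive() are hypothetical names, not kernel API.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the state the kernel code inspects. */
struct task {
	pthread_mutex_t lock;	/* plays the role of the runqueue lock */
	atomic_bool on_cpu;	/* task is currently executing */
	atomic_bool queued;	/* task is still runnable (preempted) */
};

static void task_init(struct task *t, bool on_cpu, bool queued)
{
	pthread_mutex_init(&t->lock, NULL);
	atomic_init(&t->on_cpu, on_cpu);
	atomic_init(&t->queued, queued);
}

/*
 * Wait until the task is neither running nor queued.
 * Returns how many locked double-checks were needed.
 */
static int wait_inactive(struct task *t)
{
	int locked_checks = 0;

	assert(t != NULL);
	for (;;) {
		/* Unlocked, optimistic wait: may exit early on a race. */
		while (atomic_load_explicit(&t->on_cpu, memory_order_relaxed))
			sched_yield();

		/* Short locked section: read the real values. */
		pthread_mutex_lock(&t->lock);
		bool on_cpu = atomic_load(&t->on_cpu);
		bool queued = atomic_load(&t->queued);
		pthread_mutex_unlock(&t->lock);
		locked_checks++;

		/* Only the locked result is trusted. */
		if (!on_cpu && !queued)
			return locked_checks;

		/* The optimistic wait was wrong: back off and start over. */
		sched_yield();
	}
}
```

      The lock is held only for the two reads, so a contending thread is
      never kept waiting for the duration of the spin.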
      
      (The patch also adds a lot of comments, which is the actual bulk of it
      all, to make it more obvious why we can do these things without holding
      the locks).
      
      Thanks to Miklos for all the testing and tracking it down.
      
      Tested-by: Miklos Szeredi <miklos@szeredi.hu>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sched: fix SysRq-N (normalize RT tasks) · a0f98a1c
      Ingo Molnar authored
      
      
      Gene Heskett reported the following problem while testing CFS: SysRq-N
      is not always effective in normalizing tasks back to SCHED_OTHER.
      
      The reason for that turns out to be the following bug:
      
       - normalize_rt_tasks() uses for_each_process() to iterate through all
         tasks in the system.  The problem is, this method does not iterate
         through all tasks, it iterates through all thread groups.
      
      The proper mechanism to enumerate over all threads is to use a
      do_each_thread() + while_each_thread() loop.
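
      The difference between the two iteration styles can be shown with a
      small user-space model. The structures and function names below are
      hypothetical and only mimic the shape of the problem: walking group
      leaders alone misses every non-leader thread.

```c
#include <assert.h>
#include <stddef.h>

#define POLICY_OTHER 0
#define POLICY_FIFO  1

/* Hypothetical model: a thread group is a NULL-terminated list. */
struct thread {
	int policy;
	struct thread *next_in_group;
};

struct process {
	struct thread *leader;		/* first thread of the group */
	struct process *next;
};

/* Buggy variant: visits one thread per group, like for_each_process(). */
static int normalize_leaders_only(struct process *procs)
{
	int changed = 0;

	for (struct process *p = procs; p; p = p->next) {
		assert(p->leader != NULL);
		if (p->leader->policy != POLICY_OTHER) {
			p->leader->policy = POLICY_OTHER;
			changed++;
		}
	}
	return changed;
}

/* Fixed variant: visits every thread of every group. */
static int normalize_all_threads(struct process *procs)
{
	int changed = 0;

	for (struct process *p = procs; p; p = p->next) {
		assert(p->leader != NULL);
		for (struct thread *t = p->leader; t; t = t->next_in_group) {
			if (t->policy != POLICY_OTHER) {
				t->policy = POLICY_OTHER;
				changed++;
			}
		}
	}
	return changed;
}
```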
      
      Reported-by: Gene Heskett <gene.heskett@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Fix signalfd interaction with thread-private signals · caec4e8d
      Benjamin Herrenschmidt authored
      
      
      Don't let signalfd dequeue private signals off other threads (in the
      case of things like SIGILL or SIGSEGV, trying to do so would result
      in undefined behaviour on who actually gets the signal, since they
      are force unblocked).
      
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Acked-by: Davide Libenzi <davidel@xmailserver.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Revert "futex_requeue_pi optimization" · bd197234
      Thomas Gleixner authored
      This reverts commit d0aa7a70.
      
      It not only introduced user space visible changes to the futex syscall,
      it is also non-functional and there is no way to fix it properly before
      the 2.6.22 release.
      
      The breakage report (http://lkml.org/lkml/2007/5/12/17) went
      unanswered, and unfortunately it turned out that the concept is not
      feasible at all.  It violates the rtmutex semantics badly by introducing
      a virtual owner, which hacks around the coupling of the user-space
      pi_futex and the kernel internal rt_mutex representation.
      
      At the moment the only safe option is to remove it fully as it contains
      user-space visible changes to broken kernel code, which we do not want
      to expose in the 2.6.22 release.
      
      The patch reverts the original patch mostly 1:1, but contains a couple
      of trivial manual cleanups which were necessary due to patches, which
      touched the same area of code later.
      
      Verified against the glibc tests and my own PI futex tests.
      
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Ulrich Drepper <drepper@redhat.com>
      Cc: Pierre Peiffer <pierre.peiffer@bull.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. Jun 16, 2007
    • swsusp: Fix userland interface · 2f41dddb
      Rafael J. Wysocki authored
      
      
      Fix oops caused by 'cat /dev/snapshot', reported by Arkadiusz Miskiewicz,
      and make it impossible to thaw tasks with the help of the swsusp userland
      interface while there is a snapshot image ready to save.
      
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Acked-by: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cpuset: zero malloc - fix for old cpusets · 3e903e7b
      Paul Jackson authored
      
      
      The cpuset code to present a list of tasks using a cpuset to user space could
      write to an array that it had kmalloc'd, after a kmalloc request of zero size.
      
      The problem was that the code didn't check for writes past the allocated end
      of the array until -after- the first write.
      
      This is a race condition that is likely rare -- it would only show up if a
      cpuset went from being empty to having a task in it, during the brief time
      between the allocation and the first write.
      
      Prior to roughly 2.6.22 kernels, this was also a benign problem, because a
      zero kmalloc returned a few usable bytes anyway, and no harm was done with the
      bogus write.
      
      With the 2.6.22 kernel changes that issue a warning if code tries to write
      to the location returned from a zero size allocation, this problem is no
      longer benign.  This cpuset code would occasionally trigger that warning.
      
      The fix is trivial -- check before storing into the array, not after, whether
      the array is big enough to hold the store.
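
      As an illustration of the corrected ordering, here is a small
      user-space sketch. collect_pids() is a hypothetical helper, not the
      cpuset code: the capacity check comes before every store, so a
      zero-sized buffer is never written to.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy up to "cap" entries from src into buf and return how many were
 * stored. The bound is checked BEFORE each store; the buggy ordering
 * stored first and checked afterwards, which writes one element even
 * when cap is zero.
 */
static size_t collect_pids(int *buf, size_t cap, const int *src, size_t n)
{
	size_t stored = 0;

	assert(src != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (stored >= cap)	/* check first ... */
			break;
		buf[stored++] = src[i];	/* ... then store */
	}
	return stored;
}
```

      With cap equal to zero the loop exits before touching buf, which is the
      case the zero-size allocation warning was catching.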
      
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Serge E. Hallyn" <serue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. Jun 09, 2007
    • pi-futex: fix exit races and locking problems · 778e9a9c
      Alexey Kuznetsov authored
      
      
      1. New entries can be added to tsk->pi_state_list after task completed
         exit_pi_state_list(). The result is memory leakage and deadlocks.
      
      2. handle_mm_fault() is called under spinlock. The result is obvious.
      
      3. Self-inflicted deadlock inside glibc.
         Sometimes futex_lock_pi returns -ESRCH when it is not expected,
         and glibc enters a for(;;) sleep() loop to simulate deadlock. This problem
         is quite obvious and I think the patch is right. Though it looks like
         each "if" in futex_lock_pi() got some stupid special case "else if". :-)
      
      4. sometimes futex_lock_pi() returns -EDEADLK,
         when nobody has the lock. The reason is also obvious (see comment
         in the patch), but correct fix is far beyond my comprehension.
         I guess someone already saw this, the chunk:
      
                              if (rt_mutex_trylock(&q.pi_state->pi_mutex))
                                      ret = 0;
      
         is obviously from the same opera. But it does not work, because the
         rtmutex is really taken at this point: wake_futex_pi() of previous
         owner reassigned it to us. My fix works. But it looks very stupid.
         I would think about removal of shift of ownership in wake_futex_pi()
         and making all the work in context of process taking lock.
      
      From: Thomas Gleixner <tglx@linutronix.de>
      
      Fix 1) Avoid the tasklist lock variant of the exit race fix by adding
          an additional state transition to the exit code.
      
          This fixes also the issue, when a task with recursive segfaults
          is not able to release the futexes.
      
      Fix 2) Cleanup the lookup_pi_state() failure path and solve the -ESRCH
          problem finally.
      
      Fix 3) Solve the fixup_pi_state_owner() problem which needs to do the fixup
          in the lock protected section by using the in_atomic userspace access
          functions.
      
          This removes also the ugly lock drop / unqueue inside of fixup_pi_state()
      
      Fix 4) Fix a stale lock in the error path of futex_wake_pi()
      
      Added some error checks for verification.
      
      The -EDEADLK problem is solved by the rtmutex fixups.
      
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Cc: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • rt-mutex: fix chain walk early wakeup bug · 1a539a87
      Thomas Gleixner authored
      
      
      Alexey Kuznetsov found some problems in the pi-futex code.
      
      One of the root causes is:
      
      When a wakeup happens, we do not stop the chain walk, so we follow a
      locking chain that is no longer relevant.
      
      Drop out when this happens.
      
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Cc: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • rt-mutex: fix stale return value · c0d1d2bf
      Thomas Gleixner authored
      
      
      Alexey Kuznetsov found some problems in the pi-futex code.
      
      The major problem is a stale return value in rt_mutex_slowlock():
      
      When the pi chain walk returns -EDEADLK, but the waiter was woken up during
      the phases where the locks were dropped, the rtmutex could be acquired, but
      due to the stale return value -EDEADLK is still returned to the caller.
      
      Reset the return value in the retry path.
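
      A deliberately simplified model of the bug and the fix, assuming the
      waiter always acquires the lock once it has been woken. slowlock_sim()
      and its parameters are hypothetical and only mirror the control flow
      described above, not the actual rt_mutex_slowlock() code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * walk_ret:            result of the pi chain walk (0 or a negative errno)
 * woken_during_walk:   the waiter was woken while the locks were dropped
 * reset_on_retry:      apply the fix (clear the stale value before retrying)
 *
 * Returns what the caller of the slow path would see.
 */
static int slowlock_sim(int walk_ret, bool woken_during_walk,
			bool reset_on_retry)
{
	int ret = walk_ret;

	assert(walk_ret <= 0);

	/* No wakeup: the chain walk result is the real outcome. */
	if (!woken_during_walk)
		return ret;

	/* Retry path: the previous owner handed the lock over to us. */
	if (reset_on_retry)
		ret = 0;	/* the fix: drop the stale error code */

	/* Lock is acquired at this point; ret is returned to the caller. */
	return ret;
}
```

      Without the reset, the caller sees -EDEADLK even though it now owns the
      lock; with it, the successful acquisition is reported as 0.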
      
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Cc: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. Jun 07, 2007
    • Restrict clearing TIF_SIGPENDING · b74d0deb
      Roland McGrath authored
      
      
      This patch should get a few birds.  It prevents sigaction calls from
      clearing TIF_SIGPENDING in other threads, which could leak -ERESTART*.
      It also fixes ptrace_stop not to clear it, which, when done at the
      syscall exit stop, could leak -ERESTART*.  It probably removes the harm
      from signalfd, at least assuming it never calls dequeue_signal on kernel
      threads that might have used block_all_signals.
      
      Signed-off-by: Roland McGrath <roland@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. Jun 01, 2007
  9. May 30, 2007
  10. May 24, 2007
  11. May 21, 2007
    • Detach sched.h from mm.h · e8edc6e0
      Alexey Dobriyan authored
      
      
      The first thing mm.h does is include sched.h, solely for the can_do_mlock()
      inline function, which dereferences "current". By dealing with can_do_mlock(),
      mm.h can be detached from sched.h, which is good. See below for why.
      
      This patch
      a) removes unconditional inclusion of sched.h from mm.h
      b) makes can_do_mlock() normal function in mm/mlock.c
      c) exports can_do_mlock() to not break compilation
      d) adds sched.h inclusions back to files that were getting it indirectly.
      e) adds less bloated headers to some files (asm/signal.h, jiffies.h) that were
         getting them indirectly
      
      Net result is:
      a) mm.h users would get less code to open, read, preprocess, parse, ... if
         they don't need sched.h
      b) sched.h stops being a dependency for a significant number of files:
         on x86_64 allmodconfig, touching sched.h results in a recompile of 4083
         files; after the patch it's only 3744 (-8.3%).
      
      Cross-compile tested on
      
      	all arm defconfigs, all mips defconfigs, all powerpc defconfigs,
      	alpha alpha-up
      	arm
      	i386 i386-up i386-defconfig i386-allnoconfig
      	ia64 ia64-up
      	m68k
      	mips
      	parisc parisc-up
      	powerpc powerpc-up
      	s390 s390-up
      	sparc sparc-up
      	sparc64 sparc64-up
      	um-x86_64
      	x86_64 x86_64-up x86_64-defconfig x86_64-allnoconfig
      
      as well as my two usual configs.
      
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. May 17, 2007
  13. May 16, 2007
  14. May 15, 2007