Skip to content
  1. Oct 02, 2006
    • Eric W. Biederman's avatar
      [PATCH] pid: implement signal functions that take a struct pid * · c4b92fc1
      Eric W. Biederman authored
      
      
      Currently the signal functions all either take a task or a pid_t argument.
      This patch implements variants that take a struct pid *.  After all of the
      users have been update it is my intention to remove the variants that take a
      pid_t as using pid_t can be more work (an extra hash table lookup) and
      difficult to get right in the presence of multiple pid namespaces.
      
      There are two kinds of functions introduced in this patch.  The are the
      general use functions kill_pgrp and kill_pid which take a priv argument that
      is ultimately used to create the appropriate siginfo information, Then there
      are _kill_pgrp_info, kill_pgrp_info, kill_pid_info the internal implementation
      helpers that take an explicit siginfo.
      
      The distinction is made because filling out an explcit siginfo is tricky, and
      will be even more tricky when pid namespaces are introduced.
      
      Signed-off-by: default avatarEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      c4b92fc1
    • Eric W. Biederman's avatar
      [PATCH] proc: readdir race fix (take 3) · 0804ef4b
      Eric W. Biederman authored
      
      
      The problem: An opendir, readdir, closedir sequence can fail to report
      process ids that are continually in use throughout the sequence of system
      calls.  For this race to trigger the process that proc_pid_readdir stops at
      must exit before readdir is called again.
      
      This can cause ps to fail to report processes, and it is in violation of
      posix guarantees and normal application expectations with respect to
      readdir.
      
      Currently there is no way to work around this problem in user space short
      of providing a gargantuan buffer to user space so the directory read all
      happens in on system call.
      
      This patch implements the normal directory semantics for proc, that
      guarantee that a directory entry that is neither created nor destroyed
      while reading the directory entry will be returned.  For directory that are
      either created or destroyed during the readdir you may or may not see them.
       Furthermore you may seek to a directory offset you have previously seen.
      
      These are the guarantee that ext[23] provides and that posix requires, and
      more importantly that user space expects.  Plus it is a simple semantic to
      implement reliable service.  It is just a matter of calling readdir a
      second time if you are wondering if something new has show up.
      
      These better semantics are implemented by scanning through the pids in
      numerical order and by making the file offset a pid plus a fixed offset.
      
      The pid scan happens on the pid bitmap, which when you look at it is
      remarkably efficient for a brute force algorithm.  Given that a typical
      cache line is 64 bytes and thus covers space for 64*8 == 200 pids.  There
      are only 40 cache lines for the entire 32K pid space.  A typical system
      will have 100 pids or more so this is actually fewer cache lines we have to
      look at to scan a linked list, and the worst case of having to scan the
      entire pid bitmap is pretty reasonable.
      
      If we need something more efficient we can go to a more efficient data
      structure for indexing the pids, but for now what we have should be
      sufficient.
      
      In addition this takes no additional locks and is actually less code than
      what we are doing now.
      
      Also another very subtle bug in this area has been fixed.  It is possible
      to catch a task in the middle of de_thread where a thread is assuming the
      thread of it's thread group leader.  This patch carefully handles that case
      so if we hit it we don't fail to return the pid, that is undergoing the
      de_thread dance.
      
      Thanks to KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> for
      providing the first fix, pointing this out and working on it.
      
      [oleg@tv-sign.ru: fix it]
      Signed-off-by: default avatarEric W. Biederman <ebiederm@xmission.com>
      Acked-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: default avatarOleg Nesterov <oleg@tv-sign.ru>
      Cc: Jean Delvare <jdelvare@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      0804ef4b
    • Randy Dunlap's avatar
      [PATCH] list module taint flags in Oops/panic · 2bc2d61a
      Randy Dunlap authored
      
      
      When listing loaded modules during an oops or panic, also list each
      module's Tainted flags if non-zero (P: Proprietary or F: Forced load only).
      
      If a module is did not taint the kernel, it is just listed like
      	usbcore
      but if it did taint the kernel, it is listed like
      	wizmodem(PF)
      
      Example:
      [ 3260.121718] Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
      [ 3260.121729]  [<ffffffff8804c099>] :dump_test:proc_dump_test+0x99/0xc8
      [ 3260.121742] PGD fe8d067 PUD 264a6067 PMD 0
      [ 3260.121748] Oops: 0002 [1] SMP
      [ 3260.121753] CPU 1
      [ 3260.121756] Modules linked in: dump_test(P) snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device ide_cd generic ohci1394 snd_hda_intel snd_hda_codec snd_pcm snd_timer snd ieee1394 snd_page_alloc piix ide_core arcmsr aic79xx scsi_transport_spi usblp
      [ 3260.121785] Pid: 5556, comm: bash Tainted: P      2.6.18-git10 #1
      
      [Alternatively, I can look into listing tainted flags with 'lsmod',
      but that won't help in oopsen/panics so much.]
      
      [akpm@osdl.org: cleanup]
      Signed-off-by: default avatarRandy Dunlap <rdunlap@xenotime.net>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      2bc2d61a
  2. Oct 01, 2006
    • Andi Kleen's avatar
      [PATCH] Support piping into commands in /proc/sys/kernel/core_pattern · d025c9db
      Andi Kleen authored
      
      
      Using the infrastructure created in previous patches implement support to
      pipe core dumps into programs.
      
      This is done by overloading the existing core_pattern sysctl
      with a new syntax:
      
      |program
      
      When the first character of the pattern is a '|' the kernel will instead
      threat the rest of the pattern as a command to run.  The core dump will be
      written to the standard input of that program instead of to a file.
      
      This is useful for having automatic core dump analysis without filling up
      disks.  The program can do some simple analysis and save only a summary of
      the core dump.
      
      The core dump proces will run with the privileges and in the name space of
      the process that caused the core dump.
      
      I also increased the core pattern size to 128 bytes so that longer command
      lines fit.
      
      Most of the changes comes from allowing core dumps without seeks.  They are
      fairly straight forward though.
      
      One small incompatibility is that if someone had a core pattern previously
      that started with '|' they will get suddenly new behaviour.  I think that's
      unlikely to be a real problem though.
      
      Additional background:
      
      > Very nice, do you happen to have a program that can accept this kind of
      > input for crash dumps?  I'm guessing that the embedded people will
      > really want this functionality.
      
      I had a cheesy demo/prototype.  Basically it wrote the dump to a file again,
      ran gdb on it to get a backtrace and wrote the summary to a shared directory.
      Then there was a simple CGI script to generate a "top 10" crashes HTML
      listing.
      
      Unfortunately this still had the disadvantage to needing full disk space for a
      dump except for deleting it afterwards (in fact it was worse because over the
      pipe holes didn't work so if you have a holey address map it would require
      more space).
      
      Fortunately gdb seems to be happy to handle /proc/pid/fd/xxx input pipes as
      cores (at least it worked with zsh's =(cat core) syntax), so it would be
      likely possible to do it without temporary space with a simple wrapper that
      calls it in the right way.  I ran out of time before doing that though.
      
      The demo prototype scripts weren't very good.  If there is really interest I
      can dig them out (they are currently on a laptop disk on the desk with the
      laptop itself being in service), but I would recommend to rewrite them for any
      serious application of this and fix the disk space problem.
      
      Also to be really useful it should probably find a way to automatically fetch
      the debuginfos (I cheated and just installed them in advance).  If nobody else
      does it I can probably do the rewrite myself again at some point.
      
      My hope at some point was that desktops would support it in their builtin
      crash reporters, but at least the KDE people I talked too seemed to be happy
      with their user space only solution.
      
      Alan sayeth:
      
        I don't believe that piping as such as neccessarily the right model, but
        the ability to intercept and processes core dumps from user space is asked
        for by many enterprise users as well.  They want to know about, capture,
        analyse and process core dumps, often centrally and in automated form.
      
      [akpm@osdl.org: loff_t != unsigned long]
      Signed-off-by: default avatarAndi Kleen <ak@suse.de>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      d025c9db
    • Andi Kleen's avatar
      [PATCH] Create call_usermodehelper_pipe() · e239ca54
      Andi Kleen authored
      
      
      A new member in the ever growing family of call_usermode* functions is
      born.  The new call_usermodehelper_pipe() function allows to pipe data to
      the stdin of the called user mode progam and behaves otherwise like the
      normal call_usermodehelp() (except that it always waits for the child to
      finish)
      
      Signed-off-by: default avatarAndi Kleen <ak@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      e239ca54
    • Dave Hansen's avatar
      [PATCH] r/o bind mount prepwork: inc_nlink() helper · d8c76e6f
      Dave Hansen authored
      
      
      This is mostly included for parity with dec_nlink(), where we will have some
      more hooks.  This one should stay pretty darn straightforward for now.
      
      Signed-off-by: default avatarDave Hansen <haveblue@us.ibm.com>
      Acked-by: default avatarChristoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      d8c76e6f
    • Jay Lan's avatar
      [PATCH] csa accounting taskstats update · db5fed26
      Jay Lan authored
      
      
      ChangeLog:
         Feedbacks from Andrew Morton:
         - define TS_COMM_LEN to 32
         - change acct_stimexpd field of task_struct to be of
           cputime_t, which is to be used to save the tsk->stime
           of last timer interrupt update.
         - a new Documentation/accounting/taskstats-struct.txt
           to describe fields of taskstats struct.
      
         Feedback from Balbir Singh:
         - keep the stime of a task to be zero when both stime
           and utime are zero as recoreded in task_struct.
      
         Misc:
         - convert accumulated RSS/VM from platform dependent
           pages-ticks to MBytes-usecs in the kernel
      
      Cc: Shailabh Nagar <nagar@watson.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Jes Sorensen <jes@sgi.com>
      Cc: Chris Sturtivant <csturtiv@sgi.com>
      Cc: Tony Ernst <tee@sgi.com>
      Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      db5fed26
    • Jay Lan's avatar
      [PATCH] csa: convert CONFIG tag for extended accounting routines · 8f0ab514
      Jay Lan authored
      
      
      There were a few accounting data/macros that are used in CSA but are #ifdef'ed
      inside CONFIG_BSD_PROCESS_ACCT.  This patch is to change those ifdef's from
      CONFIG_BSD_PROCESS_ACCT to CONFIG_TASK_XACCT.  A few defines are moved from
      kernel/acct.c and include/linux/acct.h to kernel/tsacct.c and
      include/linux/tsacct_kern.h.
      
      Signed-off-by: default avatarJay Lan <jlan@sgi.com>
      Cc: Shailabh Nagar <nagar@watson.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Jes Sorensen <jes@sgi.com>
      Cc: Chris Sturtivant <csturtiv@sgi.com>
      Cc: Tony Ernst <tee@sgi.com>
      Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      8f0ab514
    • Jay Lan's avatar
      [PATCH] csa: Extended system accounting over taskstats · 9acc1853
      Jay Lan authored
      
      
      Add extended system accounting handling over taskstats interface.  A
      CONFIG_TASK_XACCT flag is created to enable the extended accounting code.
      
      Signed-off-by: default avatarJay Lan <jlan@sgi.com>
      Cc: Shailabh Nagar <nagar@watson.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Jes Sorensen <jes@sgi.com>
      Cc: Chris Sturtivant <csturtiv@sgi.com>
      Cc: Tony Ernst <tee@sgi.com>
      Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      9acc1853
    • Jay Lan's avatar
      [PATCH] csa: basic accounting over taskstats · f3cef7a9
      Jay Lan authored
      
      
      Add some basic accounting fields to the taskstats struct, add a new
      kernel/tsacct.c to handle basic accounting data handling upon exit.  A handle
      is added to taskstats.c to invoke the basic accounting data handling.
      
      Signed-off-by: default avatarJay Lan <jlan@sgi.com>
      Cc: Shailabh Nagar <nagar@watson.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Jes Sorensen <jes@sgi.com>
      Cc: Chris Sturtivant <csturtiv@sgi.com>
      Cc: Tony Ernst <tee@sgi.com>
      Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
      Cc: "Michal Piotrowski" <michal.k.k.piotrowski@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      f3cef7a9
    • Balbir Singh's avatar
      [PATCH] Fix taskstats size calculation (use the new genetlink utility functions) · 0ae64684
      Balbir Singh authored
      
      
      The addition of the CSA patch pushed the size of struct taskstats to 256
      bytes.  This exposed a problem with prepare_reply(), we were not allocating
      space for the netlink and genetlink header.  It worked earlier because
      alloc_skb() would align the skb to SMP_CACHE_BYTES, which added some additonal
      bytes.
      
      Signed-off-by: default avatarBalbir Singh <balbir@in.ibm.com>
      Cc: Jamal Hadi <hadi@cyberus.ca>
      Cc: Shailabh Nagar <nagar@watson.ibm.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jay Lan <jlan@engr.sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      0ae64684
    • Atsushi Nemoto's avatar
      [PATCH] kill wall_jiffies · 8ef38609
      Atsushi Nemoto authored
      
      
      With 2.6.18-rc4-mm2, now wall_jiffies will always be the same as jiffies.
      So we can kill wall_jiffies completely.
      
      This is just a cleanup and logically should not change any real behavior
      except for one thing: RTC updating code in (old) ppc and xtensa use a
      condition "jiffies - wall_jiffies == 1".  This condition is never met so I
      suppose it is just a bug.  I just remove that condition only instead of
      kill the whole "if" block.
      
      [heiko.carstens@de.ibm.com: s390 build fix and cleanup]
      Signed-off-by: default avatarAtsushi Nemoto <anemo@mba.ocn.ne.jp>
      Cc: Andi Kleen <ak@muc.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Ian Molton <spyro@f2s.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Hirokazu Takata <takata.hirokazu@renesas.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
      Cc: Richard Curnow <rc@rc0.org.uk>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Roman Zippel <zippel@linux-m68k.org>
      Signed-off-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      8ef38609
    • Adrian Bunk's avatar
      [PATCH] kernel/time/ntp.c: possible cleanups · 70bc42f9
      Adrian Bunk authored
      
      
      This patch contains the following possible cleanups:
      - make the following needlessly global function static:
        - ntp_update_frequency()
      - make the following needlessly global variables static:
        - time_state
        - time_offset
        - time_constant
        - time_reftime
      - remove the following read-only global variable:
        - time_precision
      
      Signed-off-by: default avatarAdrian Bunk <bunk@stusta.de>
      Cc: Roman Zippel <zippel@linux-m68k.org>
      Cc: john stultz <johnstul@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      70bc42f9
    • Roman Zippel's avatar
      [PATCH] ntp: convert to the NTP4 reference model · f1992393
      Roman Zippel authored
      
      
      This converts the kernel ntp model into a model which matches the nanokernel
      reference implementations.  The previous patches already increased the
      resolution and precision of the computations, so that this conversion becomes
      quite simple.
      
      <linux@horizon.com> explains:
      
      The original NTP kernel interface was defined in units of microseconds.
      That's what Linux implements.  As computers have gotten faster and can now
      split microseconds easily, a new kernel interface using nanosecond units was
      defined ("the nanokernel", confusing as that name is to OS hackers), and
      there's an STA_NANO bit in the adjtimex() status field to tell the application
      which units it's using.
      
      The current ntpd supports both, but Linux loses some possible timing
      resolution because of quantization effects, and the ntpd hackers would really
      like to be able to drop the backwards compatibility code.
      
      Ulrich Windl has been maintaining a patch set to do the conversion for years,
      but it's hard to keep in sync.
      
      Signed-off-by: default avatarRoman Zippel <zippel@linux-m68k.org>
      Cc: john stultz <johnstul@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      f1992393
    • Roman Zippel's avatar
      [PATCH] ntp: convert time_freq to nsec value · 04b617e7
      Roman Zippel authored
      
      
      This converts time_freq to a scaled nsec value and adds around 6bit of extra
      resolution.  This pushes the time_freq to its 32bit limits so the calculatons
      have to be done with 64bit.
      
      Signed-off-by: default avatarRoman Zippel <zippel@linux-m68k.org>
      Cc: john stultz <johnstul@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      04b617e7
    • Roman Zippel's avatar
      [PATCH] ntp: remove time_tolerance · 97eebe13
      Roman Zippel authored
      
      
      time_tolerance isn't changed at all in the kernel, so simply remove it, this
      simplifies the next patch, as it avoids a number of conversions.
      
      Signed-off-by: default avatarRoman Zippel <zippel@linux-m68k.org>
      Cc: john stultz <johnstul@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      97eebe13
    • Roman Zippel's avatar
      [PATCH] ntp: add time_adjust to tick length · 8f807f8d
      Roman Zippel authored
      
      
      This folds update_ntp_one_tick() into second_overflow() and adds time_adjust
      to the tick length, this makes time_next_adjust unnecessary.  This slightly
      changes the adjtime() behaviour, instead of applying it to the next tick, it's
      applied to the next second.
      
      Signed-off-by: default avatarRoman Zippel <zippel@linux-m68k.org>
      Cc: john stultz <johnstul@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      8f807f8d
    • Roman Zippel's avatar
      [PATCH] ntp: prescale time_offset · 3d3675cc
      Roman Zippel authored
      
      
      This converts time_offset into a scaled per tick value.  This avoids now
      completely the crude compensation in second_overflow().
      
      Signed-off-by: default avatarRoman Zippel <zippel@linux-m68k.org>
      Cc: john stultz <johnstul@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      3d3675cc
    • Roman Zippel's avatar
      [PATCH] ntp: add time_freq to tick length · dc6a43e4
      Roman Zippel authored
      
      
      This adds the frequency part to ntp_update_frequency().
      
      Signed-off-by: default avatarRoman Zippel <zippel@linux-m68k.org>
      Cc: john stultz <johnstul@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      dc6a43e4
    • Roman Zippel's avatar
      [PATCH] ntp: add time_adj to tick length · ab8783b6
      Roman Zippel authored
      
      
      This makes time_adj local to second_overflow() and integrates it into the tick
      length instead of adding it everytime.
      
      Signed-off-by: default avatarRoman Zippel <zippel@linux-m68k.org>
      Cc: john stultz <johnstul@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      ab8783b6
    • Roman Zippel's avatar
      [PATCH] ntp: add ntp_update_frequency · b0ee7556
      Roman Zippel authored
      
      
      This introduces ntp_update_frequency() and deinlines ntp_clear() (as it's not
      performance critical).  ntp_update_frequency() calculates the base tick length
      using tick_usec and adds a base adjustment, in case the frequency doesn't
      divide evenly by HZ.
      
      Signed-off-by: default avatarRoman Zippel <zippel@linux-m68k.org>
      Cc: john stultz <johnstul@us.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      b0ee7556
    • John Stultz's avatar
      [PATCH] NTP: Move all the NTP related code to ntp.c · 4c7ee8de
      John Stultz authored
      
      
      Move all the NTP related code to ntp.c
      
      [akpm@osdl.org: cleanups, build fix]
      Signed-off-by: default avatarJohn Stultz <johnstul@us.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Roman Zippel <zippel@linux-m68k.org>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      4c7ee8de
    • Martin Schwidefsky's avatar
      [PATCH] Directed yield: cpu_relax variants for spinlocks and rw-locks · ef6edc97
      Martin Schwidefsky authored
      
      
      On systems running with virtual cpus there is optimization potential in
      regard to spinlocks and rw-locks.  If the virtual cpu that has taken a lock
      is known to a cpu that wants to acquire the same lock it is beneficial to
      yield the timeslice of the virtual cpu in favour of the cpu that has the
      lock (directed yield).
      
      With CONFIG_PREEMPT="n" this can be implemented by the architecture without
      common code changes.  Powerpc already does this.
      
      With CONFIG_PREEMPT="y" the lock loops are coded with _raw_spin_trylock,
      _raw_read_trylock and _raw_write_trylock in kernel/spinlock.c.  If the lock
      could not be taken cpu_relax is called.  A directed yield is not possible
      because cpu_relax doesn't know anything about the lock.  To be able to
      yield the lock in favour of the current lock holder variants of cpu_relax
      for spinlocks and rw-locks are needed.  The new _raw_spin_relax,
      _raw_read_relax and _raw_write_relax primitives differ from cpu_relax
      insofar that they have an argument: a pointer to the lock structure.
      
      Signed-off-by: default avatarMartin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      ef6edc97
    • Cal Peake's avatar
      [PATCH] CodingStyle cleanup for kernel/sys.c · 756184b7
      Cal Peake authored
      
      
      Fix up kernel/sys.c to be consistent with CodingStyle and the rest of the
      file.
      
      Signed-off-by: default avatarCal Peake <cp@absolutedigital.net>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      756184b7
    • Arjan van de Ven's avatar
      [PATCH] maximum latency tracking infrastructure · 5c87579e
      Arjan van de Ven authored
      
      
      Add infrastructure to track "maximum allowable latency" for power saving
      policies.
      
      The reason for adding this infrastructure is that power management in the
      idle loop needs to make a tradeoff between latency and power savings
      (deeper power save modes have a longer latency to running code again).  The
      code that today makes this tradeoff just does a rather simple algorithm;
      however this is not good enough: There are devices and use cases where a
      lower latency is required than that the higher power saving states provide.
       An example would be audio playback, but another example is the ipw2100
      wireless driver that right now has a very direct and ugly acpi hook to
      disable some higher power states randomly when it gets certain types of
      error.
      
      The proposed solution is to have an interface where drivers can
      
      * announce the maximum latency (in microseconds) that they can deal with
      * modify this latency
      * give up their constraint
      
      and a function where the code that decides on power saving strategy can
      query the current global desired maximum.
      
      This patch has a user of each side: on the consumer side, ACPI is patched
      to use this, on the producer side the ipw2100 driver is patched.
      
      A generic maximum latency is also registered of 2 timer ticks (more and you
      lose accurate time tracking after all).
      
      While the existing users of the patch are x86 specific, the infrastructure
      is not.  I'd like to ask the arch maintainers of other architectures if the
      infrastructure is generic enough for their use (assuming the architecture
      has such a tradeoff as concept at all), and the sound/multimedia driver
      owners to look at the driver facing API to see if this is something they
      can use.
      
      [akpm@osdl.org: cleanups]
      Signed-off-by: default avatarArjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      Acked-by: default avatarJesse Barnes <jesse.barnes@intel.com>
      Cc: "Brown, Len" <len.brown@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      5c87579e
  3. Sep 30, 2006
    • David Howells's avatar
      [PATCH] BLOCK: Make it possible to disable the block layer [try #6] · 9361401e
      David Howells authored
      
      
      Make it possible to disable the block layer.  Not all embedded devices require
      it, some can make do with just JFFS2, NFS, ramfs, etc - none of which require
      the block layer to be present.
      
      This patch does the following:
      
       (*) Introduces CONFIG_BLOCK to disable the block layer, buffering and blockdev
           support.
      
       (*) Adds dependencies on CONFIG_BLOCK to any configuration item that controls
           an item that uses the block layer.  This includes:
      
           (*) Block I/O tracing.
      
           (*) Disk partition code.
      
           (*) All filesystems that are block based, eg: Ext3, ReiserFS, ISOFS.
      
           (*) The SCSI layer.  As far as I can tell, even SCSI chardevs use the
           	 block layer to do scheduling.  Some drivers that use SCSI facilities -
           	 such as USB storage - end up disabled indirectly from this.
      
           (*) Various block-based device drivers, such as IDE and the old CDROM
           	 drivers.
      
           (*) MTD blockdev handling and FTL.
      
           (*) JFFS - which uses set_bdev_super(), something it could avoid doing by
           	 taking a leaf out of JFFS2's book.
      
       (*) Makes most of the contents of linux/blkdev.h, linux/buffer_head.h and
           linux/elevator.h contingent on CONFIG_BLOCK being set.  sector_div() is,
           however, still used in places, and so is still available.
      
       (*) Also made contingent are the contents of linux/mpage.h, linux/genhd.h and
           parts of linux/fs.h.
      
       (*) Makes a number of files in fs/ contingent on CONFIG_BLOCK.
      
       (*) Makes mm/bounce.c (bounce buffering) contingent on CONFIG_BLOCK.
      
       (*) set_page_dirty() doesn't call __set_page_dirty_buffers() if CONFIG_BLOCK
           is not enabled.
      
       (*) fs/no-block.c is created to hold out-of-line stubs and things that are
           required when CONFIG_BLOCK is not set:
      
           (*) Default blockdev file operations (to give error ENODEV on opening).
      
       (*) Makes some /proc changes:
      
           (*) /proc/devices does not list any blockdevs.
      
           (*) /proc/diskstats and /proc/partitions are contingent on CONFIG_BLOCK.
      
       (*) Makes some compat ioctl handling contingent on CONFIG_BLOCK.
      
       (*) If CONFIG_BLOCK is not defined, makes sys_quotactl() return -ENODEV if
           given command other than Q_SYNC or if a special device is specified.
      
       (*) In init/do_mounts.c, no reference is made to the blockdev routines if
           CONFIG_BLOCK is not defined.  This does not prohibit NFS roots or JFFS2.
      
       (*) The bdflush, ioprio_set and ioprio_get syscalls can now be absent (return
           error ENOSYS by way of cond_syscall if so).
      
       (*) The seclvl_bd_claim() and seclvl_bd_release() security calls do nothing if
           CONFIG_BLOCK is not set, since they can't then happen.
      
      Signed-Off-By: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      9361401e
    • David Howells's avatar
      [PATCH] BLOCK: Move extern declarations out of fs/*.c into header files [try #6] · 07f3f05c
      David Howells authored
      
      
      Create a new header file, fs/internal.h, for common definitions local to the
      sources in the fs/ directory.
      
      Move extern definitions that should be in header files from fs/*.c to
      fs/internal.h or other main header files where they span directories.
      
      Signed-Off-By: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      07f3f05c
    • David Howells's avatar
      [PATCH] BLOCK: Remove duplicate declaration of exit_io_context() [try #6] · 0d67a46d
      David Howells authored
      
      
      Remove the duplicate declaration of exit_io_context() from linux/sched.h.
      
      Signed-Off-By: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      0d67a46d
  4. Sep 29, 2006
    • Andi Kleen's avatar
    • Andi Kleen's avatar
      [PATCH] x86: Clean up x86 NMI sysctls · 29cbc78b
      Andi Kleen authored
      
      
      Use prototypes in headers
      Don't define panic_on_unrecovered_nmi for all architectures
      
      Cc: dzickus@redhat.com
      
      Signed-off-by: default avatarAndi Kleen <ak@suse.de>
      29cbc78b
    • Paul Jackson's avatar
      [PATCH] cpuset: fix obscure attach_task vs exiting race · 181b6480
      Paul Jackson authored
      
      
      Fix obscure race condition in kernel/cpuset.c attach_task() code.
      
      There is basically zero chance of anyone accidentally being harmed by this
      race.
      
      It requires a special 'micro-stress' load and a special timing loop hacks
      in the kernel to hit in less than an hour, and even then you'd have to hit
      it hundreds or thousands of times, followed by some unusual and senseless
      cpuset configuration requests, including removing the top cpuset, to cause
      any visibly harm affects.
      
      One could, with perhaps a few days or weeks of such effort, get the
      reference count on the top cpuset below zero, and manage to crash the
      kernel by asking to remove the top cpuset.
      
      I found it by code inspection.
      
      The race was introduced when 'the_top_cpuset_hack' was introduced, and one
      piece of code was not updated.  An old check for a possibly null task
      cpuset pointer needed to be changed to a check for a task marked
      PF_EXITING.  The pointer can't be null anymore, thanks to
      the_top_cpuset_hack (documented in kernel/cpuset.c).  But the task could
      have gone into PF_EXITING state after it was found in the task_list scan.
      
      If a task is PF_EXITING in this code, it is possible that its task->cpuset
      pointer is pointing to the top cpuset due to the_top_cpuset_hack, rather
      than because the top_cpuset was that tasks last valid cpuset.  In that
      case, the wrong cpuset reference counter would be decremented.
      
      The fix is trivial.  Instead of failing the system call if the tasks cpuset
      pointer is null here, fail it if the task is in PF_EXITING state.
      
      The code for 'the_top_cpuset_hack' that changes an exiting tasks cpuset to
      the top_cpuset is done without locking, so could happen at anytime.  But it
      is done during the exit handling, after the PF_EXITING flag is set.  So if
      we verify that a task is still not PF_EXITING after we copy out its cpuset
      pointer (into 'oldcs', below), we know that 'oldcs' is not one of these
      hack references to the top_cpuset.
      
      Signed-off-by: default avatarPaul Jackson <pj@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      181b6480
    • Ingo Molnar's avatar
      [PATCH] lockdep core: improve the lock-chain-hash · 03cbc358
      Ingo Molnar authored
      
      
      With CONFIG_DEBUG_LOCK_ALLOC turned off i was getting sporadic failures in
      the locking self-test:
      
        ------------>
        | Locking API testsuite:
        ----------------------------------------------------------------------------
                                         | spin |wlock |rlock |mutex | wsem | rsem |
          --------------------------------------------------------------------------
                             A-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
                         A-B-B-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
                     A-B-B-C-C-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
                     A-B-C-A-B-C deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
                 A-B-B-C-C-D-D-A deadlock:  ok  |FAILED|  ok  |  ok  |  ok  |  ok  |
                 A-B-C-D-B-D-D-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
                 A-B-C-D-B-C-D-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |FAILED|
      
      after much debugging it turned out to be caused by accidental chain-hash
      key collisions.  The current hash is:
      
       #define iterate_chain_key(key1, key2) \
      	(((key1) << MAX_LOCKDEP_KEYS_BITS/2) ^ \
      	((key1) >> (64-MAX_LOCKDEP_KEYS_BITS/2)) ^ \
       	(key2))
      
      where MAX_LOCKDEP_KEYS_BITS is 11.  This hash is pretty good as it will
      shift by 5 bits in every iteration, where every new ID 'mixed' into the
      hash would have up to 11 bits.  But because there was a 6 bits overlap
      between subsequent IDs and their high bits tended to be similar, there was
      a chance for accidental chain-hash collision for a low number of locks
      held.
      
      the solution is to shift by 11 bits:
      
       #define iterate_chain_key(key1, key2) \
      	(((key1) << MAX_LOCKDEP_KEYS_BITS) ^ \
      	((key1) >> (64-MAX_LOCKDEP_KEYS_BITS)) ^ \
       	(key2))
      
      This keeps the hash perfect up to 5 locks held, but even above that the
      hash is still good because 11 bits is a relative prime to the total 64
      bits, so a complete match will only occur after 64 held locks (which doesnt
      happen in Linux).  Even after 5 locks held, entropy of the 5 IDs mixed into
      the hash is already good enough so that overlap doesnt generate a colliding
      hash ID.
      
      with this change the false positives went away.
      
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      03cbc358
    • Alan Cox's avatar
      [PATCH] audit/accounting: tty locking · eb84a20e
      Alan Cox authored
      
      
      Add tty locking around the audit and accounting code.
      
      The whole current->signal-> locking is all deeply strange but it's for
      someone else to sort out.  Add rather than replace the lock for acct.c
      
      Signed-off-by: default avatarAlan Cox <alan@redhat.com>
      Acked-by: default avatarArjan van de Ven <arjan@linux.intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      eb84a20e
    • Rusty Russell's avatar
      [PATCH] stop_machine.c copyright · e5582ca2
      Rusty Russell authored
      
      
      I had to look back: this code was extracted from the module.c code in 2005.
      
      Signed-off-by: default avatarRusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      e5582ca2
    • Ian S. Nelson's avatar
      [PATCH] /sys/modules: allow full length section names · 04b1db9f
      Ian S. Nelson authored
      
      
      I've been using systemtap for some debugging and I noticed that it can't
      probe a lot of modules.  Turns out it's kind of silly, the sections section
      of /sys/module is limited to 32byte filenames and many of the actual
      sections are a a bit longer than that.
      
      [akpm@osdl.org: rewrite to use dymanic allocation]
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      04b1db9f
    • Paul Jackson's avatar
      [PATCH] cpuset: hotunplug cpus and mems in all cpusets · b1aac8bb
      Paul Jackson authored
      
      
      The cpuset code handling hot unplug of CPUs or Memory Nodes was incorrect -
      it could remove a CPU or Node from the top cpuset, while leaving it still
      in some child cpusets.
      
      One basic rule of cpusets is that each cpusets cpus and mems are subsets of
      its parents.  The cpuset hot unplug code violated this rule.
      
      So the cpuset hotunplug handler must walk down the tree, removing any
      removed CPU or Node from all cpusets.
      
      However, it is not allowed to make a cpusets cpus or mems become empty.
      They can only transition from empty to non-empty, not back.
      
      So if the last CPU or Node would be removed from a cpuset by the above
      walk, we scan back up the cpuset hierarchy, finding the nearest ancestor
      that still has something online, and copy its CPU or Memory placement.
      
      Signed-off-by: default avatarPaul Jackson <pj@sgi.com>
      Cc: Nathan Lynch <ntl@pobox.com>
      Cc: Anton Blanchard <anton@samba.org>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      b1aac8bb
    • Paul Jackson's avatar
      [PATCH] cpuset: top_cpuset tracks hotplug changes to node_online_map · 38837fc7
      Paul Jackson authored
      
      
      Change the list of memory nodes allowed to tasks in the top (root) nodeset
      to dynamically track what cpus are online, using a call to a cpuset hook
      from the memory hotplug code.  Make this top cpus file read-only.
      
      On systems that have cpusets configured in their kernel, but that aren't
      actively using cpusets (for some distros, this covers the majority of
      systems) all tasks end up in the top cpuset.
      
      If that system does support memory hotplug, then these tasks cannot make
      use of memory nodes that are added after system boot, because the memory
      nodes are not allowed in the top cpuset.  This is a surprising regression
      over earlier kernels that didn't have cpusets enabled.
      
      One key motivation for this change is to remain consistent with the
      behaviour for the top_cpuset's 'cpus', which is also read-only, and which
      automatically tracks the cpu_online_map.
      
      This change also has the minor benefit that it fixes a long standing,
      little noticed, minor bug in cpusets.  The cpuset performance tweak to
      short circuit the cpuset_zone_allowed() check on systems with just a single
      cpuset (see 'number_of_cpusets', in linux/cpuset.h) meant that simply
      changing the 'mems' of the top_cpuset had no affect, even though the change
      (the write system call) appeared to succeed.  With the following change,
      that write to the 'mems' file fails -EACCES, and the 'mems' file stubbornly
      refuses to be changed via user space writes.  Thus no one should be mislead
      into thinking they've changed the top_cpusets's 'mems' when in affect they
      haven't.
      
      In order to keep the behaviour of cpusets consistent between systems
      actively making use of them and systems not using them, this patch changes
      the behaviour of the 'mems' file in the top (root) cpuset, making it read
      only, and making it automatically track the value of node_online_map.  Thus
      tasks in the top cpuset will have automatic use of hot plugged memory nodes
      allowed by their cpuset.
      
      [akpm@osdl.org: build fix]
      [bunk@stusta.de: build fix]
      Signed-off-by: default avatarPaul Jackson <pj@sgi.com>
      Signed-off-by: default avatarAdrian Bunk <bunk@stusta.de>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      38837fc7
    • Oleg Nesterov's avatar
      [PATCH] introduce TASK_DEAD state · c394cc9f
      Oleg Nesterov authored
      
      
      I am not sure about this patch, I am asking Ingo to take a decision.
      
      task_struct->state == EXIT_DEAD is a very special case, to avoid a confusion
      it makes sense to introduce a new state, TASK_DEAD, while EXIT_DEAD should
      live only in ->exit_state as documented in sched.h.
      
      Note that this state is not visible to user-space, get_task_state() masks off
      unsuitable states.
      
      Signed-off-by: default avatarOleg Nesterov <oleg@tv-sign.ru>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      c394cc9f
    • Oleg Nesterov's avatar
      [PATCH] kill PF_DEAD flag · 55a101f8
      Oleg Nesterov authored
      
      
      After the previous change (->flags & PF_DEAD) <=> (->state == EXIT_DEAD), we
      don't need PF_DEAD any longer.
      
      Signed-off-by: default avatarOleg Nesterov <oleg@tv-sign.ru>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      55a101f8
    • Oleg Nesterov's avatar
      [PATCH] set EXIT_DEAD state in do_exit(), not in schedule() · 29b88492
      Oleg Nesterov authored
      
      
      schedule() checks PF_DEAD on every context switch and sets ->state = EXIT_DEAD
      to ensure that the exiting task will be deactivated.  Note that this EXIT_DEAD
      is in fact a "random" value, we can use any bit except normal TASK_XXX values.
      
      It is better to set this state in do_exit() along with PF_DEAD flag and remove
      that check in schedule().
      
      We are safe wrt concurrent try_to_wake_up() (for example ptrace, tkill), it
      can not change task's ->state: the 'state' argument of try_to_wake_up() can't
      have EXIT_DEAD bit.  And in case when try_to_wake_up() sees a stale value of
      ->state == TASK_RUNNING it will do nothing.
      
      Signed-off-by: default avatarOleg Nesterov <oleg@tv-sign.ru>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      29b88492
Loading