  1. Jun 22, 2005
    • [PATCH] uml: make hw_controller_type->release exist only for archs needing it · b77d6adc
      Paolo 'Blaisorblade' Giarrusso authored
      
      
      With Chris Wedgwood <cw@f00f.org>
      
      As suggested by Chris, we can make the just-added ->release method
      conditional on the archs requesting it (i.e. currently only UML), so
      that other archs don't get this unneeded crud, and if UML ever stops
      needing it we can kill it.
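
      A minimal sketch of the arrangement, assuming a hypothetical guard
      macro name (the actual symbol the patch uses may differ):

      	struct hw_interrupt_type_sketch {
      		const char *typename;
      		void (*shutdown)(unsigned int irq);
      	#ifdef ARCH_HAS_IRQ_RELEASE_METHOD	/* hypothetical guard */
      		/* Only archs defining the guard (currently UML) carry
      		 * this extra slot; everyone else compiles it away. */
      		void (*release)(unsigned int irq, void *dev_id);
      	#endif
      	};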
      
      Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      CC: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] uml: add and use generic hw_controller_type->release · dbce706e
      Paolo 'Blaisorblade' Giarrusso authored
      
      
      With Chris Wedgwood <cw@f00f.org>
      
      Currently UML must explicitly call the UML-specific
      free_irq_by_irq_and_dev() for each free_irq() call it makes.
      
      This is needed because ->shutdown and/or ->disable are only called when the
      last "action" for that irq is removed.
      
      Instead, for UML shared IRQs (UML IRQs are very often, if not always,
      shared), some setup is done for each dev_id, and it must be cleared
      when that fd is released.  For instance, each open console requests a
      new instance (i.e. a new dev_id) of the same IRQ.
      
      Specifically, an fd is stored in an array (pollfds), which is later
      read by a host thread and passed to poll().  Each event registered by
      poll() triggers an interrupt.  So, for each free_irq() we must remove
      the corresponding host fd from the table, which we do via this
      ->release() method.
      
      In this patch we add an appropriate hook for this, and remove all the
      explicit calls by pointing the hook at the said procedure; this is
      safe to do since the hook is invoked exactly where the explicit calls
      used to be.
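
      As a rough illustration of the flow described above (struct names,
      fields and layout are mine, not the kernel's):

      	struct irq_type_sketch {
      		void (*shutdown)(unsigned int irq);
      		/* called once per removed dev_id, unlike ->shutdown,
      		 * which only runs when the last action goes away */
      		void (*release)(unsigned int irq, void *dev_id);
      	};

      	struct irq_desc_sketch {
      		struct irq_type_sketch *handler;
      		int nr_actions;
      	};

      	void free_irq_sketch(struct irq_desc_sketch *desc,
      			     unsigned int irq, void *dev_id)
      	{
      		/* ... unlink the irqaction matching dev_id ... */
      		desc->nr_actions--;

      		/* per-dev_id teardown: UML points this at the code
      		 * removing the matching host fd from pollfds */
      		if (desc->handler->release)
      			desc->handler->release(irq, dev_id);

      		if (desc->nr_actions == 0 && desc->handler->shutdown)
      			desc->handler->shutdown(irq);
      	}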
      
      Also some cosmetic improvements are included.
      
      This is heavily based on some work by Chris Wedgwood, who however
      didn't get the patch merged because of what I'd call a
      "misunderstanding" (the need for this patch wasn't cleanly explained,
      so adding the generic hook was felt to be undesirable).
      
      Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      CC: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] dup_mmap: update comment on new vma · 45918e1a
      Hugh Dickins authored
      
      
      Remove part of comment on linking new vma in dup_mmap: since anon_vma rmap
      came in, try_to_unmap_one knows the vma without needing find_vma.  But add
      a comment to note that here vma is inserted without mmap_sem.
      
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Avoiding mmap fragmentation · 1363c3cd
      Wolfgang Wander authored
      
      
      Ingo recently introduced a great speedup for allocating new mmaps using the
      free_area_cache pointer which boosts the specweb SSL benchmark by 4-5% and
      causes huge performance increases in thread creation.
      
      The downside of this patch is that it does lead to fragmentation in the
      mmap-ed areas (visible via /proc/self/maps), such that some applications
      that work fine under 2.4 kernels quickly run out of memory on any 2.6
      kernel.
      
      The problem is twofold:
      
        1) the free_area_cache is used to continue a search for memory where
           the last search ended.  Before the change new areas were always
           searched from the base address on.
      
           So new small areas now clutter holes of all sizes throughout
           the whole mmap-able region, whereas before, small requests
           tended to fill the holes near the base, leaving the holes far
           from the base large and available for larger requests.
      
        2) the free_area_cache is also set to the location of the last
           munmap-ed area, so in scenarios where we allocate e.g. five
           regions of 1K each and then free regions 4, 2, 3 in this order,
           the next request for 1K will be placed in the position of the
           old region 3, whereas before we appended it to the still active
           region 1, placing it at the location of the old region 2.
           Before, we had one free region of 2K; now we only get two free
           regions of 1K -> fragmentation.
      
      The patch addresses these issues by introducing yet another cache
      descriptor, cached_hole_size, that contains the largest known hole
      size below the current free_area_cache.  When a new request comes in,
      its size is compared against cached_hole_size; if the request could
      be filled with a hole below free_area_cache, the search is started
      from the base instead.
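
      A compilable sketch of that search policy; the struct and constant
      are simplified stand-ins for the real mm code:

      	#define UNMAPPED_BASE_SKETCH 0x40000000UL

      	struct mm_sketch {
      		unsigned long free_area_cache;  /* where the last search ended */
      		unsigned long cached_hole_size; /* largest known hole below it */
      	};

      	/* Pick the address at which to start searching for len bytes. */
      	static unsigned long search_start(struct mm_sketch *mm,
      					  unsigned long len)
      	{
      		/* A hole this big may exist below free_area_cache, so
      		 * restart from the base and let small requests fill
      		 * the low holes first. */
      		if (len <= mm->cached_hole_size) {
      			mm->cached_hole_size = 0;
      			mm->free_area_cache = UNMAPPED_BASE_SKETCH;
      		}
      		return mm->free_area_cache;
      		/* The vma walk that follows would record, in
      		 * cached_hole_size, the largest hole still too small
      		 * for this request. */
      	}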
      
      The results look promising: 2.6.12-rc4 fragments quickly (my earlier
      posted leakme.c test program terminates after 50000+ iterations with
      96 distinct and fragmented maps in /proc/self/maps) but performs
      nicely, as expected, at thread creation: Ingo's test_str02 with 20000
      threads requires 0.7s system time.
      
      Taking out Ingo's patch (un-patch available per request) by basically
      deleting all mentions of free_area_cache from the kernel and starting
      the search for new memory always at the respective bases, we observe:
      leakme terminates successfully with 11 distinct, hardly fragmented
      areas in /proc/self/maps, but thread creation is grindingly slow:
      30+s(!) system time for Ingo's test_str02 with 20000 threads.
      
      Now, drumroll ;-), the appended patch works fine with leakme: it ends
      with only 7 distinct areas in /proc/self/maps, and thread creation
      also seems sufficiently fast at 0.71s for 20000 threads.
      
      Signed-off-by: Wolfgang Wander <wwc@rentec.com>
      Credit-to: "Richard Purdie" <rpurdie@rpsys.net>
      Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
      Acked-by: Ingo Molnar <mingo@elte.hu> (partly)
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] VM: early zone reclaim · 753ee728
      Martin Hicks authored
      This is the core of the (much simplified) early reclaim.  The goal of this
      patch is to reclaim some easily-freed pages from a zone before falling back
      onto another zone.
      
      One of the major uses of this is NUMA machines.  With the default allocator
      behavior the allocator would look for memory in another zone, which might be
      off-node, before trying to reclaim from the current zone.
      
      This adds a zone tuneable to enable early zone reclaim.  It is selected on a
      per-zone basis and is turned on/off via syscall.
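
      Roughly, the fallback order described above looks like this (a
      hedged sketch with stand-in names, not the actual page allocator):

      	struct zone_sketch {
      		int reclaim_pages;	/* per-zone early-reclaim toggle */
      	};

      	extern void *try_alloc(struct zone_sketch *z);   /* stand-ins */
      	extern int zone_reclaim_sketch(struct zone_sketch *z);

      	void *alloc_pages_sketch(struct zone_sketch **zones, int n)
      	{
      		int i;

      		for (i = 0; i < n; i++) {
      			void *page = try_alloc(zones[i]);
      			if (page)
      				return page;
      			/* With early reclaim on, free some easily-freed
      			 * pages in this zone before falling back to the
      			 * next (possibly off-node) zone. */
      			if (zones[i]->reclaim_pages &&
      			    zone_reclaim_sketch(zones[i])) {
      				page = try_alloc(zones[i]);
      				if (page)
      					return page;
      			}
      		}
      		return NULL;	/* fall through to the full reclaim path */
      	}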
      
      Adding some extra throttling on the reclaim was also required (patch
      4/4); without it, the machine would grind to a crawl when doing a
      "make -j" kernel build.  Even with this patch the system time is
      higher on average, but it seems tolerable.  Here are some numbers for
      kernbench runs on a 2-node, 4-CPU, 8GB RAM Altix in the "make -j"
      run:
      
                              wall  user  sys   %cpu  ctx sw.  sleeps
                              ----  ----  ---   ----  -------  ------
      No patch                1009  1384  847    258   298170  504402
      w/patch, no reclaim      880  1376  667    288   254064  396745
      w/patch & reclaim       1079  1385  926    252   291625  548873
      
      These numbers are the average of 2 runs of 3 "make -j" runs done
      right after system boot.  Run-to-run variability for "make -j" is
      huge, so these numbers aren't terribly useful except to see that with
      reclaim the benchmark still finishes in a reasonable amount of time.
      
      I also looked at the NUMA hit/miss stats for the "make -j" runs and the
      reclaim doesn't make any difference when the machine is thrashing away.
      
      Doing a "make -j8" on a single node that is filled with page cache pages
      takes 700 seconds with reclaim turned on and 735 seconds without reclaim
      (due to remote memory accesses).
      
      The simple zone_reclaim syscall program is at
      http://www.bork.org/~mort/sgi/zone_reclaim.c
      
      
      
      Signed-off-by: Martin Hicks <mort@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] smp_processor_id() cleanup · 39c715b7
      Ingo Molnar authored
      
      
      This patch implements a number of smp_processor_id() cleanup ideas that
      Arjan van de Ven and I came up with.
      
      The previous __smp_processor_id/_smp_processor_id/smp_processor_id
      API spaghetti was hard to follow on both the implementation and the
      usage side.

      Some of the complexity arose from picking wrong names; some of it
      comes from the fact that not all architectures defined
      __smp_processor_id.
      
      In the new code, there are two externally visible symbols:
      
       - smp_processor_id(): debug variant.
      
       - raw_smp_processor_id(): nondebug variant. Replaces all existing
         uses of _smp_processor_id() and __smp_processor_id(). Defined
         by every SMP architecture in include/asm-*/smp.h.
      
      There is one new internal symbol, dependent on DEBUG_PREEMPT:
      
       - debug_smp_processor_id(): internal debug variant, mapped to
                                   smp_processor_id().
      
      Also, I moved debug_smp_processor_id() from lib/kernel_lock.c into a
      new lib/smp_processor_id.c file.  All related comments got updated
      and/or clarified.
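
      The resulting shape, roughly (the raw variant shown is one common
      per-arch definition; treat the details as illustrative):

      	/* include/asm-*/smp.h: every SMP arch defines the raw variant */
      	#define raw_smp_processor_id() (current_thread_info()->cpu)

      	#ifdef CONFIG_DEBUG_PREEMPT
      	extern unsigned int debug_smp_processor_id(void);
      	# define smp_processor_id() debug_smp_processor_id() /* checked */
      	#else
      	# define smp_processor_id() raw_smp_processor_id()   /* plain */
      	#endif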
      
      I have build/boot tested the following 8 .config combinations on x86:
      
       {SMP,UP} x {PREEMPT,!PREEMPT} x {DEBUG_PREEMPT,!DEBUG_PREEMPT}
      
      I have also build/boot tested x64 on UP/PREEMPT/DEBUG_PREEMPT.  (Other
      architectures are untested, but should work just fine.)
      
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Arjan van de Ven <arjan@infradead.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  2. Jun 20, 2005
  3. Jun 17, 2005
  4. Jun 14, 2005
  5. May 31, 2005
    • [PATCH] flush icache in correct context · ae92ef8a
      Roman Zippel authored
      
      
      flush_icache_range() is used in two different situations: in
      binfmt_elf.c & co. for user-space mappings, and in module.c for
      kernel modules.  On m68k, flush_icache_range() doesn't know which
      data to flush, as it has separate address spaces and the pointer
      argument can be valid in either address space.
      
      First I considered splitting flush_icache_range(), but this patch is
      simpler.  Setting the correct context gives flush_icache_range() enough
      information to flush the correct data.
      
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  6. May 28, 2005
    • [PATCH] drop note_interrupt() for per-CPU for proper scaling · b60c1f6f
      John Hawkes authored
      
      
      The "unhandled interrupts" catcher, note_interrupt(), increments a global
      desc->irq_count and grossly damages scaling of very large systems, e.g.,
      >192p ia64 Altix, because of this highly contented cacheline, especially
      for timer interrupts.  384p is severely crippled, and 512p is unuseable.
      
      All calls to note_interrupt() can be disabled by booting with "noirqdebug",
      but this disables the useful interrupt checking for all interrupts.
      
      I propose eliminating note_interrupt() for all per-CPU interrupts.
      This was the behavior of linux-2.6.10 and earlier, but in 2.6.11 a
      code restructuring added a call to note_interrupt() for per-CPU
      interrupts.  Besides, note_interrupt() is a bit racy for concurrent
      CPU calls anyway, as the desc->irq_count++ increment isn't atomic
      (and making it atomic would make scaling even worse).
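
      The change amounts to a guard like the following in the irq path
      (a sketch; flag and parameter names are approximate):

      	/* Skip the unhandled-irq bookkeeping for per-CPU interrupts
      	 * so their shared desc cacheline is never written here. */
      	if (!noirqdebug && !(desc->status & IRQ_PER_CPU))
      		note_interrupt(irq, desc, action_ret, regs);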
      
      Signed-off-by: John Hawkes <hawkes@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  7. May 27, 2005
    • [PATCH] cpuset exit NULL dereference fix · 2efe86b8
      Paul Jackson authored
      
      
      There is a race in the kernel cpuset code, between the code
      to handle notify_on_release, and the code to remove a cpuset.
      The notify_on_release code can end up trying to access a
      cpuset that has been removed.  In the most common case, this
      causes a NULL pointer dereference from the routine cpuset_path.
      However all manner of bad things are possible, in theory at least.
      
      The existing code decrements the cpuset use count, and if the
      count goes to zero, processes the notify_on_release request,
      if appropriate.  However, once the count goes to zero, unless we
      are holding the global cpuset_sem semaphore, there is nothing to
      stop another task from immediately removing the cpuset entirely,
      and recycling its memory.
      
      The obvious fix would be to always hold the cpuset_sem
      semaphore while decrementing the use count and dealing with
      notify_on_release.  However we don't want to force a global
      semaphore into the mainline task exit path, as that might create
      a scaling problem.
      
      The actual fix is almost as easy: since this is only an issue for
      cpusets using notify_on_release, which the top-level big cpusets
      don't normally need to use, take the cpuset_sem only for cpusets
      using notify_on_release.
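
      In outline, the release path becomes something like this (a sketch
      with invented names; the real code differs in detail):

      	struct cpuset_sketch {
      		atomic_t count;
      		int notify_on_release;
      	};

      	void cpuset_put_sketch(struct cpuset_sketch *cs)
      	{
      		if (cs->notify_on_release) {
      			/* Only these cpusets pay for the global
      			 * semaphore; holding it keeps the cpuset alive
      			 * until the release check is done. */
      			down(&cpuset_sem);
      			if (atomic_dec_and_test(&cs->count))
      				check_for_release(cs);
      			up(&cpuset_sem);
      		} else {
      			atomic_dec(&cs->count);
      		}
      	}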
      
      This code has been run for hours without a hiccup, while running
      a cpuset create/destroy stress test that could crash the existing
      kernel in seconds.  This patch applies to the current -linus
      git kernel.
      
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Acked-by: Simon Derr <simon.derr@bull.net>
      Acked-by: Dinakar Guniguntala <dino@in.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  8. May 26, 2005
    • AUDIT: Defer freeing aux items until audit_free_context() · 7551ced3
      David Woodhouse authored
      
      
      While they were all just simple blobs it made sense to just free them
      as we walked through and logged them. Now that there are pointers to
      other objects which need refcounting, we might as well revert to
      _only_ logging them in audit_log_exit(), and put the code to free them
      properly in only one place -- in audit_free_aux().
      
      Signed-off-by: David Woodhouse <dwmw2@infradead.org>
  9. May 25, 2005
  10. May 23, 2005
  11. May 21, 2005
    • AUDIT: Assign serial number to non-syscall messages · bfb4496e
      David Woodhouse authored
      
      
      Move audit_serial() into audit.c and use it to generate serial numbers 
      on messages even when there is no audit context from syscall auditing.  
      This allows us to disambiguate audit records when more than one is 
      generated in the same millisecond.
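
      A sketch of what such a serial generator looks like (the actual
      audit_serial() may differ in detail):

      	static unsigned int serial_counter;
      	static spinlock_t serial_lock = SPIN_LOCK_UNLOCKED;

      	unsigned int audit_serial_sketch(void)
      	{
      		unsigned long flags;
      		unsigned int ret;

      		/* A lock-protected counter: records logged in the same
      		 * millisecond still get distinct, ordered serials. */
      		spin_lock_irqsave(&serial_lock, flags);
      		ret = ++serial_counter;
      		spin_unlock_irqrestore(&serial_lock, flags);
      		return ret;
      	}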
      
      Based on a patch by Steve Grubb after he observed the problem.
      
      Signed-off-by: David Woodhouse <dwmw2@infradead.org>
    • [PATCH] spin_unlock_bh() and preempt_check_resched() · 10f02d1c
      Samuel Thibault authored
      
      
      In _spin_unlock_bh(lock):
      	do { \
      		_raw_spin_unlock(lock); \
      		preempt_enable(); \
      		local_bh_enable(); \
      		__release(lock); \
      	} while (0)
      
      there is no reason to use preempt_enable() instead of a simple
      preempt_enable_no_resched().

      Since we know bottom halves are disabled, preempt_schedule() will
      always return at once (preempt_count != 0), and hence
      preempt_check_resched() is useless here...
      
      This fixes it by using preempt_enable_no_resched() instead of
      preempt_enable(), thus avoiding the useless preempt_check_resched()
      just before re-enabling bottom halves.
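
      With the patch, the unlock sequence becomes:

      	/* no resched check on the way out: bottom halves are still
      	 * disabled here, so preempt_schedule() would return at once */
      	do { \
      		_raw_spin_unlock(lock); \
      		preempt_enable_no_resched(); \
      		local_bh_enable(); \
      		__release(lock); \
      	} while (0)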
      
      Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  12. May 20, 2005
  13. May 19, 2005
    • AUDIT: Honour audit_backlog_limit again. · fb19b4c6
      David Woodhouse authored
      
      
      The limit on the number of outstanding audit messages was inadvertently
      removed with the switch to queuing skbs directly for sending by a kernel
      thread. Put it back again.
      
      Signed-off-by: David Woodhouse <dwmw2@infradead.org>
    • AUDIT: Quis Custodiet Ipsos Custodes? · 7ca00264
      David Woodhouse authored
      
      
      Nobody does. Really, it gets very silly if auditd is recording its
      own actions.
      
      Signed-off-by: David Woodhouse <dwmw2@infradead.org>
    • AUDIT: Send netlink messages from a separate kernel thread · b7d11258
      David Woodhouse authored
      
      
      netlink_unicast() will attempt to reallocate, and will free messages
      if the socket's rcvbuf limit is reached, unless we give it an
      infinite timeout.  So do that, from a kernel thread dedicated to
      spewing stuff up the netlink socket.
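
      In outline (a sketch: the queue and wait primitives are abbreviated,
      and the real kauditd differs in detail):

      	static int kauditd_thread_sketch(void *arg)
      	{
      		struct sk_buff *skb;

      		for (;;) {
      			skb = skb_dequeue(&audit_skb_queue);
      			if (skb)
      				/* blocking send: the message waits for
      				 * rcvbuf space instead of being freed */
      				netlink_unicast(audit_sock, skb,
      						audit_pid, 0);
      			else
      				wait_event_interruptible(kauditd_wait,
      					skb_queue_len(&audit_skb_queue));
      		}
      		return 0;
      	}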
      
      Signed-off-by: David Woodhouse <dwmw2@infradead.org>
    • AUDIT: Clean up logging of untrusted strings · 168b7173
      Steve Grubb authored
      
      
      * If vsnprintf returns -1, it will mess up the sk_buff space
      accounting.  This is fixed by not calling skb_put() with bogus len
      values.
      
      * audit_log_hex was a loop that called audit_log_vformat with %02X
      for each character.  This is very inefficient, since converting an
      unsigned character to its ASCII representation is essentially
      masking, shifting, and byte lookups.  Also, the length of the
      converted string is well known: it's twice the original.  Fixed by
      rewriting the function (see the sketch after this list).
      
      * audit_log_untrustedstring had no comments, which makes it hard for
      someone to understand what the string format will be.
      
      * audit_log_d_path was never fixed to use untrustedstring.  This
      could mess up user-space parsers.  This was fixed by making a temp
      buffer, calling d_path(), and logging the temp buffer using
      untrustedstring.
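
      The hex rewrite is simple enough to show directly; a self-contained
      sketch (not the exact kernel function):

      	#include <stddef.h>

      	static const char hex_digits[] = "0123456789ABCDEF";

      	/* Convert len bytes to uppercase hex: the output is exactly
      	 * twice the input, so it's just masks, shifts and table
      	 * lookups, with no per-byte vsnprintf call. */
      	static void log_hex_sketch(const unsigned char *buf, size_t len,
      				   char *out)
      	{
      		size_t i;

      		for (i = 0; i < len; i++) {
      			out[2 * i]     = hex_digits[buf[i] >> 4];
      			out[2 * i + 1] = hex_digits[buf[i] & 0x0f];
      		}
      		out[2 * len] = '\0';
      	}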
      
      From: Steve Grubb <sgrubb@redhat.com>
      Signed-off-by: David Woodhouse <dwmw2@infradead.org>
  14. May 18, 2005
  15. May 17, 2005
  16. May 13, 2005
  17. May 11, 2005