  1. Jan 30, 2009
    • Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments · de33c8db
      Linus Torvalds authored
      
      
      As of commit ba470de4 ("mmap: handle mlocked pages during map, remap,
      unmap") we now use the 'vma' variable at the end of mmap_region() to
      handle the page-in of newly mapped mlocked pages.
      
      However, if we merged adjacent vma's together, the vma we're using may
      be stale.  We historically consciously avoided using it after the merge
      operation, but that got overlooked when redoing the locked page
      handling.
      
      This commit simplifies mmap_region() by doing any vma merges early,
      avoiding the issue entirely, and 'vma' will always be valid.  As pointed
      out by Hugh Dickins, this depends on any drivers that change the page
      offset or flags to have set one of the VM_SPECIAL bits (so that they
      cannot trigger the early merge logic), but that's true in general.
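
The "merge early" idea can be illustrated with a small userspace toy. This is only a sketch under loud assumptions: `toy_vma`, `toy_mm` and `toy_map_region` are hypothetical names, regions are only ever appended at the top of the address space, and none of this is the kernel's actual mmap_region() code. The point it shows is that when the merge is attempted before any new object is created, the pointer handed back always refers to a live region.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a vma: a half-open address range plus flags. */
struct toy_vma {
    unsigned long start, end;
    unsigned long flags;
};

#define TOY_MAX_VMAS 8

struct toy_mm {
    struct toy_vma vmas[TOY_MAX_VMAS]; /* appended in address order */
    size_t count;
};

/*
 * Map [start, end) with the given flags.  The merge is tried first: if the
 * last region ends exactly at 'start' and has identical flags, it is simply
 * extended and no new object exists that could later go stale.  Otherwise a
 * new region is created.  Either way the returned pointer covers the new
 * range, so follow-up work (for example paging in locked pages) is safe.
 * Returns NULL when the toy table is full.
 */
static struct toy_vma *toy_map_region(struct toy_mm *mm, unsigned long start,
                                      unsigned long end, unsigned long flags)
{
    struct toy_vma *vma;

    assert(start < end);

    if (mm->count > 0) {
        struct toy_vma *prev = &mm->vmas[mm->count - 1];

        if (prev->end == start && prev->flags == flags) {
            prev->end = end; /* early merge */
            return prev;
        }
    }

    if (mm->count == TOY_MAX_VMAS)
        return NULL;

    vma = &mm->vmas[mm->count++];
    vma->start = start;
    vma->end = end;
    vma->flags = flags;
    return vma;
}
```

Mapping two adjacent ranges with equal flags yields the same pointer twice, while a range with different flags gets a region of its own.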
      
      Reported-and-tested-by: Maksim Yevmenkin <maksim.yevmenkin@gmail.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      de33c8db
  2. Jan 14, 2009
  3. Jan 08, 2009
    • NOMMU: Make VMAs per MM as for MMU-mode linux · 8feae131
      David Howells authored
      
      
      Make VMAs per mm_struct as for MMU-mode linux.  This solves two problems:
      
       (1) In SYSV SHM where nattch for a segment does not reflect the number of
           shmat's (and forks) done.
      
       (2) In mmap() where the VMA's vm_mm is set to point to the parent mm by an
           exec'ing process when VM_EXECUTABLE is specified, regardless of the fact
           that a VMA might be shared and already have its vm_mm assigned to another
           process or a dead process.
      
      A new struct (vm_region) is introduced to track a mapped region and to remember
      the circumstances under which it may be shared and the vm_list_struct structure
      is discarded as it's no longer required.
      
      This patch makes the following additional changes:
      
       (1) Regions are now allocated with alloc_pages() rather than kmalloc() and
           with no recourse to __GFP_COMP, so the pages are not composite.  Instead,
           each page has a reference on it held by the region.  Anything else that is
           interested in such a page will have to get a reference on it to retain it.
           When the pages are released due to unmapping, each page is passed to
           put_page() and will be freed when the page usage count reaches zero.
      
       (2) Excess pages are trimmed after an allocation as the allocation must be
           made as a power-of-2 quantity of pages.
      
       (3) VMAs are added to the parent MM's R/B tree and mmap lists.  As an MM may
           end up with overlapping VMAs within the tree, the VMA struct address is
           appended to the sort key.
      
       (4) Non-anonymous VMAs are now added to the backing inode's prio list.
      
       (5) Holes may be punched in anonymous VMAs with munmap(), releasing parts of
           the backing region.  The VMA and region structs will be split if
           necessary.
      
       (6) sys_shmdt() only releases one attachment to a SYSV IPC shared memory
           segment instead of all the attachments at that address.  Multiple
           shmat()'s return the same address under NOMMU-mode instead of different
           virtual addresses as under MMU-mode.
      
       (7) Core dumping for ELF-FDPIC requires fewer exceptions for NOMMU-mode.
      
       (8) /proc/maps is now the global list of mapped regions, and may list bits
           that aren't actually mapped anywhere.
      
       (9) /proc/meminfo gains a line (tagged "MmapCopy") that indicates the amount
           of RAM currently allocated by mmap to hold mappable regions that can't be
           mapped directly.  These are copies of the backing device or file if not
           anonymous.
      
      These changes make NOMMU mode more similar to MMU mode.  The downside is that
      NOMMU mode requires some extra memory to track things over NOMMU without this
      patch (VMAs are no longer shared, and there are now region structs).
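
Item (2) in the list of additional changes, trimming excess pages after a power-of-2 allocation, comes down to simple arithmetic. The following is a minimal sketch, assuming a 4096-byte page; `TOY_PAGE_SIZE`, `toy_order_for` and `toy_excess_pages` are hypothetical names for illustration and not the patch's own helpers.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL /* assumed page size for this sketch */

/* Number of pages needed to hold 'len' bytes, rounded up. */
static size_t toy_pages_needed(size_t len)
{
    return (len + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE;
}

/* Smallest order such that (1 << order) pages can hold 'len' bytes. */
static unsigned int toy_order_for(size_t len)
{
    unsigned int order = 0;

    assert(len > 0);
    while (((size_t)1 << order) < toy_pages_needed(len))
        order++;
    return order;
}

/* Pages allocated beyond what is needed, i.e. the ones to trim. */
static size_t toy_excess_pages(size_t len)
{
    return ((size_t)1 << toy_order_for(len)) - toy_pages_needed(len);
}
```

For a five-page request, the allocation is order 3 (eight pages), so three pages are excess and can be given back.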
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-by: Mike Frysinger <vapier.adi@gmail.com>
      Acked-by: Paul Mundt <lethal@linux-sh.org>
      8feae131
  4. Jan 06, 2009
  5. Nov 12, 2008
    • parisc: fix find_extend_vma() breakage · 1c127185
      Denys Vlasenko authored
      
      
      The STACK_GROWSUP case of stack expansion was missing a test for 'prev',
      which got removed by commit cb8f488c
      ("mmap.c: deinline a few functions") by mistake.
      
      I found my original email in "sent" folder. The patch in that mail
      does NOT remove !prev. That change had been added by someone else.
      
      Ok, I think we are not much interested in who did it, let's
      fix it for good.
      
      [ "It looks like this was caused by me fixing rejects.  That was the
        fancy include-lots-of-context-so-it-wont-apply patch." - akpm ]
      
      Reported-and-bisected-by: Helge Deller <deller@gmx.de>
      Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1c127185
  6. Oct 30, 2008
    • nfsd: fix vm overcommit crash · 731572d3
      Alan Cox authored
      
      
      Junjiro R. Okajima reported a problem where knfsd crashes if you are
      using it to export shmemfs objects and run strict overcommit.  In this
      situation the current->mm based modifier to the overcommit goes through a
      NULL pointer.
      
      We could simply check for NULL and skip the modifier but we've caught
      other real bugs in the past from mm being NULL here - cases where we did
      need a valid mm set up (eg the exec bug about a year ago).
      
      To preserve the checks and get the logic we want, shuffle the checking
      around and add a new helper to the vm_ security wrappers.
      
      Also fix a current->mm reference in nommu that should use the passed mm.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix build]
      Reported-by: Junjiro R. Okajima <hooanon05@yahoo.co.jp>
      Acked-by: James Morris <jmorris@namei.org>
      Signed-off-by: Alan Cox <alan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      731572d3
  7. Oct 20, 2008
    • mmap.c: deinline a few functions · cb8f488c
      Denys Vlasenko authored
      
      
      __vma_link_file and expand_downwards functions are not small, yet they
      are marked inline.  They probably had one callsite sometime in the past,
      but now they have more.  In order to prevent a similar thing, I also
      deinlined expand_upwards, despite it having only one callsite.  Nowadays
      gcc auto-inlines such static functions anyway.  In find_extend_vma, I
      removed one extra level of indirection.
      
      Patch is deliberately generated with -U $BIGNUM to make
      it easier to see that functions are big.
      
      Result:
      
      # size */*/mmap.o */vmlinux
         text    data     bss     dec     hex filename
         9514     188      16    9718    25f6 0.org/mm/mmap.o
         9237     188      16    9441    24e1 deinline/mm/mmap.o
      6124402  858996  389480 7372878  70804e 0.org/vmlinux
      6124113  858996  389480 7372589  707f2d deinline/vmlinux
      
      Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cb8f488c
    • mmap: handle mlocked pages during map, remap, unmap · ba470de4
      Rik van Riel authored
      
      
      Originally by Nick Piggin <npiggin@suse.de>
      
      Remove mlocked pages from the LRU using "unevictable infrastructure"
      during mmap(), munmap(), mremap() and truncate().  Try to move back to
      normal LRU lists on munmap() when last mlocked mapping removed.  Remove
      PageMlocked() status when page truncated from file.
      
      [akpm@linux-foundation.org: cleanup]
      [kamezawa.hiroyu@jp.fujitsu.com: fix double unlock_page()]
      [kosaki.motohiro@jp.fujitsu.com: split LRU: munlock rework]
      [lee.schermerhorn@hp.com: mlock: fix __mlock_vma_pages_range comment block]
      [akpm@linux-foundation.org: remove bogus kerneldoc token]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamewzawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ba470de4
    • mlock: mlocked pages are unevictable · b291f000
      Nicholas Piggin authored
      
      
      Make sure that mlocked pages also live on the unevictable LRU, so kswapd
      will not scan them over and over again.
      
      This is achieved through various strategies:
      
      1) add yet another page flag--PG_mlocked--to indicate that
         the page is locked for efficient testing in vmscan and,
         optionally, fault path.  This allows early culling of
         unevictable pages, preventing them from getting to
         page_referenced()/try_to_unmap().  Also allows separate
         accounting of mlock'd pages, as Nick's original patch
         did.
      
         Note:  Nick's original mlock patch used a PG_mlocked
         flag.  I had removed this in favor of the PG_unevictable
         flag + an mlock_count [new page struct member].  I
         restored the PG_mlocked flag to eliminate the new
         count field.
      
      2) add the mlock/unevictable infrastructure to mm/mlock.c,
         with internal APIs in mm/internal.h.  This is a rework
         of Nick's original patch to these files, taking into
         account that mlocked pages are now kept on unevictable
         LRU list.
      
      3) update vmscan.c:page_evictable() to check PageMlocked()
         and, if vma passed in, the vm_flags.  Note that the vma
         will only be passed in for new pages in the fault path;
         and then only if the "cull unevictable pages in fault
         path" patch is included.
      
      4) add try_to_unlock() to rmap.c to walk a page's rmap and
         ClearPageMlocked() if no other vmas have it mlocked.
         Reuses as much of try_to_unmap() as possible.  This
         effectively replaces the use of one of the lru list links
         as an mlock count.  If this mechanism lets pages in mlocked
         vmas leak through w/o PG_mlocked set [I don't know that it
         does], we should catch them later in try_to_unmap().  One
         hopes this will be rare, as it will be relatively expensive.
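
The check described in strategy 3 can be sketched as a small predicate. This is a toy model under stated assumptions: the `toy_` types and flag values are hypothetical, and the real page_evictable() in vmscan.c may test further conditions that are left out here. It only shows the two tests named in the text, the page's mlocked flag and, when a vma is passed in, the vma's flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits for this sketch, not the kernel's values. */
#define TOY_PG_MLOCKED (1UL << 0)
#define TOY_VM_LOCKED  (1UL << 1)

struct toy_page {
    unsigned long flags;
};

struct toy_vma {
    unsigned long vm_flags;
};

/*
 * A page is treated as unevictable if it carries the mlocked flag, or if
 * the caller passed a vma and that vma is locked.  'vma' may be NULL, which
 * models callers outside the fault path that have no vma at hand.
 */
static bool toy_page_evictable(const struct toy_page *page,
                               const struct toy_vma *vma)
{
    assert(page != NULL);

    if (page->flags & TOY_PG_MLOCKED)
        return false;
    if (vma != NULL && (vma->vm_flags & TOY_VM_LOCKED))
        return false;
    return true;
}
```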
      
      Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      
      splitlru: introduce __get_user_pages():
      
        New munlock processing needs GUP_FLAGS_IGNORE_VMA_PERMISSIONS,
        because the current get_user_pages() can't grab PROT_NONE pages and
        therefore PROT_NONE pages can't be munlocked.
      
      [akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
      [akpm@linux-foundation.org: untangle patch interdependencies]
      [akpm@linux-foundation.org: fix things after out-of-order merging]
      [hugh@veritas.com: fix page-flags mess]
      [lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
      [kosaki.motohiro@jp.fujitsu.com: build fix]
      [kosaki.motohiro@jp.fujitsu.com: fix truncate race and several comments]
      [kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b291f000
  8. Sep 04, 2008
  9. Aug 11, 2008
    • mm: fix mm_take_all_locks() locking order · 7cd5a02f
      Peter Zijlstra authored
      
      
      Lockdep spotted:
      
      =======================================================
      [ INFO: possible circular locking dependency detected ]
      2.6.27-rc1 #270
      -------------------------------------------------------
      qemu-kvm/2033 is trying to acquire lock:
       (&inode->i_data.i_mmap_lock){----}, at: [<ffffffff802996cc>] mm_take_all_locks+0xc2/0xea
      
      but task is already holding lock:
       (&anon_vma->lock){----}, at: [<ffffffff8029967a>] mm_take_all_locks+0x70/0xea
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 (&anon_vma->lock){----}:
             [<ffffffff8025cd37>] __lock_acquire+0x11be/0x14d2
             [<ffffffff8025d0a9>] lock_acquire+0x5e/0x7a
             [<ffffffff804c655b>] _spin_lock+0x3b/0x47
             [<ffffffff8029a2ef>] vma_adjust+0x200/0x444
             [<ffffffff8029a662>] split_vma+0x12f/0x146
             [<ffffffff8029bc60>] mprotect_fixup+0x13c/0x536
             [<ffffffff8029c203>] sys_mprotect+0x1a9/0x21e
             [<ffffffff8020c0db>] system_call_fastpath+0x16/0x1b
             [<ffffffffffffffff>] 0xffffffffffffffff
      
      -> #0 (&inode->i_data.i_mmap_lock){----}:
             [<ffffffff8025ca54>] __lock_acquire+0xedb/0x14d2
             [<ffffffff8025d397>] lock_release_non_nested+0x1c2/0x219
             [<ffffffff8025d515>] lock_release+0x127/0x14a
             [<ffffffff804c6403>] _spin_unlock+0x1e/0x50
             [<ffffffff802995d9>] mm_drop_all_locks+0x7f/0xb0
             [<ffffffff802a965d>] do_mmu_notifier_register+0xe2/0x112
             [<ffffffff802a96a8>] mmu_notifier_register+0xe/0x10
             [<ffffffffa0043b6b>] kvm_dev_ioctl+0x11e/0x287 [kvm]
             [<ffffffff802bd0ca>] vfs_ioctl+0x2a/0x78
             [<ffffffff802bd36f>] do_vfs_ioctl+0x257/0x274
             [<ffffffff802bd3e1>] sys_ioctl+0x55/0x78
             [<ffffffff8020c0db>] system_call_fastpath+0x16/0x1b
             [<ffffffffffffffff>] 0xffffffffffffffff
      
      other info that might help us debug this:
      
      5 locks held by qemu-kvm/2033:
       #0:  (&mm->mmap_sem){----}, at: [<ffffffff802a95d0>] do_mmu_notifier_register+0x55/0x112
       #1:  (mm_all_locks_mutex){--..}, at: [<ffffffff8029963e>] mm_take_all_locks+0x34/0xea
       #2:  (&anon_vma->lock){----}, at: [<ffffffff8029967a>] mm_take_all_locks+0x70/0xea
       #3:  (&anon_vma->lock){----}, at: [<ffffffff8029967a>] mm_take_all_locks+0x70/0xea
       #4:  (&anon_vma->lock){----}, at: [<ffffffff8029967a>] mm_take_all_locks+0x70/0xea
      
      stack backtrace:
      Pid: 2033, comm: qemu-kvm Not tainted 2.6.27-rc1 #270
      
      Call Trace:
       [<ffffffff8025b7c7>] print_circular_bug_tail+0xb8/0xc3
       [<ffffffff8025ca54>] __lock_acquire+0xedb/0x14d2
       [<ffffffff80259bb1>] ? add_lock_to_list+0x7e/0xad
       [<ffffffff8029967a>] ? mm_take_all_locks+0x70/0xea
       [<ffffffff8029967a>] ? mm_take_all_locks+0x70/0xea
       [<ffffffff8025d397>] lock_release_non_nested+0x1c2/0x219
       [<ffffffff802996cc>] ? mm_take_all_locks+0xc2/0xea
       [<ffffffff802996cc>] ? mm_take_all_locks+0xc2/0xea
       [<ffffffff8025b202>] ? trace_hardirqs_on_caller+0x4d/0x115
       [<ffffffff802995d9>] ? mm_drop_all_locks+0x7f/0xb0
       [<ffffffff8025d515>] lock_release+0x127/0x14a
       [<ffffffff804c6403>] _spin_unlock+0x1e/0x50
       [<ffffffff802995d9>] mm_drop_all_locks+0x7f/0xb0
       [<ffffffff802a965d>] do_mmu_notifier_register+0xe2/0x112
       [<ffffffff802a96a8>] mmu_notifier_register+0xe/0x10
       [<ffffffffa0043b6b>] kvm_dev_ioctl+0x11e/0x287 [kvm]
       [<ffffffff8033f9f2>] ? file_has_perm+0x83/0x8e
       [<ffffffff802bd0ca>] vfs_ioctl+0x2a/0x78
       [<ffffffff802bd36f>] do_vfs_ioctl+0x257/0x274
       [<ffffffff802bd3e1>] sys_ioctl+0x55/0x78
       [<ffffffff8020c0db>] system_call_fastpath+0x16/0x1b
      
      Which the locking hierarchy in mm/rmap.c confirms as valid.
      
      Fix this by first taking all the mapping->i_mmap_lock instances and then
      take all anon_vma->lock instances.
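
The two-pass ordering of the fix can be modelled in a few lines of userspace C. This is a sketch only: no real locks are taken, the acquisition order is merely recorded, and `toy_vma`, `toy_take_all_locks` and `toy_order_ok` are hypothetical names rather than anything in mm/mmap.c. It shows that walking the vmas twice, once per lock class, never produces an i_mmap_lock acquisition after an anon_vma lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_lock_class { TOY_I_MMAP_LOCK, TOY_ANON_VMA_LOCK };

struct toy_vma {
    bool has_mapping;  /* file-backed: there is a mapping->i_mmap_lock */
    bool has_anon_vma; /* there is an anon_vma->lock */
};

/*
 * Record the order in which lock classes would be acquired.  The first pass
 * takes every i_mmap_lock, the second pass every anon_vma lock.  'out' must
 * have room for 2 * n entries.  Returns the number of locks recorded.
 */
static size_t toy_take_all_locks(const struct toy_vma *vmas, size_t n,
                                 enum toy_lock_class *out)
{
    size_t taken = 0;
    size_t i;

    assert(out != NULL);

    for (i = 0; i < n; i++)
        if (vmas[i].has_mapping)
            out[taken++] = TOY_I_MMAP_LOCK;

    for (i = 0; i < n; i++)
        if (vmas[i].has_anon_vma)
            out[taken++] = TOY_ANON_VMA_LOCK;

    return taken;
}

/* True if no i_mmap_lock is acquired after an anon_vma lock. */
static bool toy_order_ok(const enum toy_lock_class *order, size_t n)
{
    bool seen_anon = false;
    size_t i;

    for (i = 0; i < n; i++) {
        if (order[i] == TOY_ANON_VMA_LOCK)
            seen_anon = true;
        else if (seen_anon)
            return false;
    }
    return true;
}
```

A single pass that takes both locks of each vma in turn would fail `toy_order_ok()` as soon as a file-backed vma follows one with an anon_vma, which is the pattern lockdep reported.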
      
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      7cd5a02f
    • lockdep: annotate mm_take_all_locks() · 454ed842
      Peter Zijlstra authored
      
      
      The nesting is correct due to holding mmap_sem, use the new annotation
      to annotate this.
      
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      454ed842
  10. Aug 05, 2008
    • mm: fix uninitialized variables for find_vma_prepare callers · dfe195fb
      Benny Halevy authored
      
      
      gcc 4.3.0 correctly emits the following warnings.
      When a vma covering addr is found, find_vma_prepare indeed returns without
      setting pprev, rb_link, and rb_parent.
      
        mm/mmap.c: In function `insert_vm_struct':
        mm/mmap.c:2085: warning: `rb_parent' may be used uninitialized in this function
        mm/mmap.c:2085: warning: `rb_link' may be used uninitialized in this function
        mm/mmap.c:2084: warning: `prev' may be used uninitialized in this function
        mm/mmap.c: In function `copy_vma':
        mm/mmap.c:2124: warning: `rb_parent' may be used uninitialized in this function
        mm/mmap.c:2124: warning: `rb_link' may be used uninitialized in this function
        mm/mmap.c:2123: warning: `prev' may be used uninitialized in this function
        mm/mmap.c: In function `do_brk':
        mm/mmap.c:1951: warning: `rb_parent' may be used uninitialized in this function
        mm/mmap.c:1951: warning: `rb_link' may be used uninitialized in this function
        mm/mmap.c:1949: warning: `prev' may be used uninitialized in this function
        mm/mmap.c: In function `mmap_region':
        mm/mmap.c:1092: warning: `rb_parent' may be used uninitialized in this function
        mm/mmap.c:1092: warning: `rb_link' may be used uninitialized in this function
        mm/mmap.c:1089: warning: `prev' may be used uninitialized in this function
      
      Hugh adds: in fact, none of find_vma_prepare's callers use those values
      when a vma is found to be already covering addr, it's either an error or
      an occasion to munmap and repeat.  Okay, let's quieten the compiler (but I
      would prefer it if pprev, rb_link and rb_parent were meaningful in that
      case, rather than whatever's in them from descending the tree).
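
The pattern behind these warnings, a lookup with out-parameters that returned early without storing them, can be shown with a toy lookup over a sorted array. This is a hedged sketch: `toy_vma` and `toy_find_prepare` are hypothetical, the array stands in for the rb-tree, and the code is not the actual find_vma_prepare(). The early exit leaves the loop instead of returning, so the out-parameter is stored on every path.

```c
#include <assert.h>
#include <stddef.h>

struct toy_vma {
    unsigned long start, end; /* half-open range [start, end) */
};

/*
 * Return the first region whose end lies above 'addr', or NULL if there is
 * none.  *pprev always receives the region just before that position (or
 * NULL), including the case where the returned region already covers addr.
 */
static const struct toy_vma *toy_find_prepare(const struct toy_vma *vmas,
                                              size_t n, unsigned long addr,
                                              const struct toy_vma **pprev)
{
    const struct toy_vma *found = NULL;
    const struct toy_vma *prev = NULL;
    size_t i;

    assert(pprev != NULL);

    for (i = 0; i < n; i++) {
        if (vmas[i].end > addr) {
            found = &vmas[i];
            break; /* leave the loop; outputs are stored below */
        }
        prev = &vmas[i];
    }

    *pprev = prev;
    return found;
}
```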
      
      Signed-off-by: Benny Halevy <bhalevy@panasas.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: "Ryan Hope" <rmh3093@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dfe195fb
  11. Jul 28, 2008
    • mmu-notifiers: core · cddb8a5c
      Andrea Arcangeli authored
      
      
      With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to pages.
       There are secondary MMUs (with secondary sptes and secondary tlbs) too.
      sptes in the kvm case are shadow pagetables, but when I say spte in
      mmu-notifier context, I mean "secondary pte".  In GRU case there's no
      actual secondary pte and there's only a secondary tlb because the GRU
      secondary MMU has no knowledge about sptes and every secondary tlb miss
      event in the MMU always generates a page fault that has to be resolved by
      the CPU (this is not the case of KVM where a secondary tlb miss will
      walk sptes in hardware and it will refill the secondary tlb transparently
      to software if the corresponding spte is present).  The same way
      zap_page_range has to invalidate the pte before freeing the page, the spte
      (and secondary tlb) must also be invalidated before any page is freed and
      reused.
      
      Currently we take a page_count pin on every page mapped by sptes, but that
      means the pages can't be swapped whenever they're mapped by any spte
      because they're part of the guest working set.  Furthermore a spte unmap
      event can immediately lead to a page to be freed when the pin is released
      (so requiring the same complex and relatively slow tlb_gather smp safe
      logic we have in zap_page_range and that can be avoided completely if the
      spte unmap event doesn't require an unpin of the page previously mapped in
      the secondary MMU).
      
      The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
      when the VM is swapping or freeing or doing anything on the primary MMU so
      that the secondary MMU code can drop sptes before the pages are freed,
      avoiding all page pinning and allowing 100% reliable swapping of guest
      physical address space.  Furthermore it avoids the code that tears down the
      mappings of the secondary MMU, to implement a logic like tlb_gather in
      zap_page_range that would require many IPI to flush other cpu tlbs, for
      each fixed number of spte unmapped.
      
      To make an example: if what happens on the primary MMU is a protection
      downgrade (from writeable to wrprotect) the secondary MMU mappings will be
      invalidated, and the next secondary-mmu-page-fault will call
      get_user_pages and trigger a do_wp_page through get_user_pages if it
      called get_user_pages with write=1, and it'll re-establish an updated
      spte or secondary-tlb-mapping on the copied page.  Or it will setup a
      readonly spte or readonly tlb mapping if it's a guest-read, if it calls
      get_user_pages with write=0.  This is just an example.
      
      This allows to map any page pointed by any pte (and in turn visible in the
      primary CPU MMU), into a secondary MMU (be it a pure tlb like GRU, or a
      full MMU with both sptes and secondary-tlb like the shadow-pagetable layer
      with kvm), or a remote DMA in software like XPMEM (hence needing of
      schedule in XPMEM code to send the invalidate to the remote node, while no
      need to schedule in kvm/gru as it's an immediate event like invalidating
      primary-mmu pte).
      
      At least for KVM without this patch it's impossible to swap guests
      reliably.  And having this feature and removing the page pin allows
      several other optimizations that simplify life considerably.
      
      Dependencies:
      
      1) mm_take_all_locks() to register the mmu notifier when the whole VM
         isn't doing anything with "mm".  This allows mmu notifier users to keep
         track if the VM is in the middle of the invalidate_range_begin/end
         critical section with an atomic counter increased in range_begin and
         decreased in range_end.  No secondary MMU page fault is allowed to map
         any spte or secondary tlb reference, while the VM is in the middle of
         range_begin/end as any page returned by get_user_pages in that critical
         section could later immediately be freed without any further
         ->invalidate_page notification (invalidate_range_begin/end works on
         ranges and ->invalidate_page isn't called immediately before freeing
         the page).  To stop all page freeing and pagetable overwrites the
         mmap_sem must be taken in write mode and all other anon_vma/i_mmap
         locks must be taken too.
      
      2) It'd be a waste to add branches in the VM if nobody could possibly
         run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be enabled if
         CONFIG_KVM=m/y.  In the current kernel kvm won't yet take advantage of
         mmu notifiers, but this already allows to compile a KVM external module
         against a kernel with mmu notifiers enabled and from the next pull from
         kvm.git we'll start using them.  And GRU/XPMEM will also be able to
         continue the development by enabling KVM=m in their config, until they
         submit all GRU/XPMEM GPLv2 code to the mainline kernel.  Then they can
         also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
         This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
         are all =n.
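
The counter scheme from dependency 1 can be sketched with C11 atomics. This is a minimal userspace model, not the KVM or mmu notifier code: `toy_secondary_mmu` and its helpers are hypothetical names, and a real user would also need a lock or sequence count to close the window between checking the counter and installing a mapping, which this sketch leaves out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_secondary_mmu {
    atomic_int invalidate_count; /* above zero while inside range_begin/end */
};

/* Called from the range_begin notification: enter the critical section. */
static void toy_range_begin(struct toy_secondary_mmu *m)
{
    atomic_fetch_add(&m->invalidate_count, 1);
}

/* Called from the range_end notification: leave the critical section. */
static void toy_range_end(struct toy_secondary_mmu *m)
{
    int old = atomic_fetch_sub(&m->invalidate_count, 1);

    assert(old > 0); /* every end must be paired with an earlier begin */
    (void)old;
}

/*
 * A secondary MMU fault handler would consult this before mapping a page:
 * while any invalidation is in flight, no new spte or secondary tlb entry
 * may be established.
 */
static bool toy_may_map(struct toy_secondary_mmu *m)
{
    return atomic_load(&m->invalidate_count) == 0;
}
```

Because the counter is incremented rather than set, overlapping invalidations nest correctly: mapping is allowed again only after the last range_end.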
      
      The mmu_notifier_register call can fail because mm_take_all_locks may be
      interrupted by a signal and return -EINTR.  Because mmu_notifier_register
      is used when a driver starts up, a failure can be gracefully handled.  Here
      is an example of the change applied to kvm to register the mmu notifiers.
      Usually when a driver starts up other allocations are required anyway and
      -ENOMEM failure paths exist already.
      
       struct  kvm *kvm_arch_create_vm(void)
       {
              struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
      +       int err;
      
              if (!kvm)
                      return ERR_PTR(-ENOMEM);
      
              INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
      
      +       kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
      +       err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
      +       if (err) {
      +               kfree(kvm);
      +               return ERR_PTR(err);
      +       }
      +
              return kvm;
       }
      
      mmu_notifier_unregister returns void and it's reliable.
      
      The patch also adds a few needed but missing includes that would prevent
      kernel to compile after these changes on non-x86 archs (x86 didn't need
      them by luck).
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix mm/filemap_xip.c build]
      [akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
      Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Jack Steiner <steiner@sgi.com>
      Cc: Robin Holt <holt@sgi.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Kanoj Sarcar <kanojsarcar@yahoo.com>
      Cc: Roland Dreier <rdreier@cisco.com>
      Cc: Steve Wise <swise@opengridcomputing.com>
      Cc: Avi Kivity <avi@qumranet.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Anthony Liguori <aliguori@us.ibm.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Cc: Marcelo Tosatti <marcelo@kvack.org>
      Cc: Eric Dumazet <dada1@cosmosbay.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Cc: Izik Eidus <izike@qumranet.com>
      Cc: Anthony Liguori <aliguori@us.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cddb8a5c
    • mmu-notifiers: add mm_take_all_locks() operation · 7906d00c
      Andrea Arcangeli authored
      
      
      mm_take_all_locks holds off reclaim from an entire mm_struct.  This allows
      mmu notifiers to register into the mm at any time with the guarantee that
      no mmu operation is in progress on the mm.
      
      This operation locks against the VM for all pte/vma/mm related operations
      that could ever happen on a certain mm.  This includes vmtruncate,
      try_to_unmap, and all page faults.
      
      The caller must take the mmap_sem in write mode before calling
      mm_take_all_locks().  The caller isn't allowed to release the mmap_sem
      until mm_drop_all_locks() returns.
      
      mmap_sem in write mode is required in order to block all operations that
      could modify pagetables and free pages without need of altering the vma
      layout (for example populate_range() with nonlinear vmas).  It's also
      needed in write mode to avoid new anon_vmas to be associated with existing
      vmas.
      
      A single task can't take more than one mm_take_all_locks() in a row or it
      would deadlock.
      
      mm_take_all_locks() and mm_drop_all_locks() are expensive operations that
      may have to take thousands of locks.
      
      mm_take_all_locks() can fail if it's interrupted by signals.
      
      When mmu_notifier_register returns, we must be sure that the driver is
      notified if some task is in the middle of a vmtruncate for the 'mm' where
      the mmu notifier was registered (mmu_notifier_invalidate_range_start/end
      is run around the vmtruncation but mmu_notifier_register can run after
      mmu_notifier_invalidate_range_start and before
      mmu_notifier_invalidate_range_end).  Same problem for rmap paths.  And
      we've to remove page pinning to avoid replicating the tlb_gather logic
      inside KVM (and GRU doesn't work well with page pinning regardless of
      needing tlb_gather), so without mm_take_all_locks when vmtruncate frees
      the page, kvm would have no way to notice that it mapped into sptes a page
      that is going into the freelist without a chance of any further
      mmu_notifier notification.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Jack Steiner <steiner@sgi.com>
      Cc: Robin Holt <holt@sgi.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Kanoj Sarcar <kanojsarcar@yahoo.com>
      Cc: Roland Dreier <rdreier@cisco.com>
      Cc: Steve Wise <swise@opengridcomputing.com>
      Cc: Avi Kivity <avi@qumranet.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Anthony Liguori <aliguori@us.ibm.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Cc: Marcelo Tosatti <marcelo@kvack.org>
      Cc: Eric Dumazet <dada1@cosmosbay.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Cc: Izik Eidus <izike@qumranet.com>
      Cc: Anthony Liguori <aliguori@us.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7906d00c
  12. Jul 24, 2008
    • hugetlb: modular state for hugetlb page size · a5516438
      Andi Kleen authored
      
      
      The goal of this patchset is to support multiple hugetlb page sizes.  This
      is achieved by introducing a new struct hstate structure, which
      encapsulates the important hugetlb state and constants (e.g. huge page
      size, number of huge pages currently allocated, etc.).
      
      The hstate structure is then passed around to the code that requires these
      fields, so that code does the right thing regardless of the exact hstate it
      is operating on.
      
      This patch adds the hstate structure, with a single global instance of it
      (default_hstate), and does the basic work of converting hugetlb to use the
      hstate.
      
      Future patches will add more hstate structures to allow for different
      hugetlbfs mounts to have different page sizes.
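
      A minimal user-space sketch of the idea, not the kernel's definition: the
      struct layout, the SKETCH_PAGE_SHIFT constant and both helper names are
      assumptions made for illustration.  It shows how code that receives an
      hstate pointer can derive sizes from it instead of from global constants.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12UL	/* assumption: 4 KiB base pages */

/* Hypothetical, cut-down stand-in for the per-size hugetlb state. */
struct hstate_sketch {
	unsigned int order;		/* huge page = 2^order base pages */
	unsigned long nr_huge_pages;	/* huge pages currently allocated */
};

/* Single global instance, mirroring the role of default_hstate. */
static struct hstate_sketch default_hstate_sketch = { .order = 9, .nr_huge_pages = 0 };

/* Size in bytes of one huge page described by h. */
static unsigned long huge_page_size_of(const struct hstate_sketch *h)
{
	assert(h != NULL);
	return 1UL << (h->order + SKETCH_PAGE_SHIFT);
}

/* Number of huge pages needed to cover len bytes, rounding up. */
static unsigned long huge_pages_for(const struct hstate_sketch *h, unsigned long len)
{
	unsigned long size = huge_page_size_of(h);

	return (len + size - 1) / size;
}
```

      Because every helper takes the hstate as a parameter, adding a second
      instance with a different order would need no change to the callers.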
      
      [akpm@linux-foundation.org: coding-style fixes]
      Acked-by: Adam Litke <agl@us.ibm.com>
      Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: Andi Kleen <ak@suse.de>
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a5516438
    • mm: record MAP_NORESERVE status on vmas and fix small page mprotect reservations · cdfd4325
      Andy Whitcroft authored
      
      
      With Mel's hugetlb private reservation support patches applied, strict
      overcommit semantics are applied to both shared and private huge page
      mappings.  This can be a problem if an application relied on unlimited
      overcommit semantics for private mappings.  An example of this would be an
      application which maps a huge area with the intention of using it very
      sparsely.  Such applications would benefit from being able to opt out of
      the strict overcommit.  It should be noted that prior to hugetlb
      supporting demand faulting all mappings were fully populated and so
      applications of this type should be rare.
      
      This patch stack implements the MAP_NORESERVE mmap() flag for huge page
      mappings.  This flag has the same meaning as for small page mappings,
      suppressing reservations for that mapping.
      
      Thanks to Mel Gorman for reviewing a number of early versions of these
      patches.
      
      This patch:
      
      When a small page mapping is created with mmap(), reservations are created
      by default for any memory pages required.  When the region is read/write
      the reservation is increased for every page; no reservation is needed for
      read-only regions (as they implicitly share the zero page).  Reservations
      are tracked via the VM_ACCOUNT vma flag, which is present when the region
      has a reservation backing it.  When we convert a region from read-only to
      read-write, new reservations are acquired and VM_ACCOUNT is set.  However,
      when a read-only map is created with MAP_NORESERVE it is indistinguishable
      from a normal mapping.  When we then convert that to read/write we are
      forced to incorrectly create reservations for it as we have no record of
      the original MAP_NORESERVE.
      
      This patch introduces a new vma flag VM_NORESERVE which records the
      presence of the original MAP_NORESERVE flag.  This allows us to
      distinguish these two circumstances and correctly account the reserve.
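
      The accounting decision can be modelled roughly as follows.  This is a
      user-space sketch under stated assumptions: the SK_* flag values, the
      struct and both function names are invented for illustration and do not
      match the kernel's bit assignments or its mprotect code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag values; the kernel's real bit assignments differ. */
#define SK_VM_WRITE	0x1UL
#define SK_VM_ACCOUNT	0x2UL	/* region has a reservation backing it */
#define SK_VM_NORESERVE	0x4UL	/* mapping was created with MAP_NORESERVE */

struct vma_sketch {
	unsigned long flags;
	unsigned long nr_pages;
};

/* Pages that must be newly reserved when the region becomes writable. */
static unsigned long pages_to_reserve_on_write(const struct vma_sketch *vma)
{
	assert(vma != NULL);
	if (vma->flags & (SK_VM_ACCOUNT | SK_VM_NORESERVE))
		return 0;	/* already accounted, or opted out at mmap() time */
	return vma->nr_pages;
}

/* Model of converting a region from read-only to read-write. */
static unsigned long make_writable(struct vma_sketch *vma)
{
	unsigned long charged = pages_to_reserve_on_write(vma);

	if (charged)
		vma->flags |= SK_VM_ACCOUNT;
	vma->flags |= SK_VM_WRITE;
	return charged;
}
```

      Without the recorded SK_VM_NORESERVE bit, the two read-only mappings in
      this model would look identical and both would be charged.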
      
      As well as fixing this FIXME in the code, this makes it much easier to
      introduce MAP_NORESERVE support for huge pages as this flag is available
      consistently for the life of the mapping.  VM_ACCOUNT on the other hand is
      heavily used at the generic level in association with small pages.
      
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Johannes Weiner <hannes@saeurebad.de>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cdfd4325
    • mm: remove double indirection on tlb parameter to free_pgd_range() & Co · 42b77728
      Jan Beulich authored
      
      
      The double indirection here is not needed anywhere and hence (at least)
      confusing.
      
      Signed-off-by: Jan Beulich <jbeulich@novell.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Acked-by: Jeremy Fitzhardinge <jeremy@goop.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      42b77728
  13. Jul 09, 2008
  14. Jun 06, 2008
  15. May 24, 2008
  16. Apr 29, 2008
    • procfs task exe symlink · 925d1c40
      Matt Helsley authored
      
      
      The kernel implements readlink of /proc/pid/exe by getting the file from
      the first executable VMA.  Then the path to the file is reconstructed and
      reported as the result.
      
      Because of the VMA walk the code is slightly different on nommu systems.
      This patch avoids separate /proc/pid/exe code on nommu systems.  Instead of
      walking the VMAs to find the first executable file-backed VMA we store a
      reference to the exec'd file in the mm_struct.
      
      That reference would prevent the filesystem holding the executable file
      from being unmounted even after unmapping the VMAs.  So we track the number
      of VM_EXECUTABLE VMAs and drop the new reference when the last one is
      unmapped.  This avoids pinning the mounted filesystem.
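
      The counting scheme described above can be sketched in user space as
      follows.  All struct and function names here are illustrative
      assumptions, and a plain integer stands in for the real file reference
      count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: only the counting logic is modelled. */
struct file_sketch {
	int refcount;
};

struct mm_sketch {
	struct file_sketch *exe_file;	/* reference to the exec'd file */
	unsigned long num_exe_vmas;	/* live VM_EXECUTABLE mappings */
};

/* exec: remember the file and pin it with one extra reference. */
static void set_exe_file(struct mm_sketch *mm, struct file_sketch *f)
{
	assert(mm != NULL && f != NULL);
	f->refcount++;
	mm->exe_file = f;
	mm->num_exe_vmas = 0;
}

/* A VM_EXECUTABLE mapping was added to the mm. */
static void added_exe_vma(struct mm_sketch *mm)
{
	assert(mm != NULL);
	mm->num_exe_vmas++;
}

/* A VM_EXECUTABLE mapping went away; drop the file with the last one. */
static void removed_exe_vma(struct mm_sketch *mm)
{
	assert(mm != NULL && mm->num_exe_vmas > 0);
	if (--mm->num_exe_vmas == 0 && mm->exe_file != NULL) {
		mm->exe_file->refcount--;
		mm->exe_file = NULL;
	}
}
```

      Once the last executable mapping is gone the mm no longer holds the
      file, so in this model nothing keeps the filesystem busy.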
      
      [akpm@linux-foundation.org: improve comments]
      [yamamoto@valinux.co.jp: fix dup_mmap]
      Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: David Howells <dhowells@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      925d1c40
  17. Apr 28, 2008
    • mempolicy: rename mpol_copy to mpol_dup · 846a16bf
      Lee Schermerhorn authored
      
      
      This patch renames mpol_copy() to mpol_dup() because, well, that's what it
      does.  Like, e.g., strdup() for strings, mpol_dup() takes a pointer to an
      existing mempolicy, allocates a new one and copies the contents.
      
      In a later patch, I want to use the name mpol_copy() to copy the contents from
      one mempolicy to another like, e.g., strcpy() does for strings.
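
      The naming distinction follows the strdup()/strcpy() analogy and can be
      sketched like this.  The struct and the policy_copy()/policy_dup() names
      are hypothetical stand-ins, not the kernel's struct mempolicy or its
      functions.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, cut-down policy object; not the kernel's struct mempolicy. */
struct policy_sketch {
	int mode;
	unsigned long nodes;
};

/* strcpy()-like: overwrite an existing destination object. */
static void policy_copy(struct policy_sketch *dst, const struct policy_sketch *src)
{
	assert(dst != NULL && src != NULL);
	*dst = *src;
}

/* strdup()-like: allocate a new object, then copy the contents into it. */
static struct policy_sketch *policy_dup(const struct policy_sketch *src)
{
	struct policy_sketch *copy = malloc(sizeof(*copy));

	if (copy != NULL)
		policy_copy(copy, src);
	return copy;	/* caller owns the result and must free() it */
}
```

      The "dup" form returns a new object the caller must release, while the
      "copy" form only fills storage the caller already owns.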
      
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      846a16bf
    • mempolicy: rename mpol_free to mpol_put · f0be3d32
      Lee Schermerhorn authored
      
      
      This is a change that was requested some time ago by Mel Gorman.  Makes sense
      to me, so here it is.
      
      Note: I retain the name "mpol_free_shared_policy()" because it actually does
      free the shared_policy, which is NOT a reference counted object.  However, ...
      
      The mempolicy object[s] referenced by the shared_policy are reference counted,
      so mpol_put() is used to release the reference held by the shared_policy.  The
      mempolicy might not be freed at this time, because some task attached to the
      shared object associated with the shared policy may be in the process of
      allocating a page based on the mempolicy.  In that case, the task performing
      the allocation will hold a reference on the mempolicy, obtained via
      mpol_shared_policy_lookup().  The mempolicy will be freed when all tasks
      holding such a reference have called mpol_put() for the mempolicy.
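
      The get/put lifetime described above can be modelled as below.  This is
      a single-threaded sketch with invented names; a real implementation
      would use an atomic counter and actually release the memory, where this
      model only sets a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reference-counted policy object. */
struct policy_ref_sketch {
	int refcnt;
	bool freed;	/* stands in for actually releasing the memory */
};

/* Take a reference, e.g. while allocating a page based on the policy. */
static void policy_get(struct policy_ref_sketch *p)
{
	assert(p != NULL && !p->freed);
	p->refcnt++;
}

/* Drop one reference; the object goes away only with the last holder. */
static void policy_put(struct policy_ref_sketch *p)
{
	assert(p != NULL && p->refcnt > 0);
	if (--p->refcnt == 0)
		p->freed = true;
}
```

      If the shared policy drops its reference while an allocating task still
      holds one, the object survives until that task calls the put function.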
      
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f0be3d32
    • mmap_region: cleanup the final vma_merge() related code · 4d3d5b41
      Oleg Nesterov authored
      
      
      It is not easy to actually understand the "if (!file || !vma_merge())"
      code, so turn it into "if (file && vma_merge())".  This makes it immediately
      obvious that the subsequent "if (file)" is superfluous.
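
      The two forms are equivalent by De Morgan's law once the branches are
      swapped, and both only attempt the merge when a file is present.  The
      sketch below checks that in user space; try_merge() and the two
      needs_new_vma_*() helpers are invented for illustration and are not the
      mmap_region() code.

```c
#include <assert.h>
#include <stdbool.h>

/* Counts merge attempts so the short-circuit behaviour can be compared. */
static int merge_calls;

/* Stand-in for vma_merge(): records the call and reports the given result. */
static bool try_merge(bool result)
{
	merge_calls++;
	return result;
}

/* Old shape: the "link a new vma" path is the taken branch. */
static bool needs_new_vma_old(bool file, bool mergeable)
{
	if (!file || !try_merge(mergeable))
		return true;	/* no file, or merge failed: insert a new vma */
	return false;		/* merged into a neighbouring vma */
}

/* New shape: the "merged" path is the taken branch. */
static bool needs_new_vma_new(bool file, bool mergeable)
{
	if (file && try_merge(mergeable))
		return false;	/* file is known to be set on this path */
	return true;
}
```

      In the second form the merged branch is only reachable when file is
      set, which is why a further test of file inside it adds nothing.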
      
      As Hugh Dickins pointed out, we can also factor out the ->i_writecount
      corrections, and add a small comment about that.
      
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4d3d5b41
  18. Feb 09, 2008
    • mm: special mapping nopage · b1d0e4f5
      Nicholas Piggin authored
      
      
      Convert special mapping install from nopage to fault.
      
      Because the "vm_file" is NULL for the special mapping, the generic VM
      code has messed up "vm_pgoff" thinking that it's an anonymous mapping
      and the offset doesn't matter.  For that reason, we need to undo the
      vm_pgoff offset that got added into vmf->pgoff.
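
      A rough user-space model of the offset correction follows.  The
      function name, the page struct and the parameter names are assumptions;
      only the subtraction and the walk over a NULL-terminated page array are
      meant to reflect the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page. */
struct page_sketch {
	int id;
};

/*
 * pages is a NULL-terminated array.  fault_pgoff models vmf->pgoff, which
 * already has vm_pgoff added in, so subtract it to get an array index.
 */
static struct page_sketch *special_lookup(struct page_sketch **pages,
					  unsigned long fault_pgoff,
					  unsigned long vm_pgoff)
{
	unsigned long idx;

	assert(pages != NULL && fault_pgoff >= vm_pgoff);
	idx = fault_pgoff - vm_pgoff;	/* undo the offset that was added */

	/* Walk instead of indexing so we never step past the terminator. */
	while (idx > 0 && *pages != NULL) {
		pages++;
		idx--;
	}
	return *pages;	/* NULL means the fault lies beyond the array */
}
```

      A NULL result corresponds to a fault outside the special mapping's
      pages, which the caller would have to report as an error.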
      
      [ We _really_ should clean that up - either by making this whole special
        mapping code just use a real file entry rather than that ugly array of
        "struct page" pointers, or by just making the VM code realize that
        even if vm_file is NULL it may not be a regular anonymous mmap.
      							 - Linus ]
      
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: linux-mm@kvack.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b1d0e4f5
  19. Feb 06, 2008
    • brk: check the lower bound properly · 4cc6028d
      Jiri Kosina authored
      
      
      There is a check in sys_brk(), that tries to make sure that we do not
      underflow the area that is dedicated to brk heap.
      
      The check is however wrong, as it assumes that the brk area starts immediately
      after the end of the code (+bss), which does not hold, for example, in
      environments with a randomized brk start. The proper way is to check that
      the address is not below the start_brk address.
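
      The difference between the two checks can be illustrated with a small
      model.  The struct and function names are assumptions, and end_code is
      used here as a stand-in for "the end of the code (+bss)"; this is not
      the sys_brk() source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down view of the address-space layout. */
struct mm_layout_sketch {
	unsigned long end_code;		/* end of the program image */
	unsigned long start_brk;	/* where the heap really begins */
};

/* Old check: assumes the heap begins right after the program image. */
static bool brk_allowed_old(const struct mm_layout_sketch *mm, unsigned long brk)
{
	assert(mm != NULL);
	return brk >= mm->end_code;
}

/* Fixed check: compare against the actual start of the brk area. */
static bool brk_allowed_new(const struct mm_layout_sketch *mm, unsigned long brk)
{
	assert(mm != NULL);
	return brk >= mm->start_brk;
}
```

      With a randomized heap start there is a gap between the two fields, and
      only the second check rejects a request that falls inside that gap.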
      
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      4cc6028d
  20. Feb 05, 2008
  21. Feb 04, 2008
  22. Jan 30, 2008
    • x86: randomize brk · c1d171a0
      Jiri Kosina authored
      
      
      Randomize the location of the heap (brk) for i386 and x86_64.  The heap start
      is randomized within the range from the current brk location up to an offset
      of 0x02000000 for both architectures.  This, together with
      pie-executable-randomization.patch and
      pie-executable-randomization-fix.patch, should make the address space
      randomization on i386 and x86_64 complete.
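
      As a rough illustration of picking a heap start inside that window, the
      sketch below takes a caller-supplied random number and returns a
      page-aligned address.  The function name, the 4 KiB page size and the
      way the random value is reduced are all assumptions; this is not the
      architecture code from the patch.

```c
#include <assert.h>

#define SK_PAGE_SIZE	4096UL		/* assumption: 4 KiB pages */
#define SK_BRK_RANGE	0x02000000UL	/* offset quoted in the commit message */

/*
 * Return a page-aligned heap start in [brk, brk + SK_BRK_RANGE).
 * rnd is any random number supplied by the caller.
 */
static unsigned long randomized_brk(unsigned long brk, unsigned long rnd)
{
	/* Round the current brk up to a page boundary. */
	unsigned long base = (brk + SK_PAGE_SIZE - 1) & ~(SK_PAGE_SIZE - 1);
	/* Reduce rnd to a page-aligned offset below the range limit. */
	unsigned long offset = (rnd % SK_BRK_RANGE) & ~(SK_PAGE_SIZE - 1);

	assert(base >= brk);	/* no wrap-around while aligning */
	return base + offset;
}
```

      Any program that assumes the heap begins right after its last variable
      would see a different, unpredictable start address under this scheme.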
      
      Arjan says:
      
      This is known to break older versions of some emacs variants, whose dumper
      code assumed that the last variable declared in the program is equal to the
      start of the dynamically allocated memory region.
      
      (The dumper is the code where emacs effectively dumps core at the end of its
      compilation stage; this coredump is then loaded as the main program during
      normal use)
      
      iirc this was 5 years or so ago; we found this way back when I was at RH and we
      first did the security stuff there (including this brk randomization).  It
      wasn't all variants of emacs, and it got fixed as a result (I vaguely remember
      that emacs already had code to deal with it for other archs/oses, just
      ifdeffed wrongly).
      
      It's a rare and wrong assumption as a general thing, just on x86 it mostly
      happened to be true (but to be honest, it'll break too if gcc does
      something fancy or if the linker does a non-standard order).  Still it's
      something we should at least document.
      
      Note 2: afaik it only broke the emacs *build*.  I'm not 100% sure about that
      (it IS 5 years ago) though.
      
      [ akpm@linux-foundation.org: deuglification ]
      
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Jakub Jelinek <jakub@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      c1d171a0
  23. Jan 25, 2008
  24. Dec 05, 2007
  25. Oct 23, 2007