Skip to content
  1. Apr 27, 2013
    • Linus Torvalds's avatar
      vm: add no-mmu vm_iomap_memory() stub · 3c0b9de6
      Linus Torvalds authored
      
      
      I think we could just move the full vm_iomap_memory() function into
      util.h or similar, but I didn't get any reply from anybody actually
      using nommu even to this trivial patch, so I'm not going to touch it any
      more than required.
      
      Here's the fairly minimal stub to make the nommu case at least
      potentially work.  It doesn't seem like anybody cares, though.
      
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3c0b9de6
  2. Apr 17, 2013
  3. Apr 16, 2013
    • Linus Torvalds's avatar
      vm: add vm_iomap_memory() helper function · b4cbb197
      Linus Torvalds authored
      
      
      Various drivers end up replicating the code to mmap() their memory
      buffers into user space, and our core memory remapping function may be
      very flexible but it is unnecessarily complicated for the common cases
      to use.
      
      Our internal VM uses pfn's ("page frame numbers") which simplifies
      things for the VM, and allows us to pass physical addresses around in a
      denser and more efficient format than passing a "phys_addr_t" around,
      and having to shift it up and down by the page size.  But it just means
      that drivers end up doing that shifting instead at the interface level.
      
      It also means that drivers end up mucking around with internal VM things
      like the vma details (vm_pgoff, vm_start/end) way more than they really
      need to.
      
      So this just exports a function to map a certain physical memory range
      into user space (using a phys_addr_t based interface that is much more
      natural for a driver) and hides all the complexity from the driver.
      Some drivers will still end up tweaking the vm_page_prot details for
      things like prefetching or cacheability etc, but that's actually
      relevant to the driver, rather than caring about what the page offset of
      the mapping is into the particular IO memory region.
      
      Acked-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b4cbb197
  4. Apr 12, 2013
    • Dave Hansen's avatar
      x86-32: Fix possible incomplete TLB invalidate with PAE pagetables · 1de14c3c
      Dave Hansen authored
      This patch attempts to fix:
      
      	https://bugzilla.kernel.org/show_bug.cgi?id=56461
      
      
      
      The symptom is a crash and messages like this:
      
      	chrome: Corrupted page table at address 34a03000
      	*pdpt = 0000000000000000 *pde = 0000000000000000
      	Bad pagetable: 000f [#1] PREEMPT SMP
      
      Ingo guesses this got introduced by commit 611ae8e3 ("x86/tlb:
      enable tlb flush range support for x86") since that code started to free
      unused pagetables.
      
      On x86-32 PAE kernels, that new code has the potential to free an entire
      PMD page and will clear one of the four page-directory-pointer-table
      (aka pgd_t entries).
      
      The hardware aggressively "caches" these top-level entries and invlpg
      does not actually affect the CPU's copy.  If we clear one we *HAVE* to
      do a full TLB flush, otherwise we might continue using a freed pmd page.
      (note, we do this properly on the population side in pud_populate()).
      
      This patch tracks whenever we clear one of these entries in the 'struct
      mmu_gather', and ensures that we follow up with a full tlb flush.
      
      BTW, I disassembled and checked that:
      
      	if (tlb->fullmm == 0)
      and
      	if (!tlb->fullmm && !tlb->need_flush_all)
      
      generate essentially the same code, so there should be zero impact there
      to the !PAE case.
      
      Signed-off-by: default avatarDave Hansen <dave.hansen@linux.intel.com>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Artem S Tashkinov <t.artem@mailcity.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1de14c3c
  5. Apr 04, 2013
    • Jan Stancek's avatar
      mm: prevent mmap_cache race in find_vma() · b6a9b7f6
      Jan Stancek authored
      
      
      find_vma() can be called by multiple threads with read lock
      held on mm->mmap_sem and any of them can update mm->mmap_cache.
      Prevent compiler from re-fetching mm->mmap_cache, because other
      readers could update it in the meantime:
      
                     thread 1                             thread 2
                                              |
        find_vma()                            |  find_vma()
          struct vm_area_struct *vma = NULL;  |
          vma = mm->mmap_cache;               |
          if (!(vma && vma->vm_end > addr     |
              && vma->vm_start <= addr)) {    |
                                              |    mm->mmap_cache = vma;
          return vma;                         |
           ^^ compiler may optimize this      |
              local variable out and re-read  |
              mm->mmap_cache                  |
      
      This issue can be reproduced with gcc-4.8.0-1 on s390x by running
      mallocstress testcase from LTP, which triggers:
      
        kernel BUG at mm/rmap.c:1088!
          Call Trace:
           ([<000003d100c57000>] 0x3d100c57000)
            [<000000000023a1c0>] do_wp_page+0x2fc/0xa88
            [<000000000023baae>] handle_pte_fault+0x41a/0xac8
            [<000000000023d832>] handle_mm_fault+0x17a/0x268
            [<000000000060507a>] do_protection_exception+0x1e2/0x394
            [<0000000000603a04>] pgm_check_handler+0x138/0x13c
            [<000003fffcf1f07a>] 0x3fffcf1f07a
          Last Breaking-Event-Address:
            [<000000000024755e>] page_add_new_anon_rmap+0xc2/0x168
      
      Thanks to Jakub Jelinek for his insight on gcc and helping to
      track this down.
      
      Signed-off-by: default avatarJan Stancek <jstancek@redhat.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b6a9b7f6
  6. Mar 29, 2013
  7. Mar 22, 2013
  8. Mar 15, 2013
    • Michel Lespinasse's avatar
      mm/fremap.c: fix possible oops on error path · a2362d24
      Michel Lespinasse authored
      
      
      The vm_flags introduced in 6d7825b1 ("mm/fremap.c: fix oops on error
      path") is supposed to avoid a compiler warning about unitialized
      vm_flags without changing the generated code.
      
      However I am concerned that this is going to be very brittle, and fail
      with some compiler versions. The failure could be either of:
      
      - compiler could actually load vma->vm_flags before checking for the
        !vma condition, thus reintroducing the oops
      
      - compiler could optimize out the !vma check, since the pointer just got
        dereferenced shortly before (so the compiler knows it can't be NULL!)
      
      I propose reversing this part of the change and initializing vm_flags to 0
      just to avoid the bogus uninitialized use warning.
      
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Tommi Rantala <tt.rantala@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a2362d24
  9. Mar 13, 2013
  10. Mar 12, 2013
    • Stephen Rothwell's avatar
      Select VIRT_TO_BUS directly where needed · 4febd95a
      Stephen Rothwell authored
      
      
      In commit 887cbce0 ("arch Kconfig: centralise ARCH_NO_VIRT_TO_BUS")
      I introduced the config sybmol HAVE_VIRT_TO_BUS and selected that where
      needed.  I am not sure what I was thinking.  Instead, just directly
      select VIRT_TO_BUS where it is needed.
      
      Signed-off-by: default avatarStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4febd95a
    • Mathieu Desnoyers's avatar
      Fix: compat_rw_copy_check_uvector() misuse in aio, readv, writev, and security keys · 8aec0f5d
      Mathieu Desnoyers authored
      
      
      Looking at mm/process_vm_access.c:process_vm_rw() and comparing it to
      compat_process_vm_rw() shows that the compatibility code requires an
      explicit "access_ok()" check before calling
      compat_rw_copy_check_uvector(). The same difference seems to appear when
      we compare fs/read_write.c:do_readv_writev() to
      fs/compat.c:compat_do_readv_writev().
      
      This subtle difference between the compat and non-compat requirements
      should probably be debated, as it seems to be error-prone. In fact,
      there are two others sites that use this function in the Linux kernel,
      and they both seem to get it wrong:
      
      Now shifting our attention to fs/aio.c, we see that aio_setup_iocb()
      also ends up calling compat_rw_copy_check_uvector() through
      aio_setup_vectored_rw(). Unfortunately, the access_ok() check appears to
      be missing. Same situation for
      security/keys/compat.c:compat_keyctl_instantiate_key_iov().
      
      I propose that we add the access_ok() check directly into
      compat_rw_copy_check_uvector(), so callers don't have to worry about it,
      and it therefore makes the compat call code similar to its non-compat
      counterpart. Place the access_ok() check in the same location where
      copy_from_user() can trigger a -EFAULT error in the non-compat code, so
      the ABI behaviors are alike on both compat and non-compat.
      
      While we are here, fix compat_do_readv_writev() so it checks for
      compat_rw_copy_check_uvector() negative return values.
      
      And also, fix a memory leak in compat_keyctl_instantiate_key_iov() error
      handling.
      
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Acked-by: default avatarAl Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: default avatarMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8aec0f5d
  11. Mar 08, 2013
    • Konstantin Khlebnikov's avatar
      memcg: initialize kmem-cache destroying work earlier · 15cf17d2
      Konstantin Khlebnikov authored
      
      
      Fix a warning from lockdep caused by calling cancel_work_sync() for
      uninitialized struct work.  This path has been triggered by destructon
      kmem-cache hierarchy via destroying its root kmem-cache.
      
        cache ffff88003c072d80
        obj ffff88003b410000 cache ffff88003c072d80
        obj ffff88003b924000 cache ffff88003c20bd40
        INFO: trying to register non-static key.
        the code is fine but needs lockdep annotation.
        turning off the locking correctness validator.
        Pid: 2825, comm: insmod Tainted: G           O 3.9.0-rc1-next-20130307+ #611
        Call Trace:
          __lock_acquire+0x16a2/0x1cb0
          lock_acquire+0x8a/0x120
          flush_work+0x38/0x2a0
          __cancel_work_timer+0x89/0xf0
          cancel_work_sync+0xb/0x10
          kmem_cache_destroy_memcg_children+0x81/0xb0
          kmem_cache_destroy+0xf/0xe0
          init_module+0xcb/0x1000 [kmem_test]
          do_one_initcall+0x11a/0x170
          load_module+0x19b0/0x2320
          SyS_init_module+0xc6/0xf0
          system_call_fastpath+0x16/0x1b
      
      Example module to demonstrate:
      
        #include <linux/module.h>
        #include <linux/slab.h>
        #include <linux/mm.h>
        #include <linux/workqueue.h>
      
        int __init mod_init(void)
        {
        	int size = 256;
        	struct kmem_cache *cache;
        	void *obj;
        	struct page *page;
      
        	cache = kmem_cache_create("kmem_cache_test", size, size, 0, NULL);
        	if (!cache)
        		return -ENOMEM;
      
        	printk("cache %p\n", cache);
      
        	obj = kmem_cache_alloc(cache, GFP_KERNEL);
        	if (obj) {
        		page = virt_to_head_page(obj);
        		printk("obj %p cache %p\n", obj, page->slab_cache);
        		kmem_cache_free(cache, obj);
        	}
      
        	flush_scheduled_work();
      
        	obj = kmem_cache_alloc(cache, GFP_KERNEL);
        	if (obj) {
        		page = virt_to_head_page(obj);
        		printk("obj %p cache %p\n", obj, page->slab_cache);
        		kmem_cache_free(cache, obj);
        	}
      
        	kmem_cache_destroy(cache);
      
        	return -EBUSY;
        }
      
        module_init(mod_init);
        MODULE_LICENSE("GPL");
      
      Signed-off-by: default avatarKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Glauber Costa <glommer@parallels.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      15cf17d2
    • Hugh Dickins's avatar
      ksm: fix m68k build: only NUMA needs pfn_to_nid · d8fc16a8
      Hugh Dickins authored
      
      
      A CONFIG_DISCONTIGMEM=y m68k config gave
      
        mm/ksm.c: In function `get_kpfn_nid':
        mm/ksm.c:492: error: implicit declaration of function `pfn_to_nid'
      
      linux/mmzone.h declares it for CONFIG_SPARSEMEM and CONFIG_FLATMEM, but
      expects the arch's asm/mmzone.h to declare it for CONFIG_DISCONTIGMEM
      (see arch/mips/include/asm/mmzone.h for example).
      
      Or perhaps it is only expected when CONFIG_NUMA=y: too much of a maze,
      and m68k got away without it so far, so fix the build in mm/ksm.c.
      
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Reported-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Cc: Petr Holasek <pholasek@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d8fc16a8
    • KOSAKI Motohiro's avatar
      mm/mempolicy.c: fix sp_node_init() argument ordering · 7880639c
      KOSAKI Motohiro authored
      
      
      Currently, n_new is wrongly initialized.  start and end parameter are
      inverted.  Let's fix it.
      
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Dave Jones <davej@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7880639c
    • Hillf Danton's avatar
      mm/mempolicy.c: fix wrong sp_node insertion · 5ca39575
      Hillf Danton authored
      
      
      n->end is accessed in sp_insert(). Thus it should be update
      before calling sp_insert(). This mistake may make kernel panic.
      
      Signed-off-by: default avatarHillf Danton <dhillf@gmail.com>
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Dave Jones <davej@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5ca39575
  12. Mar 02, 2013
    • Yinghai Lu's avatar
      x86, ACPI, mm: Revert movablemem_map support · 20e6926d
      Yinghai Lu authored
      
      
      Tim found:
      
        WARNING: at arch/x86/kernel/smpboot.c:324 topology_sane.isra.2+0x6f/0x80()
        Hardware name: S2600CP
        sched: CPU #1's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
        smpboot: Booting Node   1, Processors  #1
        Modules linked in:
        Pid: 0, comm: swapper/1 Not tainted 3.9.0-0-generic #1
        Call Trace:
          set_cpu_sibling_map+0x279/0x449
          start_secondary+0x11d/0x1e5
      
      Don Morris reproduced on a HP z620 workstation, and bisected it to
      commit e8d19552 ("acpi, memory-hotplug: parse SRAT before memblock
      is ready")
      
      It turns out movable_map has some problems, and it breaks several things
      
      1. numa_init is called several times, NOT just for srat. so those
      	nodes_clear(numa_nodes_parsed)
      	memset(&numa_meminfo, 0, sizeof(numa_meminfo))
         can not be just removed.  Need to consider sequence is: numaq, srat, amd, dummy.
         and make fall back path working.
      
      2. simply split acpi_numa_init to early_parse_srat.
         a. that early_parse_srat is NOT called for ia64, so you break ia64.
         b.  for (i = 0; i < MAX_LOCAL_APIC; i++)
      	     set_apicid_to_node(i, NUMA_NO_NODE)
           still left in numa_init. So it will just clear result from early_parse_srat.
           it should be moved before that....
         c.  it breaks ACPI_TABLE_OVERIDE...as the acpi table scan is moved
             early before override from INITRD is settled.
      
      3. that patch TITLE is total misleading, there is NO x86 in the title,
         but it changes critical x86 code. It caused x86 guys did not
         pay attention to find the problem early. Those patches really should
         be routed via tip/x86/mm.
      
      4. after that commit, following range can not use movable ram:
        a. real_mode code.... well..funny, legacy Node0 [0,1M) could be hot-removed?
        b. initrd... it will be freed after booting, so it could be on movable...
        c. crashkernel for kdump...: looks like we can not put kdump kernel above 4G
      	anymore.
        d. init_mem_mapping: can not put page table high anymore.
        e. initmem_init: vmemmap can not be high local node anymore. That is
           not good.
      
      If node is hotplugable, the mem related range like page table and
      vmemmap could be on the that node without problem and should be on that
      node.
      
      We have workaround patch that could fix some problems, but some can not
      be fixed.
      
      So just remove that offending commit and related ones including:
      
       f7210e6c ("mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to
          protect movablecore_map in memblock_overlaps_region().")
      
       01a178a9 ("acpi, memory-hotplug: support getting hotplug info from
          SRAT")
      
       27168d38 ("acpi, memory-hotplug: extend movablemem_map ranges to
          the end of node")
      
       e8d19552 ("acpi, memory-hotplug: parse SRAT before memblock is
          ready")
      
       fb06bc8e ("page_alloc: bootmem limit with movablecore_map")
      
       42f47e27 ("page_alloc: make movablemem_map have higher priority")
      
       6981ec31 ("page_alloc: introduce zone_movable_limit[] to keep
          movable limit for nodes")
      
       34b71f1e ("page_alloc: add movable_memmap kernel parameter")
      
       4d59a751 ("x86: get pg_data_t's memory from other node")
      
      Later we should have patches that will make sure kernel put page table
      and vmemmap on local node ram instead of push them down to node0.  Also
      need to find way to put other kernel used ram to local node ram.
      
      Reported-by: default avatarTim Gardner <tim.gardner@canonical.com>
      Reported-by: default avatarDon Morris <don.morris@hp.com>
      Bisected-by: default avatarDon Morris <don.morris@hp.com>
      Tested-by: default avatarDon Morris <don.morris@hp.com>
      Signed-off-by: default avatarYinghai Lu <yinghai@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      20e6926d
    • Al Viro's avatar
      fix nommu breakage in shmem.c · 26567cdb
      Al Viro authored
      
      
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      26567cdb
  13. Feb 28, 2013
    • Sasha Levin's avatar
      hlist: drop the node parameter from iterators · b67bfe0d
      Sasha Levin authored
      
      
      I'm not sure why, but the hlist for each entry iterators were conceived
      
              list_for_each_entry(pos, head, member)
      
      The hlist ones were greedy and wanted an extra parameter:
      
              hlist_for_each_entry(tpos, pos, head, member)
      
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      they don't really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small amount of places were using the 'node' parameter, this
       was modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      
      @@
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      @@
      
      -T b;
          <+... when != b
      (
      hlist_for_each_entry(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue(a,
      - b,
      c) S
      |
      hlist_for_each_entry_from(a,
      - b,
      c) S
      |
      hlist_for_each_entry_rcu(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_rcu_bh(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue_rcu_bh(a,
      - b,
      c) S
      |
      for_each_busy_worker(a, c,
      - b,
      d) S
      |
      ax25_uid_for_each(a,
      - b,
      c) S
      |
      ax25_for_each(a,
      - b,
      c) S
      |
      inet_bind_bucket_for_each(a,
      - b,
      c) S
      |
      sctp_for_each_hentry(a,
      - b,
      c) S
      |
      sk_for_each(a,
      - b,
      c) S
      |
      sk_for_each_rcu(a,
      - b,
      c) S
      |
      sk_for_each_from
      -(a, b)
      +(a)
      S
      + sk_for_each_from(a) S
      |
      sk_for_each_safe(a,
      - b,
      c, d) S
      |
      sk_for_each_bound(a,
      - b,
      c) S
      |
      hlist_for_each_entry_safe(a,
      - b,
      c, d, e) S
      |
      hlist_for_each_entry_continue_rcu(a,
      - b,
      c) S
      |
      nr_neigh_for_each(a,
      - b,
      c) S
      |
      nr_neigh_for_each_safe(a,
      - b,
      c, d) S
      |
      nr_node_for_each(a,
      - b,
      c) S
      |
      nr_node_for_each_safe(a,
      - b,
      c, d) S
      |
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      |
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      |
      for_each_host(a,
      - b,
      c) S
      |
      for_each_host_safe(a,
      - b,
      c, d) S
      |
      for_each_mesh_entry(a,
      - b,
      c, d) S
      )
          ...+>
      
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foudnation.org: redo intrusive kvm changes]
      Tested-by: default avatarPeter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b67bfe0d
    • Stephen Rothwell's avatar
      arch Kconfig: centralise CONFIG_ARCH_NO_VIRT_TO_BUS · 887cbce0
      Stephen Rothwell authored
      
      
      Change it to CONFIG_HAVE_VIRT_TO_BUS and set it in all architecures
      that already provide virt_to_bus().
      
      Signed-off-by: default avatarStephen Rothwell <sfr@canb.auug.org.au>
      Reviewed-by: default avatarJames Hogan <james.hogan@imgtec.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: H Hartley Sweeten <hartleys@visionengravers.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Vineet Gupta <Vineet.Gupta1@synopsys.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      887cbce0
    • Michel Lespinasse's avatar
      mm: accelerate munlock() treatment of THP pages · ff6a6da6
      Michel Lespinasse authored
      
      
      munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE
      at a time.  When munlocking THP pages (or the huge zero page), this
      resulted in taking the mm->page_table_lock 512 times in a row.
      
      We can do better by making use of the page_mask returned by
      follow_page_mask (for the huge zero page case), or the size of the page
      munlock_vma_page() operated on (for the true THP page case).
      
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ff6a6da6
  14. Feb 27, 2013
    • Linus Torvalds's avatar
      mm: do not grow the stack vma just because of an overrun on preceding vma · 09884964
      Linus Torvalds authored
      
      
      The stack vma is designed to grow automatically (marked with VM_GROWSUP
      or VM_GROWSDOWN depending on architecture) when an access is made beyond
      the existing boundary.  However, particularly if you have not limited
      your stack at all ("ulimit -s unlimited"), this can cause the stack to
      grow even if the access was really just one past *another* segment.
      
      And that's wrong, especially since we first grow the segment, but then
      immediately later enforce the stack guard page on the last page of the
      segment.  So _despite_ first growing the stack segment as a result of
      the access, the kernel will then make the access cause a SIGSEGV anyway!
      
      So do the same logic as the guard page check does, and consider an
      access to within one page of the next segment to be a bad access, rather
      than growing the stack to abut the next segment.
      
      Reported-and-tested-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      09884964
  15. Feb 26, 2013
  16. Feb 24, 2013
    • Hugh Dickins's avatar
      ksm: allocate roots when needed · ef53d16c
      Hugh Dickins authored
      
      
      It is a pity to have MAX_NUMNODES+MAX_NUMNODES tree roots statically
      allocated, particularly when very few users will ever actually tune
      merge_across_nodes 0 to use more than 1+1 of those trees.  Not a big
      deal (only 16kB wasted on each machine with CONFIG_MAXSMP), but a pity.
      
      Start off with 1+1 statically allocated, then if merge_across_nodes is
      ever tuned, allocate for nr_node_ids+nr_node_ids.  Do not attempt to
      free up the extra if it's tuned back, that would be a waste of effort.
      
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ef53d16c
    • Hugh Dickins's avatar
      mm: cleanup "swapcache" in do_swap_page · 56f31801
      Hugh Dickins authored
      
      
      I dislike the way in which "swapcache" gets used in do_swap_page():
      there is always a page from swapcache there (even if maybe uncached by
      the time we lock it), but tests are made according to "swapcache".
      Rework that with "page != swapcache", as has been done in unuse_pte().
      
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      56f31801
    • Hugh Dickins's avatar
      mm,ksm: swapoff might need to copy · 9e16b7fb
      Hugh Dickins authored
      
      
      Before establishing that KSM page migration was the cause of my
      WARN_ON_ONCE(page_mapped(page))s, I suspected that they came from the
      lack of a ksm_might_need_to_copy() in swapoff's unuse_pte() - which in
      many respects is equivalent to faulting in a page.
      
      In fact I've never caught that as the cause: but in theory it does at
      least need the KSM_RUN_UNMERGE check in ksm_might_need_to_copy(), to
      avoid bringing a KSM page back in when it's not supposed to be.
      
      I intended to copy how it's done in do_swap_page(), but have a strong
      aversion to how "swapcache" ends up being used there: rework it with
      "page != swapcache".
      
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9e16b7fb
    • Hugh Dickins's avatar
      mm,ksm: FOLL_MIGRATION do migration_entry_wait · 5117b3b8
      Hugh Dickins authored
      
      
      In "ksm: remove old stable nodes more thoroughly" I said that I'd never
      seen its WARN_ON_ONCE(page_mapped(page)).  True at the time of writing,
      but it soon appeared once I tried fuller tests on the whole series.
      
      It turned out to be due to the KSM page migration itself: unmerge_and_
      remove_all_rmap_items() failed to locate and replace all the KSM pages,
      because of that hiatus in page migration when old pte has been replaced
      by migration entry, but not yet by new pte.  follow_page() finds no page
      at that instant, but a KSM page reappears shortly after, without a
      fault.
      
      Add FOLL_MIGRATION flag, so follow_page() can do migration_entry_wait()
      for KSM's break_cow().  I'd have preferred to avoid another flag, and do
      it every time, in case someone else makes the same easy mistake; but did
      not find another transgressor (the common get_user_pages() is of course
      safe), and cannot be sure that every follow_page() caller is prepared to
      sleep - ia64's xencomm_vtop()? Now, THP's wait_split_huge_page() can
      already sleep there, since anon_vma locking was changed to mutex, but
      maybe that's somehow excluded.
      
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5117b3b8
    • Hugh Dickins's avatar
      ksm: shrink 32-bit rmap_item back to 32 bytes · bc56620b
      Hugh Dickins authored
      
      
      Think of struct rmap_item as an extension of struct page (restricted to
      MADV_MERGEABLE areas): there may be a lot of them, we need to keep them
      small, especially on 32-bit architectures of limited lowmem.
      
      Siting "int nid" after "unsigned int checksum" works nicely on 64-bit,
      making no change to its 64-byte struct rmap_item; but bloats the 32-bit
      struct rmap_item from (nicely cache-aligned) 32 bytes to 36 bytes, which
      rounds up to 40 bytes once allocated from slab.  We'd better avoid that.
      
      Hey, I only just remembered that the anon_vma pointer in struct
      rmap_item has no purpose until the rmap_item is hung from a stable tree
      node (which has its own nid field); and rmap_item's nid field no purpose
      than to say which tree root to tell rb_erase() when unlinking from an
      unstable tree.
      
      Double them up in a union.  There's just one place where we set anon_vma
      early (when we already hold mmap_sem): now we must remove tree_rmap_item
      from its unstable tree there, before overwriting nid.  No need to
      spatter BUG()s around: we'd be seeing oopses if this were wrong.
      
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bc56620b
    • Hugh Dickins's avatar
      ksm: treat unstable nid like in stable tree · b599cbdf
      Hugh Dickins authored
      
      
      An inconsistency emerged in reviewing the NUMA node changes to KSM: when
      meeting a page from the wrong NUMA node in a stable tree, we say that
      it's okay for comparisons, but not as a leaf for merging; whereas when
      meeting a page from the wrong NUMA node in an unstable tree, we bail out
      immediately.
      
      Now, it might be that a wrong NUMA node in an unstable tree is more
      likely to correlate with instablility (different content, with rbnode
      now misplaced) than page migration; but even so, we are accustomed to
      instablility in the unstable tree.
      
      Without strong evidence for which strategy is generally better, I'd
      rather be consistent with what's done in the stable tree: accept a page
      from the wrong NUMA node for comparison, but not as a leaf for merging.
      
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b599cbdf
    • Hugh Dickins's avatar
      ksm: add some comments · 8fdb3dbf
      Hugh Dickins authored
      
      
      Added slightly more detail to the Documentation of merge_across_nodes, a
      few comments in areas indicated by review, and renamed get_ksm_page()'s
      argument from "locked" to "lock_it".  No functional change.
      
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8fdb3dbf
    • Greg Thelen's avatar
      tmpfs: fix mempolicy object leaks · 49cd0a5c
      Greg Thelen authored
      
      
      Fix several mempolicy leaks in the tmpfs mount logic.  These leaks are
      slow - on the order of one object leaked per mount attempt.
      
      Leak 1 (umount doesn't free mpol allocated in mount):
          while true; do
              mount -t tmpfs -o mpol=interleave,size=100M nodev /mnt
              umount /mnt
          done
      
      Leak 2 (errors parsing remount options will leak mpol):
          mount -t tmpfs -o size=100M nodev /mnt
          while true; do
              mount -o remount,mpol=interleave,size=x /mnt 2> /dev/null
          done
          umount /mnt
      
      Leak 3 (multiple mpol per mount leak mpol):
          while true; do
              mount -t tmpfs -o mpol=interleave,mpol=interleave,size=100M nodev /mnt
              umount /mnt
          done
      
      This patch fixes all of the above.  I could have broken the patch into
      three pieces but is seemed easier to review as one.
      
      [akpm@linux-foundation.org: fix handling of mpol_parse_str() errors, per Hugh]
      Signed-off-by: default avatarGreg Thelen <gthelen@google.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      49cd0a5c
    • Greg Thelen's avatar
      tmpfs: fix use-after-free of mempolicy object · 5f00110f
      Greg Thelen authored
      
      
      The tmpfs remount logic preserves filesystem mempolicy if the mpol=M
      option is not specified in the remount request.  A new policy can be
      specified if mpol=M is given.
      
      Before this patch remounting an mpol bound tmpfs without specifying
      mpol= mount option in the remount request would set the filesystem's
      mempolicy object to a freed mempolicy object.
      
      To reproduce the problem boot a DEBUG_PAGEALLOC kernel and run:
          # mkdir /tmp/x
      
          # mount -t tmpfs -o size=100M,mpol=interleave nodev /tmp/x
      
          # grep /tmp/x /proc/mounts
          nodev /tmp/x tmpfs rw,relatime,size=102400k,mpol=interleave:0-3 0 0
      
          # mount -o remount,size=200M nodev /tmp/x
      
          # grep /tmp/x /proc/mounts
          nodev /tmp/x tmpfs rw,relatime,size=204800k,mpol=??? 0 0
              # note ? garbage in mpol=... output above
      
          # dd if=/dev/zero of=/tmp/x/f count=1
              # panic here
      
      Panic:
          BUG: unable to handle kernel NULL pointer dereference at           (null)
          IP: [<          (null)>]           (null)
          [...]
          Oops: 0010 [#1] SMP DEBUG_PAGEALLOC
          Call Trace:
            mpol_shared_policy_init+0xa5/0x160
            shmem_get_inode+0x209/0x270
            shmem_mknod+0x3e/0xf0
            shmem_create+0x18/0x20
            vfs_create+0xb5/0x130
            do_last+0x9a1/0xea0
            path_openat+0xb3/0x4d0
            do_filp_open+0x42/0xa0
            do_sys_open+0xfe/0x1e0
            compat_sys_open+0x1b/0x20
            cstar_dispatch+0x7/0x1f
      
      Non-debug kernels will not crash immediately because referencing the
      dangling mpol will not cause a fault.  Instead the filesystem will
      reference a freed mempolicy object, which will cause unpredictable
      behavior.
      
      The problem boils down to a dropped mpol reference below if
      shmem_parse_options() does not allocate a new mpol:
      
          config = *sbinfo
          shmem_parse_options(data, &config, true)
          mpol_put(sbinfo->mpol)
          sbinfo->mpol = config.mpol  /* BUG: saves unreferenced mpol */
      
      This patch avoids the crash by not releasing the mempolicy if
      shmem_parse_options() doesn't create a new mpol.
      
      How far back does this issue go? I see it in both 2.6.36 and 3.3.  I did
      not look back further.
      
      Signed-off-by: default avatarGreg Thelen <gthelen@google.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5f00110f
    • Mel Gorman's avatar
      mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages · 67d46b29
      Mel Gorman authored
      
      
      Rob van der Heij reported the following (paraphrased) on private mail.
      
      	The scenario is that I want to avoid backups to fill up the page
      	cache and purge stuff that is more likely to be used again (this is
      	with s390x Linux on z/VM, so I don't give it as much memory that
      	we don't care anymore). So I have something with LD_PRELOAD that
      	intercepts the close() call (from tar, in this case) and issues
      	a posix_fadvise() just before closing the file.
      
      	This mostly works, except for small files (less than 14 pages)
      	that remains in page cache after the face.
      
      Unfortunately Rob has not had a chance to test this exact patch but the
      test program below should be reproducing the problem he described.
      
      The issue is the per-cpu pagevecs for LRU additions.  If the pages are
      added by one CPU but fadvise() is called on another then the pages
      remain resident as the invalidate_mapping_pages() only drains the local
      pagevecs via its call to pagevec_release().  The user-visible effect is
      that a program that uses fadvise() properly is not obeyed.
      
      A possible fix for this is to put the necessary smarts into
      invalidate_mapping_pages() to globally drain the LRU pagevecs if a
      pagevec page could not be discarded.  The downside with this is that an
      inode cache shrink would send a global IPI and memory pressure
      potentially causing global IPI storms is very undesirable.
      
      Instead, this patch adds a check during fadvise(POSIX_FADV_DONTNEED) to
      check if invalidate_mapping_pages() discarded all the requested pages.
      If a subset of pages are discarded it drains the LRU pagevecs and tries
      again.  If the second attempt fails, it assumes it is due to the pages
      being mapped, locked or dirty and does not care.  With this patch, an
      application using fadvise() correctly will be obeyed but there is a
      downside that a malicious application can force the kernel to send
      global IPIs and increase overhead.
      
      If accepted, I would like this to be considered as a -stable candidate.
      It's not an urgent issue but it's a system call that is not working as
      advertised which is weak.
      
      The following test program demonstrates the problem.  It should never
      report that pages are still resident but will without this patch.  It
      assumes that CPU 0 and 1 exist.
      
      int main() {
      	int fd;
      	int pagesize = getpagesize();
      	ssize_t written = 0, expected;
      	char *buf;
      	unsigned char *vec;
      	int resident, i;
      	cpu_set_t set;
      
      	/* Prepare a buffer for writing */
      	expected = FILESIZE_PAGES * pagesize;
      	buf = malloc(expected + 1);
      	if (buf == NULL) {
      		printf("ENOMEM\n");
      		exit(EXIT_FAILURE);
      	}
      	buf[expected] = 0;
      	memset(buf, 'a', expected);
      
      	/* Prepare the mincore vec */
      	vec = malloc(FILESIZE_PAGES);
      	if (vec == NULL) {
      		printf("ENOMEM\n");
      		exit(EXIT_FAILURE);
      	}
      
      	/* Bind ourselves to CPU 0 */
      	CPU_ZERO(&set);
      	CPU_SET(0, &set);
      	if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
      		perror("sched_setaffinity");
      		exit(EXIT_FAILURE);
      	}
      
      	/* open file, unlink and write buffer */
      	fd = open("fadvise-test-file", O_CREAT|O_EXCL|O_RDWR);
      	if (fd == -1) {
      		perror("open");
      		exit(EXIT_FAILURE);
      	}
      	unlink("fadvise-test-file");
      	while (written < expected) {
      		ssize_t this_write;
      		this_write = write(fd, buf + written, expected - written);
      
      		if (this_write == -1) {
      			perror("write");
      			exit(EXIT_FAILURE);
      		}
      
      		written += this_write;
      	}
      	free(buf);
      
      	/*
      	 * Force ourselves to another CPU. If fadvise only flushes the local
      	 * CPUs pagevecs then the fadvise will fail to discard all file pages
      	 */
      	CPU_ZERO(&set);
      	CPU_SET(1, &set);
      	if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
      		perror("sched_setaffinity");
      		exit(EXIT_FAILURE);
      	}
      
      	/* sync and fadvise to discard the page cache */
      	fsync(fd);
      	if (posix_fadvise(fd, 0, expected, POSIX_FADV_DONTNEED) == -1) {
      		perror("posix_fadvise");
      		exit(EXIT_FAILURE);
      	}
      
      	/* map the file and use mincore to see which parts of it are resident */
      	buf = mmap(NULL, expected, PROT_READ, MAP_SHARED, fd, 0);
      	if (buf == NULL) {
      		perror("mmap");
      		exit(EXIT_FAILURE);
      	}
      	if (mincore(buf, expected, vec) == -1) {
      		perror("mincore");
      		exit(EXIT_FAILURE);
      	}
      
      	/* Check residency */
      	for (i = 0, resident = 0; i < FILESIZE_PAGES; i++) {
      		if (vec[i])
      			resident++;
      	}
      	if (resident != 0) {
      		printf("Nr unexpected pages resident: %d\n", resident);
      		exit(EXIT_FAILURE);
      	}
      
      	munmap(buf, expected);
      	close(fd);
      	free(vec);
      	exit(EXIT_SUCCESS);
      }
      
      Signed-off-by: default avatarMel Gorman <mgorman@suse.de>
      Reported-by: default avatarRob van der Heij <rvdheij@gmail.com>
      Tested-by: default avatarRob van der Heij <rvdheij@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      67d46b29
    • Cliff Wickman's avatar
      mm: export mmu notifier invalidates · fa794199
      Cliff Wickman authored
      
      
      We at SGI have a need to address some very high physical address ranges
      with our GRU (global reference unit), sometimes across partitioned
      machine boundaries and sometimes with larger addresses than the cpu
      supports.  We do this with the aid of our own 'extended vma' module
      which mimics the vma.  When something (either unmap or exit) frees an
      'extended vma' we use the mmu notifiers to clean them up.
      
      We had been able to mimic the functions
      __mmu_notifier_invalidate_range_start() and
      __mmu_notifier_invalidate_range_end() by locking the per-mm lock and
      walking the per-mm notifier list.  But with the change to a global srcu
      lock (static in mmu_notifier.c) we can no longer do that.  Our module has
      no access to that lock.
      
      So we request that these two functions be exported.
      
      Signed-off-by: default avatarCliff Wickman <cpw@sgi.com>
      Acked-by: default avatarRobin Holt <holt@sgi.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fa794199
    • Michel Lespinasse's avatar
      mm: accelerate mm_populate() treatment of THP pages · 240aadee
      Michel Lespinasse authored
      
      
      This change adds a follow_page_mask function which is equivalent to
      follow_page, but with an extra page_mask argument.
      
      follow_page_mask sets *page_mask to HPAGE_PMD_NR - 1 when it encounters
      a THP page, and to 0 in other cases.
      
      __get_user_pages() makes use of this in order to accelerate populating
      THP ranges - that is, when both the pages and vmas arrays are NULL, we
      don't need to iterate HPAGE_PMD_NR times to cover a single THP page (and
      we also avoid taking mm->page_table_lock that many times).
      
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      240aadee
    • Michel Lespinasse's avatar
      mm: use long type for page counts in mm_populate() and get_user_pages() · 28a35716
      Michel Lespinasse authored
      
      
      Use long type for page counts in mm_populate() so as to avoid integer
      overflow when running the following test code:
      
      int main(void) {
        void *p = mmap(NULL, 0x100000000000, PROT_READ,
                       MAP_PRIVATE | MAP_ANON, -1, 0);
        printf("p: %p\n", p);
        mlockall(MCL_CURRENT);
        printf("done\n");
        return 0;
      }
      
      Signed-off-by: default avatarMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      28a35716
    • Zhang Yanfei's avatar
      mm: accurately document nr_free_*_pages functions with code comments · e0fb5815
      Zhang Yanfei authored
      
      
      nr_free_zone_pages(), nr_free_buffer_pages() and nr_free_pagecache_pages()
      are horribly badly named, so accurately document them with code comments
      in case of the misuse of them.
      
      [akpm@linux-foundation.org: tweak comments]
      Reviewed-by: default avatarRandy Dunlap <rdunlap@infradead.org>
      Signed-off-by: default avatarZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e0fb5815
Loading