  1. Oct 17, 2013
    • mm: revert mremap pud_free anti-fix · 57a8f0cd
      Hugh Dickins authored
      
      
      Revert commit 1ecfd533 ("mm/mremap.c: call pud_free() after fail
      calling pmd_alloc()").
      
      The original code was correct: pud_alloc(), pmd_alloc(), pte_alloc_map()
      ensure that the pud, pmd, pt is already allocated, and seldom do they
      need to allocate; on failure, upper levels are freed if appropriate by
      the subsequent do_munmap().  Whereas commit 1ecfd533 did an
      unconditional pud_free() of a most-likely still-in-use pud: saved only
      by the near-impossibility of pmd_alloc() failing.
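
The point about get-or-allocate helpers can be illustrated with a minimal user-space sketch. The toy_pgd, toy_pud and toy_pmd types and the "fail" switch are hypothetical stand-ins invented for illustration, not kernel types. Because the helpers usually return a level that already exists, the failure path must leave the upper level alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins for page-table levels (hypothetical, not kernel types). */
struct toy_pmd { int entries; };
struct toy_pud { struct toy_pmd *pmd; };
struct toy_pgd { struct toy_pud *pud; };

/* Return the existing pud, allocating one only if it is missing. */
static struct toy_pud *toy_pud_alloc(struct toy_pgd *pgd)
{
	if (!pgd->pud)
		pgd->pud = calloc(1, sizeof(*pgd->pud));
	return pgd->pud;
}

/* Same for the pmd; 'fail' simulates an allocation failure. */
static struct toy_pmd *toy_pmd_alloc(struct toy_pud *pud, int fail)
{
	assert(pud != NULL);
	if (!pud->pmd) {
		if (fail)
			return NULL;
		pud->pmd = calloc(1, sizeof(*pud->pmd));
	}
	return pud->pmd;
}

/*
 * On pmd failure just return NULL.  The pud may have existed before this
 * call and may still be in use, so it is left for the later teardown.
 */
static struct toy_pmd *toy_alloc_new_pmd(struct toy_pgd *pgd, int fail)
{
	struct toy_pud *pud = toy_pud_alloc(pgd);

	if (!pud)
		return NULL;
	return toy_pmd_alloc(pud, fail);
}
```

Freeing the pud in the failure branch of toy_alloc_new_pmd() would release a level that the caller's table still points at, which is the shape of the problem the revert removes.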
      
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Chen Gang <gang.chen@asianux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix BUG in __split_huge_page_pmd · 750e8165
      Hugh Dickins authored
      
      
      Occasionally we hit the BUG_ON(pmd_trans_huge(*pmd)) at the end of
      __split_huge_page_pmd(): seen when doing madvise(,,MADV_DONTNEED).
      
      It's invalid: we don't always have down_write of mmap_sem there: a racing
      do_huge_pmd_wp_page() might have copied-on-write to another huge page
      before our split_huge_page() got the anon_vma lock.
      
      Forget the BUG_ON, just go back and try again if this happens.
      
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swap: fix set_blocksize race during swapon/swapoff · 5b808a23
      Krzysztof Kozlowski authored
      
      
      Fix race between swapoff and swapon.  Swapoff used old_block_size from
      swap_info outside of swapon_mutex so it could be overwritten by
      concurrent swapon.
      
      The race has a visible effect only if more than one swap block device
      exists with different block sizes (e.g.  /dev/sda1 with block size 4096
      and /dev/sdb1 with 512).  In that case it leads to setting the wrong
      block size on the swapped-off device.
      
      The bug can be triggered with multiple concurrent swapoff and swapon:
      0. Swap for some device is on.
      1. swapoff:
      First swapoff is called on this device and "struct swap_info_struct
      *p" is assigned. This is done under swap_lock; however, this lock is
      released for the call to try_to_unuse().
      
      2. swapon:
      After the assignment above (and before swapoff acquires swapon_mutex and
      swap_lock), swapon is called on the same device.
      p->old_block_size is set to the block size of the device.
      This block size should be the same as before, but sometimes it is not.
      The swapon ends successfully.
      
      3. swapoff:
      Swapoff resumes, grabs the locks and mutex and continues to disable this
      swap device. Now it sets the block size to the value taken from swap_info,
      which was overwritten by swapon in step 2.
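
The interleaving above can be replayed deterministically in a small user-space sketch. Everything here is hypothetical: toy_swap_info stands in for struct swap_info_struct, and a hook function plays the concurrent swapon at the point where the mutex would already have been dropped. The sketch assumes the cure is to copy old_block_size into a local while swapon_mutex is still held; the actual patch may differ in detail.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct swap_info_struct. */
struct toy_swap_info {
	int old_block_size;
};

/* Called at the point where the real code would have dropped swapon_mutex. */
typedef void (*toy_race_hook)(struct toy_swap_info *p);

/* Simulated concurrent swapon that overwrites the saved block size. */
static void toy_concurrent_swapon(struct toy_swap_info *p)
{
	p->old_block_size = 512;
}

/* Racy shape: the field is read after the lock is gone. */
static int toy_swapoff_racy(struct toy_swap_info *p, toy_race_hook hook)
{
	assert(p != NULL);
	/* ... teardown under the mutex, then unlock ... */
	if (hook)
		hook(p);
	return p->old_block_size;	/* value that would reach set_blocksize() */
}

/* Fixed shape: snapshot the field while the lock is still held. */
static int toy_swapoff_fixed(struct toy_swap_info *p, toy_race_hook hook)
{
	int old_block_size;

	assert(p != NULL);
	old_block_size = p->old_block_size;	/* still under the mutex */
	/* ... unlock ... */
	if (hook)
		hook(p);
	return old_block_size;
}
```

With a device that started at block size 4096, the racy variant hands back 512 once the hook has run, while the fixed variant still returns 4096.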
      
      Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
      Reported-by: Weijie Yang <weijie.yang.kh@gmail.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Shaohua Li <shli@fusionio.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • writeback: fix negative bdi max pause · e3b6c655
      Fengguang Wu authored
      
      
      Toralf runs trinity on UML/i386.  After some time it hangs and the last
      message line is
      
      	BUG: soft lockup - CPU#0 stuck for 22s! [trinity-child0:1521]
      
      It's found that pages_dirtied becomes very large.  More than 1000000000
      pages in this case:
      
      	period = HZ * pages_dirtied / task_ratelimit;
      	BUG_ON(pages_dirtied > 2000000000);
      	BUG_ON(pages_dirtied > 1000000000);      <---------
      
      UML debug printf shows that we got negative pause here:
      
      	ick: pause : -984
      	ick: pages_dirtied : 0
      	ick: task_ratelimit: 0
      
      	 pause:
      	+       if (pause < 0)  {
      	+               extern int printf(char *, ...);
      	+               printf("ick : pause : %li\n", pause);
      	+               printf("ick: pages_dirtied : %lu\n", pages_dirtied);
      	+               printf("ick: task_ratelimit: %lu\n", task_ratelimit);
      	+               BUG_ON(1);
      	+       }
      	        trace_balance_dirty_pages(bdi,
      
      Since pause is bounded by [min_pause, max_pause], where min_pause is also
      bounded by max_pause, it was suspected and then demonstrated that the
      max_pause calculation goes wrong:
      
      	ick: pause : -717
      	ick: min_pause : -177
      	ick: max_pause : -717
      	ick: pages_dirtied : 14
      	ick: task_ratelimit: 0
      
      The problem lies in the two "long = unsigned long" assignments in
      bdi_max_pause(), which might go negative if the highest bit is 1, and the
      min_t(long, ...) check fails to keep the result from falling below 0.  Fix
      all of them by using "unsigned long" throughout the function.
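
The sign problem can be reproduced outside the kernel with a minimal sketch. The toy_min_t() macro, TOY_MAX_PAUSE and both functions are hypothetical simplifications, not the body of bdi_max_pause(). The sketch also assumes the usual two's-complement behaviour of common compilers, where converting an out-of-range unsigned long to long wraps to a negative value (strictly, C leaves that conversion implementation-defined).

```c
#include <assert.h>
#include <limits.h>

/* Simplified local version of a min_t()-style macro for this sketch. */
#define toy_min_t(type, a, b) ((type)(a) < (type)(b) ? (type)(a) : (type)(b))

#define TOY_MAX_PAUSE 200UL	/* hypothetical upper bound */

/* "long = unsigned long": a value with the top bit set turns negative. */
static long toy_max_pause_signed(unsigned long t)
{
	long st = (long)t;	/* wraps negative on common two's-complement ABIs */

	return toy_min_t(long, st, TOY_MAX_PAUSE);	/* negative wins the min */
}

/* Unsigned throughout: the same input is clamped to the bound. */
static unsigned long toy_max_pause_unsigned(unsigned long t)
{
	unsigned long r = toy_min_t(unsigned long, t, TOY_MAX_PAUSE);

	assert(r <= TOY_MAX_PAUSE);
	return r;
}
```

Passing ULONG_MAX / 2 + 1 (only the highest bit set) makes the signed variant return a negative "maximum", whereas the unsigned variant returns the bound.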
      
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      Reported-by: Toralf Förster <toralf.foerster@gmx.de>
      Tested-by: Toralf Förster <toralf.foerster@gmx.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs: buffer: move allocation failure loop into the allocator · 84235de3
      Johannes Weiner authored
      
      
      Buffer allocation has a very crude indefinite loop around waking the
      flusher threads and performing global NOFS direct reclaim because it can
      not handle allocation failures.
      
      The most immediate problem with this is that the allocation may fail due
      to a memory cgroup limit, where flushers + direct reclaim might not make
      any progress towards resolving the situation at all, because, unlike the
      global case, a memory cgroup may not have any cache at all, only
      anonymous pages but no swap.  This situation will lead to a reclaim
      livelock with insane IO from waking the flushers and thrashing unrelated
      filesystem cache in a tight loop.
      
      Use __GFP_NOFAIL allocations for buffers for now.  This makes sure that
      any looping happens in the page allocator, which knows how to
      orchestrate kswapd, direct reclaim, and the flushers sensibly.  It also
      allows memory cgroups to detect allocations that can't handle failure
      and will allow them to ultimately bypass the limit if reclaim can not
      make progress.
      
      Reported-by: azurIt <azurit@pobox.sk>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg: handle non-error OOM situations more gracefully · 49426420
      Johannes Weiner authored
      
      
      Commit 3812c8c8 ("mm: memcg: do not trap chargers with full
      callstack on OOM") assumed that only a few places that can trigger a
      memcg OOM situation do not return VM_FAULT_OOM, like optional page cache
      readahead.  But there are many more and it's impractical to annotate
      them all.
      
      First of all, we don't want to invoke the OOM killer when the failed
      allocation is gracefully handled, so defer the actual kill to the end of
      the fault handling as well.  This simplifies the code quite a bit as an
      added bonus.
      
      Second, since a failed allocation might not be the abrupt end of the
      fault, the memcg OOM handler needs to be re-entrant until the fault
      finishes for subsequent allocation attempts.  If an allocation is
      attempted after the task already OOMed, allow it to bypass the limit so
      that it can quickly finish the fault and invoke the OOM killer.
      
      Reported-by: azurIt <azurit@pobox.sk>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: hugetlb: initialize PG_reserved for tail pages of gigantic compound pages · ef5a22be
      Andrea Arcangeli authored
      
      
      Commit 11feeb49 ("kvm: optimize away THP checks in
      kvm_is_mmio_pfn()") introduced a memory leak when KVM is run on gigantic
      compound pages.
      
      That commit depends on the assumption that PG_reserved is identical for
      all head and tail pages of a compound page, so that if get_user_pages()
      returns a tail page, we don't need to check the head page in order to
      know whether we are dealing with a reserved page that requires different
      refcounting.
      
      The assumption that PG_reserved is the same for head and tail pages is
      certainly correct for THP and regular hugepages, but gigantic hugepages
      allocated through bootmem don't clear the PG_reserved on the tail pages
      (the clearing of PG_reserved is done later only if the gigantic hugepage
      is freed).
      
      This patch corrects the gigantic compound page initialization so that we
      can retain the optimization in 11feeb49.  The cacheline was already
      modified in order to set PG_tail so this won't affect the boot time of
      large memory systems.
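
A minimal sketch of the initialization described above, under stated assumptions: toy_page, the TOY_PG_* flag bits and both functions are hypothetical stand-ins, and the sketch assumes the head page's reserved flag is cleared in the same pass. It only illustrates why clearing the flag on each tail is cheap: the loop already writes every tail page to mark it as a tail.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PG_RESERVED	(1u << 0)
#define TOY_PG_TAIL	(1u << 1)

/* Hypothetical, simplified stand-in for struct page. */
struct toy_page {
	unsigned int flags;
	struct toy_page *first_page;
};

/*
 * Mark pages[1..nr-1] as tails of pages[0].  Each tail is already being
 * written to set the tail flag, so clearing the reserved flag in the same
 * pass touches no extra cache lines.
 */
static void toy_prep_gigantic(struct toy_page *pages, size_t nr)
{
	size_t i;

	assert(pages != NULL && nr > 0);
	pages[0].flags &= ~TOY_PG_RESERVED;	/* assumption: head is cleared too */
	for (i = 1; i < nr; i++) {
		pages[i].flags &= ~TOY_PG_RESERVED;
		pages[i].flags |= TOY_PG_TAIL;
		pages[i].first_page = &pages[0];
	}
}

/* True if every page in the compound agrees with the head on "reserved". */
static int toy_reserved_consistent(const struct toy_page *pages, size_t nr)
{
	size_t i;

	for (i = 1; i < nr; i++)
		if ((pages[i].flags ^ pages[0].flags) & TOY_PG_RESERVED)
			return 0;
	return 1;
}
```

Once toy_reserved_consistent() holds, a caller that receives a tail page can test the reserved flag on that page directly instead of looking up the head.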
      
      [akpm@linux-foundation.org: tweak comment layout and grammar]
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: andy123 <ajs124.ajs124@gmail.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zswap: bugfix: memory leak when re-swapon · aa9bca05
      Weijie Yang authored
      
      
      zswap_tree is not freed on swapoff, and it is kmalloc'ed again on swapon,
      so a memory leak occurs.
      
      Free the memory of zswap_tree in zswap_frontswap_invalidate_area().
      
      Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Cc: <stable@vger.kernel.org>
      From: Weijie Yang <weijie.yang@samsung.com>
      Subject: mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently
      
      Consider the following scenario:
      thread 0: reclaims entry x (gets a refcount, but has not yet called
      	zswap_get_swap_cache_page)
      thread 1: calls zswap_frontswap_invalidate_page to invalidate entry x.
      	When it finishes, entry x and its zbud are not freed because the
      	refcount != 0; now swap_map[x] = 0
      thread 0: now calls zswap_get_swap_cache_page
      	swapcache_prepare returns -ENOENT because entry x is no longer used
      	zswap_get_swap_cache_page returns ZSWAP_SWAPCACHE_NOMEM
      	zswap_writeback_entry does nothing except put the refcount
      Now the memory of zswap_entry x and its zpage leaks.
      
      Modify:
       - check the refcount in the fail path, and free the memory if the entry
         is no longer referenced.
      
       - use ZSWAP_SWAPCACHE_FAIL instead of ZSWAP_SWAPCACHE_NOMEM, as the fail
         path can be caused not only by nomem but also by invalidate.
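
The reference counting in this scenario can be traced with a small user-space sketch. The toy_entry type and its helpers are hypothetical; a "freed" field stands in for releasing the entry and its compressed page. The sketch shows only the rule from the first item: whoever drops the last reference, including the writeback fail path, must free the entry.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a zswap entry. */
struct toy_entry {
	int refcount;
	int freed;	/* set once the entry's memory would be released */
};

static void toy_entry_get(struct toy_entry *e)
{
	e->refcount++;
}

/* Drop one reference and return the remaining count. */
static int toy_entry_put(struct toy_entry *e)
{
	assert(e->refcount > 0);
	return --e->refcount;
}

/* Invalidate: drops the tree's reference, frees only if it was the last. */
static void toy_invalidate(struct toy_entry *e)
{
	if (toy_entry_put(e) == 0)
		e->freed = 1;
}

/* Writeback fail path: the final put must free the entry as well. */
static void toy_writeback_fail(struct toy_entry *e)
{
	if (toy_entry_put(e) == 0)
		e->freed = 1;
}
```

Replaying the scenario: the entry starts with one reference, reclaim takes a second, invalidate drops to one without freeing, and the fail path then drops to zero and frees. A fail path that only decremented the count would leave the entry allocated with no owner.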
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: <stable@vger.kernel.org>
      Acked-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
      
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: migration: do not lose soft dirty bit if page is in migration state · c3d16e16
      Cyrill Gorcunov authored
      
      
      If page migration is turned on in config and the page is migrating, we
      may lose the soft dirty bit.  If fork and mprotect are called on
      migrating pages (once migration is complete) pages do not obtain the
      soft dirty bit in the corresponding pte entries.  Fix it by adding an
      appropriate test on swap entries.
      
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hugetlb.c: correct missing private flag clearing · 16c794b4
      Joonsoo Kim authored
      
      
      We should clear the page's private flag when returning the page to the
      hugepage pool.  Otherwise, a marked hugepage can be allocated to a user
      who tries to allocate a non-reserved hugepage.  If this user fails to
      map this hugepage, he would try to return the page to the hugepage pool.
      Since this page has a private flag, resv_huge_pages would mistakenly
      increase.  This patch fixes this situation.
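
A minimal sketch of the accounting, assuming a toy pool: toy_hpage, toy_pool and toy_free_huge_page() are hypothetical names, and the sketch reduces the free path to the two counters that matter here. It is meant only to show why the flag has to be cleared when the page goes back to the pool.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for a huge page and its pool. */
struct toy_hpage { int private_flag; };
struct toy_pool { long resv_huge_pages; long free_huge_pages; };

/* Return a huge page to the pool. */
static void toy_free_huge_page(struct toy_pool *pool, struct toy_hpage *page)
{
	int restore_reserve;

	assert(pool != NULL && page != NULL);
	restore_reserve = page->private_flag;
	page->private_flag = 0;		/* without this, the mark goes stale */
	if (restore_reserve)
		pool->resv_huge_pages++;
	pool->free_huge_pages++;
}
```

If the flag were left set, a later non-reserved user freeing the same page would bump resv_huge_pages a second time even though no reservation was consumed.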
      
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmscan.c: don't forget to free shrinker->nr_deferred · ae393321
      Andrew Vagin authored
      
      
      This leak was added by commit 1d3d4437 ("vmscan: per-node deferred
      work").
      
      unreferenced object 0xffff88006ada3bd0 (size 8):
        comm "criu", pid 14781, jiffies 4295238251 (age 105.641s)
        hex dump (first 8 bytes):
          00 00 00 00 00 00 00 00                          ........
        backtrace:
          [<ffffffff8170caee>] kmemleak_alloc+0x5e/0xc0
          [<ffffffff811c0527>] __kmalloc+0x247/0x310
          [<ffffffff8117848c>] register_shrinker+0x3c/0xa0
          [<ffffffff811e115b>] sget+0x5ab/0x670
          [<ffffffff812532f4>] proc_mount+0x54/0x170
          [<ffffffff811e1893>] mount_fs+0x43/0x1b0
          [<ffffffff81202dd2>] vfs_kern_mount+0x72/0x110
          [<ffffffff81202e89>] kern_mount_data+0x19/0x30
          [<ffffffff812530a0>] pid_ns_prepare_proc+0x20/0x40
          [<ffffffff81083c56>] alloc_pid+0x466/0x4a0
          [<ffffffff8105aeda>] copy_process+0xc6a/0x1860
          [<ffffffff8105beab>] do_fork+0x8b/0x370
          [<ffffffff8105c1a6>] SyS_clone+0x16/0x20
          [<ffffffff8171f739>] stub_clone+0x69/0x90
          [<ffffffffffffffff>] 0xffffffffffffffff
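
The trace shows the 8-byte object being allocated from register_shrinker(). A minimal sketch of the pairing, assuming the free belongs in the unregister path (the commit message does not spell out where it goes): toy_shrinker and both functions are hypothetical user-space stand-ins using calloc() and free().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for struct shrinker. */
struct toy_shrinker {
	long *nr_deferred;	/* one counter per node */
};

/* Allocate the per-node counters; returns 0 on success, -1 on failure. */
static int toy_register_shrinker(struct toy_shrinker *s, size_t nr_nodes)
{
	assert(s != NULL && nr_nodes > 0);
	s->nr_deferred = calloc(nr_nodes, sizeof(*s->nr_deferred));
	return s->nr_deferred ? 0 : -1;
}

/* Release what registration allocated so repeated mounts do not leak. */
static void toy_unregister_shrinker(struct toy_shrinker *s)
{
	free(s->nr_deferred);
	s->nr_deferred = NULL;
}
```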
      
      Signed-off-by: Andrew Vagin <avagin@openvz.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memcg: protect mem_cgroup_read_events for cpu hotplug · 9c567512
      David Rientjes authored
      
      
      for_each_online_cpu() needs the protection of {get,put}_online_cpus() so
      cpu_online_mask doesn't change during the iteration.
      
      cpu_hotplug.lock is held while a cpu is going down; it's a coarse lock
      that is used kernel-wide to synchronize cpu hotplug activity.  Memcg has
      a cpu hotplug notifier, called while there may not be any cpu hotplug
      refcounts, which drains per-cpu event counts to memcg->nocpu_base.events
      to maintain a cumulative event count as cpus disappear.  Without
      get_online_cpus() in mem_cgroup_read_events(), it's possible to account
      for the event count on a dying cpu twice, and this value may be
      significantly large.
      
      In fact, all memcg->pcp_counter_lock use should be nested by
      {get,put}_online_cpus().
      
      This fixes that issue and ensures the reported statistics are not vastly
      over-reported during cpu hotplug.
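
The double count can be replayed in a single-threaded sketch. All names are hypothetical: toy_memcg models the per-cpu counts plus the nocpu_base accumulator, and a "hold_hotplug" argument stands in for wrapping the read in {get,put}_online_cpus(). When it is 0 the chosen cpu goes offline in the middle of the walk; when it is 1 the offline is deferred until the read has finished.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_CPUS 4

/* Hypothetical, simplified stand-in for the memcg event counters. */
struct toy_memcg {
	unsigned long percpu[TOY_NR_CPUS];
	int online[TOY_NR_CPUS];
	unsigned long nocpu_base;	/* counts drained from offlined cpus */
};

/* Hotplug notifier: fold a dying cpu's count into the base. */
static void toy_cpu_down(struct toy_memcg *m, int cpu)
{
	m->nocpu_base += m->percpu[cpu];
	m->percpu[cpu] = 0;
	m->online[cpu] = 0;
}

/*
 * Sum the counters.  'dying_cpu' (or -1) goes offline during the walk when
 * 'hold_hotplug' is 0, or only after the read completes when it is 1.
 */
static unsigned long toy_read_events(struct toy_memcg *m, int dying_cpu,
				     int hold_hotplug)
{
	unsigned long val = 0;
	int cpu;

	assert(m != NULL && dying_cpu < TOY_NR_CPUS);
	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		if (m->online[cpu])
			val += m->percpu[cpu];
		if (cpu == dying_cpu && !hold_hotplug)
			toy_cpu_down(m, cpu);	/* races with the walk */
	}
	val += m->nocpu_base;
	if (dying_cpu >= 0 && hold_hotplug)
		toy_cpu_down(m, dying_cpu);	/* deferred until after the read */
	return val;
}
```

With per-cpu counts of 5 and 10 and cpu 1 going down, the unprotected read sees cpu 1's count once in the walk and again through nocpu_base; the protected read sees it once.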
      
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. Oct 03, 2013
    • powerpc: Fix memory hotplug with sparse vmemmap · f7e3334a
      Nathan Fontenot authored
      
      
      Previous commit 46723bfa... introduced a new config option
      HAVE_BOOTMEM_INFO_NODE that ended up breaking memory hot-remove for ppc
      when sparse vmemmap is not defined.
      
      This patch defines HAVE_BOOTMEM_INFO_NODE for ppc and adds the call to
      register_page_bootmem_info_node. Without this we get a BUG_ON for memory
      hot remove in put_page_bootmem().
      
      This also adds a stub for register_page_bootmem_memmap to allow ppc to build
      with sparse vmemmap defined. Leaving this as a stub is fine since the same
      vmemmap addresses are also handled in vmemmap_populate and as such are
      properly mapped.
      
      Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      CC: <stable@vger.kernel.org> [v3.9+]
  3. Sep 30, 2013
  4. Sep 28, 2013
    • slab_common: Do not check for duplicate slab names · 3e374919
      Christoph Lameter authored
      
      
      SLUB can alias multiple kmem_cache_create() requests to one slab cache to
      save memory and increase cache hotness. As a result, the name of the slab
      can be stale. Only check the name for duplicates if we are in debug mode,
      where we do not merge multiple caches.
      
      This fixes the following problem reported by Jonathan Brassow:
      
        The problem with kmem_cache* is this:
      
        *) Assume CONFIG_SLUB is set
        1) kmem_cache_create(name="foo-a")
        - creates new kmem_cache structure
        2) kmem_cache_create(name="foo-b")
        - If identical cache characteristics, it will be merged with the previously
          created cache associated with "foo-a".  The cache's refcount will be
          incremented and an alias will be created via sysfs_slab_alias().
        3) kmem_cache_destroy(<ptr>)
        - Attempting to destroy cache associated with "foo-a", but instead the
          refcount is simply decremented.  I don't even think the sysfs aliases are
          ever removed...
        4) kmem_cache_create(name="foo-a")
        - This FAILS because kmem_cache_sanity_check collides with the existing
          name ("foo-a") associated with the non-removed cache.
      
        This is a problem for RAID (specifically dm-raid) because the name used
        for the kmem_cache_create is ("raid%d-%p", level, mddev).  If the cache
        persists for long enough, the memory address of an old mddev will be
        reused for a new mddev - causing an identical formulation of the cache
        name.  Even though kmem_cache_destroy had long ago been used to delete
        the old cache, the merging of caches has caused the name and cache of that
        old instance to be preserved and causes a collision (and thus failure) in
        kmem_cache_create().  I see this regularly in my testing.
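
Steps 1) to 4) of the report can be replayed with a toy registry. Everything below is a hypothetical user-space model, not the slab allocator: caches merge when their object sizes match, a merged cache keeps the first name, and a "check_names" argument stands in for the duplicate-name sanity check that the patch restricts to debug mode.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_CACHES 8

/* Hypothetical, simplified stand-in for a slab cache registry. */
struct toy_cache {
	char name[32];
	size_t object_size;
	int refcount;		/* 0 means the slot is unused */
};

static struct toy_cache toy_caches[TOY_MAX_CACHES];

/*
 * Create a cache, merging with a live cache of the same object size.
 * With 'check_names' set, a name that matches a live cache is rejected.
 */
static struct toy_cache *toy_cache_create(const char *name, size_t size,
					  int check_names)
{
	struct toy_cache *free_slot = NULL;
	int i;

	assert(name != NULL && strlen(name) < sizeof(toy_caches[0].name));
	if (check_names)
		for (i = 0; i < TOY_MAX_CACHES; i++)
			if (toy_caches[i].refcount &&
			    strcmp(toy_caches[i].name, name) == 0)
				return NULL;	/* stale name collides */
	for (i = 0; i < TOY_MAX_CACHES; i++) {
		if (toy_caches[i].refcount &&
		    toy_caches[i].object_size == size) {
			toy_caches[i].refcount++;	/* alias, keeps old name */
			return &toy_caches[i];
		}
		if (!toy_caches[i].refcount && !free_slot)
			free_slot = &toy_caches[i];
	}
	if (!free_slot)
		return NULL;
	strcpy(free_slot->name, name);
	free_slot->object_size = size;
	free_slot->refcount = 1;
	return free_slot;
}

/* Destroy only drops a reference; the cache lives on while aliased. */
static void toy_cache_destroy(struct toy_cache *c)
{
	assert(c != NULL && c->refcount > 0);
	c->refcount--;
}
```

Creating "foo-a" and "foo-b" with the same size yields one merged cache; destroying one reference leaves the name "foo-a" behind, so a fresh "foo-a" request fails while the name check is on and succeeds by merging when it is off.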
      
      Reported-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
  5. Sep 25, 2013
  6. Sep 12, 2013