Skip to content
  1. Jun 02, 2017
    • Michal Hocko's avatar
      mm: consider memblock reservations for deferred memory initialization sizing · 864b9a39
      Michal Hocko authored
      We have seen an early OOM killer invocation on ppc64 systems with
      crashkernel=4096M:
      
      	kthreadd invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=7, order=0, oom_score_adj=0
      	kthreadd cpuset=/ mems_allowed=7
      	CPU: 0 PID: 2 Comm: kthreadd Not tainted 4.4.68-1.gd7fe927-default #1
      	Call Trace:
      	  dump_stack+0xb0/0xf0 (unreliable)
      	  dump_header+0xb0/0x258
      	  out_of_memory+0x5f0/0x640
      	  __alloc_pages_nodemask+0xa8c/0xc80
      	  kmem_getpages+0x84/0x1a0
      	  fallback_alloc+0x2a4/0x320
      	  kmem_cache_alloc_node+0xc0/0x2e0
      	  copy_process.isra.25+0x260/0x1b30
      	  _do_fork+0x94/0x470
      	  kernel_thread+0x48/0x60
      	  kthreadd+0x264/0x330
      	  ret_from_kernel_thread+0x5c/0xa4
      
      	Mem-Info:
      	active_anon:0 inactive_anon:0 isolated_anon:0
      	 active_file:0 inactive_file:0 isolated_file:0
      	 unevictable:0 dirty:0 writeback:0 unstable:0
      	 slab_reclaimable:5 slab_unreclaimable:73
      	 mapped:0 shmem:0 pagetables:0 bounce:0
      	 free:0 free_pcp:0 free_cma:0
      	Node 7 DMA free:0kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:52428800kB managed:110016kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:320kB slab_unreclaimable:4672kB kernel_stack:1152kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
      	lowmem_reserve[]: 0 0 0 0
      	Node 7 DMA: 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 0kB
      	0 total pagecache pages
      	0 pages in swap cache
      	Swap cache stats: add 0, delete 0, find 0/0
      	Free swap  = 0kB
      	Total swap = 0kB
      	819200 pages RAM
      	0 pages HighMem/MovableOnly
      	817481 pages reserved
      	0 pages cma reserved
      	0 pages hwpoisoned
      
      the reason is that the managed memory is too low (only 110MB) while the
      rest of the the 50GB is still waiting for the deferred intialization to
      be done.  update_defer_init estimates the initial memoty to initialize
      to 2GB at least but it doesn't consider any memory allocated in that
      range.  In this particular case we've had
      
      	Reserving 4096MB of memory at 128MB for crashkernel (System RAM: 51200MB)
      
      so the low 2GB is mostly depleted.
      
      Fix this by considering memblock allocations in the initial static
      initialization estimation.  Move the max_initialise to
      reset_deferred_meminit and implement a simple memblock_reserved_memory
      helper which iterates all reserved blocks and sums the size of all that
      start below the given address.  The cumulative size is than added on top
      of the initial estimation.  This is still not ideal because
      reset_deferred_meminit doesn't consider holes and so reservation might
      be above the initial estimation whihch we ignore but let's make the
      logic simpler until we really need to handle more complicated cases.
      
      Fixes: 3a80a7fa ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
      Link: http://lkml.kernel.org/r/20170531104010.GI27783@dhcp22.suse.cz
      
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Tested-by: default avatarSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>	[4.2+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      864b9a39
    • James Morse's avatar
      mm/hugetlb: report -EHWPOISON not -EFAULT when FOLL_HWPOISON is specified · 9a291a7c
      James Morse authored
      KVM uses get_user_pages() to resolve its stage2 faults.  KVM sets the
      FOLL_HWPOISON flag causing faultin_page() to return -EHWPOISON when it
      finds a VM_FAULT_HWPOISON.  KVM handles these hwpoison pages as a
      special case.  (check_user_page_hwpoison())
      
      When huge pages are involved, this doesn't work so well.
      get_user_pages() calls follow_hugetlb_page(), which stops early if it
      receives VM_FAULT_HWPOISON from hugetlb_fault(), eventually returning
      -EFAULT to the caller.  The step to map this to -EHWPOISON based on the
      FOLL_ flags is missing.  The hwpoison special case is skipped, and
      -EFAULT is returned to user-space, causing Qemu or kvmtool to exit.
      
      Instead, move this VM_FAULT_ to errno mapping code into a header file
      and use it from faultin_page() and follow_hugetlb_page().
      
      With this, KVM works as expected.
      
      This isn't a problem for arm64 today as we haven't enabled
      MEMORY_FAILURE, but I can't see any reason this doesn't happen on x86
      too, so I think this should be a fix.  This doesn't apply earlier than
      stable's v4.11.1 due to all sorts of cleanup.
      
      [james.morse@arm.com: add vm_fault_to_errno() call to faultin_page()]
      suggested.
        Link: http://lkml.kernel.org/r/20170525171035.16359-1-james.morse@arm.com
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20170524160900.28786-1-james.morse@arm.com
      
      
      Signed-off-by: default avatarJames Morse <james.morse@arm.com>
      Acked-by: default avatarPunit Agrawal <punit.agrawal@arm.com>
      Acked-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: <stable@vger.kernel.org>	[4.11.1+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9a291a7c
    • Yisheng Xie's avatar
      mlock: fix mlock count can not decrease in race condition · 70feee0e
      Yisheng Xie authored
      Kefeng reported that when running the follow test, the mlock count in
      meminfo will increase permanently:
      
       [1] testcase
       linux:~ # cat test_mlockal
       grep Mlocked /proc/meminfo
        for j in `seq 0 10`
        do
       	for i in `seq 4 15`
       	do
       		./p_mlockall >> log &
       	done
       	sleep 0.2
       done
       # wait some time to let mlock counter decrease and 5s may not enough
       sleep 5
       grep Mlocked /proc/meminfo
      
       linux:~ # cat p_mlockall.c
       #include <sys/mman.h>
       #include <stdlib.h>
       #include <stdio.h>
      
       #define SPACE_LEN	4096
      
       int main(int argc, char ** argv)
       {
      	 	int ret;
      	 	void *adr = malloc(SPACE_LEN);
      	 	if (!adr)
      	 		return -1;
      
      	 	ret = mlockall(MCL_CURRENT | MCL_FUTURE);
      	 	printf("mlcokall ret = %d\n", ret);
      
      	 	ret = munlockall();
      	 	printf("munlcokall ret = %d\n", ret);
      
      	 	free(adr);
      	 	return 0;
      	 }
      
      In __munlock_pagevec() we should decrement NR_MLOCK for each page where
      we clear the PageMlocked flag.  Commit 1ebb7cc6 ("mm: munlock: batch
      NR_MLOCK zone state updates") has introduced a bug where we don't
      decrement NR_MLOCK for pages where we clear the flag, but fail to
      isolate them from the lru list (e.g.  when the pages are on some other
      cpu's percpu pagevec).  Since PageMlocked stays cleared, the NR_MLOCK
      accounting gets permanently disrupted by this.
      
      Fix it by counting the number of page whose PageMlock flag is cleared.
      
      Fixes: 1ebb7cc6 (" mm: munlock: batch NR_MLOCK zone state updates")
      Link: http://lkml.kernel.org/r/1495678405-54569-1-git-send-email-xieyisheng1@huawei.com
      
      
      Signed-off-by: default avatarYisheng Xie <xieyisheng1@huawei.com>
      Reported-by: default avatarKefeng Wang <wangkefeng.wang@huawei.com>
      Tested-by: default avatarKefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joern Engel <joern@logfs.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: zhongjiang <zhongjiang@huawei.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      70feee0e
    • Punit Agrawal's avatar
      mm/migrate: fix refcount handling when !hugepage_migration_supported() · 30809f55
      Punit Agrawal authored
      On failing to migrate a page, soft_offline_huge_page() performs the
      necessary update to the hugepage ref-count.
      
      But when !hugepage_migration_supported() , unmap_and_move_hugepage()
      also decrements the page ref-count for the hugepage.  The combined
      behaviour leaves the ref-count in an inconsistent state.
      
      This leads to soft lockups when running the overcommitted hugepage test
      from mce-tests suite.
      
        Soft offlining pfn 0x83ed600 at process virtual address 0x400000000000
        soft offline: 0x83ed600: migration failed 1, type 1fffc00000008008 (uptodate|head)
        INFO: rcu_preempt detected stalls on CPUs/tasks:
         Tasks blocked on level-0 rcu_node (CPUs 0-7): P2715
          (detected by 7, t=5254 jiffies, g=963, c=962, q=321)
          thugetlb_overco R  running task        0  2715   2685 0x00000008
          Call trace:
            dump_backtrace+0x0/0x268
            show_stack+0x24/0x30
            sched_show_task+0x134/0x180
            rcu_print_detail_task_stall_rnp+0x54/0x7c
            rcu_check_callbacks+0xa74/0xb08
            update_process_times+0x34/0x60
            tick_sched_handle.isra.7+0x38/0x70
            tick_sched_timer+0x4c/0x98
            __hrtimer_run_queues+0xc0/0x300
            hrtimer_interrupt+0xac/0x228
            arch_timer_handler_phys+0x3c/0x50
            handle_percpu_devid_irq+0x8c/0x290
            generic_handle_irq+0x34/0x50
            __handle_domain_irq+0x68/0xc0
            gic_handle_irq+0x5c/0xb0
      
      Address this by changing the putback_active_hugepage() in
      soft_offline_huge_page() to putback_movable_pages().
      
      This only triggers on systems that enable memory failure handling
      (ARCH_SUPPORTS_MEMORY_FAILURE) but not hugepage migration
      (!ARCH_ENABLE_HUGEPAGE_MIGRATION).
      
      I imagine this wasn't triggered as there aren't many systems running
      this configuration.
      
      [akpm@linux-foundation.org: remove dead comment, per Naoya]
      Link: http://lkml.kernel.org/r/20170525135146.32011-1-punit.agrawal@arm.com
      
      
      Reported-by: default avatarManoj Iyer <manoj.iyer@canonical.com>
      Tested-by: default avatarManoj Iyer <manoj.iyer@canonical.com>
      Suggested-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarPunit Agrawal <punit.agrawal@arm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>	[3.14+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      30809f55
    • Ross Zwisler's avatar
      mm: avoid spurious 'bad pmd' warning messages · d0f0931d
      Ross Zwisler authored
      When the pmd_devmap() checks were added by 5c7fb56e ("mm, dax:
      dax-pmd vs thp-pmd vs hugetlbfs-pmd") to add better support for DAX huge
      pages, they were all added to the end of if() statements after existing
      pmd_trans_huge() checks.  So, things like:
      
        -       if (pmd_trans_huge(*pmd))
        +       if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
      
      When further checks were added after pmd_trans_unstable() checks by
      commit 7267ec00 ("mm: postpone page table allocation until we have
      page to map") they were also added at the end of the conditional:
      
        +       if (pmd_trans_unstable(fe->pmd) || pmd_devmap(*fe->pmd))
      
      This ordering is fine for pmd_trans_huge(), but doesn't work for
      pmd_trans_unstable().  This is because DAX huge pages trip the bad_pmd()
      check inside of pmd_none_or_trans_huge_or_clear_bad() (called by
      pmd_trans_unstable()), which prints out a warning and returns 1.  So, we
      do end up doing the right thing, but only after spamming dmesg with
      suspicious looking messages:
      
        mm/pgtable-generic.c:39: bad pmd ffff8808daa49b88(84000001006000a5)
      
      Reorder these checks in a helper so that pmd_devmap() is checked first,
      avoiding the error messages, and add a comment explaining why the
      ordering is important.
      
      Fixes: commit 7267ec00 ("mm: postpone page table allocation until we have page to map")
      Link: http://lkml.kernel.org/r/20170522215749.23516-1-ross.zwisler@linux.intel.com
      
      
      Signed-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Cc: Pawel Lebioda <pawel.lebioda@intel.com>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Xiong Zhou <xzhou@redhat.com>
      Cc: Eryu Guan <eguan@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d0f0931d
    • Tetsuo Handa's avatar
      mm/page_alloc.c: make sure OOM victim can try allocations with no watermarks once · c288983d
      Tetsuo Handa authored
      Roman Gushchin has reported that the OOM killer can trivially selects
      next OOM victim when a thread doing memory allocation from page fault
      path was selected as first OOM victim.
      
          allocate invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
          allocate cpuset=/ mems_allowed=0
          CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
          Call Trace:
           oom_kill_process+0x219/0x3e0
           out_of_memory+0x11d/0x480
           __alloc_pages_slowpath+0xc84/0xd40
           __alloc_pages_nodemask+0x245/0x260
           alloc_pages_vma+0xa2/0x270
           __handle_mm_fault+0xca9/0x10c0
           handle_mm_fault+0xf3/0x210
           __do_page_fault+0x240/0x4e0
           trace_do_page_fault+0x37/0xe0
           do_async_page_fault+0x19/0x70
           async_page_fault+0x28/0x30
          ...
          Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
          Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB
          allocate: page allocation failure: order:0, mode:0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null)
          allocate cpuset=/ mems_allowed=0
          CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
          Call Trace:
           __alloc_pages_slowpath+0xd32/0xd40
           __alloc_pages_nodemask+0x245/0x260
           alloc_pages_vma+0xa2/0x270
           __handle_mm_fault+0xca9/0x10c0
           handle_mm_fault+0xf3/0x210
           __do_page_fault+0x240/0x4e0
           trace_do_page_fault+0x37/0xe0
           do_async_page_fault+0x19/0x70
           async_page_fault+0x28/0x30
          ...
          oom_reaper: reaped process 492 (allocate), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
          ...
          allocate invoked oom-killer: gfp_mask=0x0(), nodemask=(null),  order=0, oom_score_adj=0
          allocate cpuset=/ mems_allowed=0
          CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
          Call Trace:
           oom_kill_process+0x219/0x3e0
           out_of_memory+0x11d/0x480
           pagefault_out_of_memory+0x68/0x80
           mm_fault_error+0x8f/0x190
           ? handle_mm_fault+0xf3/0x210
           __do_page_fault+0x4b2/0x4e0
           trace_do_page_fault+0x37/0xe0
           do_async_page_fault+0x19/0x70
           async_page_fault+0x28/0x30
          ...
          Out of memory: Kill process 233 (firewalld) score 10 or sacrifice child
          Killed process 233 (firewalld) total-vm:246076kB, anon-rss:20956kB, file-rss:0kB, shmem-rss:0kB
      
      There is a race window that the OOM reaper completes reclaiming the
      first victim's memory while nothing but mutex_trylock() prevents the
      first victim from calling out_of_memory() from pagefault_out_of_memory()
      after memory allocation for page fault path failed due to being selected
      as an OOM victim.
      
      This is a side effect of commit 9a67f648 ("mm: consolidate
      GFP_NOFAIL checks in the allocator slowpath") because that commit
      silently changed the behavior from
      
          /* Avoid allocations with no watermarks from looping endlessly */
      
      to
      
          /*
           * Give up allocations without trying memory reserves if selected
           * as an OOM victim
           */
      
      in __alloc_pages_slowpath() by moving the location to check TIF_MEMDIE
      flag.  I have noticed this change but I didn't post a patch because I
      thought it is an acceptable change other than noise by warn_alloc()
      because !__GFP_NOFAIL allocations are allowed to fail.  But we
      overlooked that failing memory allocation from page fault path makes
      difference due to the race window explained above.
      
      While it might be possible to add a check to pagefault_out_of_memory()
      that prevents the first victim from calling out_of_memory() or remove
      out_of_memory() from pagefault_out_of_memory(), changing
      pagefault_out_of_memory() does not suppress noise by warn_alloc() when
      allocating thread was selected as an OOM victim.  There is little point
      with printing similar backtraces and memory information from both
      out_of_memory() and warn_alloc().
      
      Instead, if we guarantee that current thread can try allocations with no
      watermarks once when current thread looping inside
      __alloc_pages_slowpath() was selected as an OOM victim, we can follow "who
      can use memory reserves" rules and suppress noise by warn_alloc() and
      prevent memory allocations from page fault path from calling
      pagefault_out_of_memory().
      
      If we take the comment literally, this patch would do
      
        -    if (test_thread_flag(TIF_MEMDIE))
        -        goto nopage;
        +    if (alloc_flags == ALLOC_NO_WATERMARKS || (gfp_mask & __GFP_NOMEMALLOC))
        +        goto nopage;
      
      because gfp_pfmemalloc_allowed() returns false if __GFP_NOMEMALLOC is
      given.  But if I recall correctly (I couldn't find the message), the
      condition is meant to apply to only OOM victims despite the comment.
      Therefore, this patch preserves TIF_MEMDIE check.
      
      Fixes: 9a67f648 ("mm: consolidate GFP_NOFAIL checks in the allocator slowpath")
      Link: http://lkml.kernel.org/r/201705192112.IAF69238.OQOHSJLFOFFMtV@I-love.SAKURA.ne.jp
      
      
      Signed-off-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: default avatarRoman Gushchin <guro@fb.com>
      Tested-by: default avatarRoman Gushchin <guro@fb.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>	[4.11]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c288983d
    • Thomas Gleixner's avatar
      slub/memcg: cure the brainless abuse of sysfs attributes · 478fe303
      Thomas Gleixner authored
      memcg_propagate_slab_attrs() abuses the sysfs attribute file functions
      to propagate settings from the root kmem_cache to a newly created
      kmem_cache.  It does that with:
      
           attr->show(root, buf);
           attr->store(new, buf, strlen(bug);
      
      Aside of being a lazy and absurd hackery this is broken because it does
      not check the return value of the show() function.
      
      Some of the show() functions return 0 w/o touching the buffer.  That
      means in such a case the store function is called with the stale content
      of the previous show().  That causes nonsense like invoking
      kmem_cache_shrink() on a newly created kmem_cache.  In the worst case it
      would cause handing in an uninitialized buffer.
      
      This should be rewritten proper by adding a propagate() callback to
      those slub_attributes which must be propagated and avoid that insane
      conversion to and from ASCII, but that's too large for a hot fix.
      
      Check at least the return value of the show() function, so calling
      store() with stale content is prevented.
      
      Steven said:
       "It can cause a deadlock with get_online_cpus() that has been uncovered
        by recent cpu hotplug and lockdep changes that Thomas and Peter have
        been doing.
      
           Possible unsafe locking scenario:
      
                 CPU0                    CPU1
                 ----                    ----
            lock(cpu_hotplug.lock);
                                         lock(slab_mutex);
                                         lock(cpu_hotplug.lock);
            lock(slab_mutex);
      
           *** DEADLOCK ***"
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1705201244540.2255@nanos
      
      
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reported-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      478fe303
    • Michal Hocko's avatar
      mm: clarify why we want kmalloc before falling backto vmallock · 4f4f2ba9
      Michal Hocko authored
      While converting drm_[cm]alloc* helpers to kvmalloc* variants Chris
      Wilson has wondered why we want to try kmalloc before vmalloc fallback
      even for larger allocations requests.  Let's clarify that one larger
      physically contiguous block is less likely to fragment memory than many
      scattered pages which can prevent more large blocks from being created.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20170517080932.21423-1-mhocko@kernel.org
      
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Suggested-by: default avatarChris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: default avatarChris Wilson <chris@chris-wilson.co.uk>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4f4f2ba9
    • Andrea Arcangeli's avatar
      ksm: prevent crash after write_protect_page fails · a7306c34
      Andrea Arcangeli authored
      "err" needs to be left set to -EFAULT if split_huge_page succeeds.
      Otherwise if "err" gets clobbered with zero and write_protect_page
      fails, try_to_merge_one_page() will succeed instead of returning -EFAULT
      and then try_to_merge_with_ksm_page() will continue thinking kpage is a
      PageKsm when in fact it's still an anonymous page.  Eventually it'll
      crash in page_add_anon_rmap.
      
      This has been reproduced on Fedora25 kernel but I can reproduce with
      upstream too.
      
      The bug was introduced in commit f765f540 ("ksm: prepare to new THP
      semantics") introduced in v4.5.
      
          page:fffff67546ce1cc0 count:4 mapcount:2 mapping:ffffa094551e36e1 index:0x7f0f46673
          flags: 0x2ffffc0004007c(referenced|uptodate|dirty|lru|active|swapbacked)
          page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
          page->mem_cgroup:ffffa09674bf0000
          ------------[ cut here ]------------
          kernel BUG at mm/rmap.c:1222!
          CPU: 1 PID: 76 Comm: ksmd Not tainted 4.9.3-200.fc25.x86_64 #1
          RIP: do_page_add_anon_rmap+0x1c4/0x240
          Call Trace:
            page_add_anon_rmap+0x18/0x20
            try_to_merge_with_ksm_page+0x50b/0x780
            ksm_scan_thread+0x1211/0x1410
            ? prepare_to_wait_event+0x100/0x100
            ? try_to_merge_with_ksm_page+0x780/0x780
            kthread+0xd9/0xf0
            ? kthread_park+0x60/0x60
            ret_from_fork+0x25/0x30
      
      Fixes: f765f540 ("ksm: prepare to new THP semantics")
      Link: http://lkml.kernel.org/r/20170513131040.21732-1-aarcange@redhat.com
      
      
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: default avatarFederico Simoncelli <fsimonce@redhat.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a7306c34
  2. May 12, 2017
    • Minchan Kim's avatar
      mm: vmscan: scan until it finds eligible pages · 791b48b6
      Minchan Kim authored
      Although there are a ton of free swap and anonymous LRU page in elgible
      zones, OOM happened.
      
        balloon invoked oom-killer: gfp_mask=0x17080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO|__GFP_NOTRACK), nodemask=(null),  order=0, oom_score_adj=0
        CPU: 7 PID: 1138 Comm: balloon Not tainted 4.11.0-rc6-mm1-zram-00289-ge228d67e9677-dirty #17
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
        Call Trace:
         oom_kill_process+0x21d/0x3f0
         out_of_memory+0xd8/0x390
         __alloc_pages_slowpath+0xbc1/0xc50
         __alloc_pages_nodemask+0x1a5/0x1c0
         pte_alloc_one+0x20/0x50
         __pte_alloc+0x1e/0x110
         __handle_mm_fault+0x919/0x960
         handle_mm_fault+0x77/0x120
         __do_page_fault+0x27a/0x550
         trace_do_page_fault+0x43/0x150
         do_async_page_fault+0x2c/0x90
         async_page_fault+0x28/0x30
        Mem-Info:
        active_anon:424716 inactive_anon:65314 isolated_anon:0
         active_file:52 inactive_file:46 isolated_file:0
         unevictable:0 dirty:27 writeback:0 unstable:0
         slab_reclaimable:3967 slab_unreclaimable:4125
         mapped:133 shmem:43 pagetables:1674 bounce:0
         free:4637 free_pcp:225 free_cma:0
        Node 0 active_anon:1698864kB inactive_anon:261256kB active_file:208kB inactive_file:184kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:532kB dirty:108kB writeback:0kB shmem:172kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
        DMA free:7316kB min:32kB low:44kB high:56kB active_anon:8064kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:464kB slab_unreclaimable:40kB kernel_stack:0kB pagetables:24kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 992 992 1952
        DMA32 free:9088kB min:2048kB low:3064kB high:4080kB active_anon:952176kB inactive_anon:0kB active_file:36kB inactive_file:0kB unevictable:0kB writepending:88kB present:1032192kB managed:1019388kB mlocked:0kB slab_reclaimable:13532kB slab_unreclaimable:16460kB kernel_stack:3552kB pagetables:6672kB bounce:0kB free_pcp:56kB local_pcp:24kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 959
        Movable free:3644kB min:1980kB low:2960kB high:3940kB active_anon:738560kB inactive_anon:261340kB active_file:188kB inactive_file:640kB unevictable:0kB writepending:20kB present:1048444kB managed:1010816kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:832kB local_pcp:60kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 1*4kB (E) 0*8kB 18*16kB (E) 10*32kB (E) 10*64kB (E) 9*128kB (ME) 8*256kB (E) 2*512kB (E) 2*1024kB (E) 0*2048kB 0*4096kB = 7524kB
        DMA32: 417*4kB (UMEH) 181*8kB (UMEH) 68*16kB (UMEH) 48*32kB (UMEH) 14*64kB (MH) 3*128kB (M) 1*256kB (H) 1*512kB (M) 2*1024kB (M) 0*2048kB 0*4096kB = 9836kB
        Movable: 1*4kB (M) 1*8kB (M) 1*16kB (M) 1*32kB (M) 0*64kB 1*128kB (M) 2*256kB (M) 4*512kB (M) 1*1024kB (M) 0*2048kB 0*4096kB = 3772kB
        378 total pagecache pages
        17 pages in swap cache
        Swap cache stats: add 17325, delete 17302, find 0/27
        Free swap  = 978940kB
        Total swap = 1048572kB
        524157 pages RAM
        0 pages HighMem/MovableOnly
        12629 pages reserved
        0 pages cma reserved
        0 pages hwpoisoned
        [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
        [  433]     0   433     4904        5      14       3       82             0 upstart-udev-br
        [  438]     0   438    12371        5      27       3      191         -1000 systemd-udevd
      
      With investigation, skipping page of isolate_lru_pages makes reclaim
      void because it returns zero nr_taken easily so LRU shrinking is
      effectively nothing and just increases priority aggressively.  Finally,
      OOM happens.
      
      The problem is that get_scan_count determines nr_to_scan with eligible
      zones so although priority drops to zero, it couldn't reclaim any pages
      if the LRU contains mostly ineligible pages.
      
      get_scan_count:
      
              size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
      	size = size >> sc->priority;
      
      Assumes sc->priority is 0 and LRU list is as follows.
      
      	N-N-N-N-H-H-H-H-H-H-H-H-H-H-H-H-H-H-H-H
      
      (Ie, small eligible pages are in the head of LRU but others are
       almost ineligible pages)
      
      In that case, size becomes 4 so VM want to scan 4 pages but 4 pages from
      tail of the LRU are not eligible pages.  If get_scan_count counts
      skipped pages, it doesn't reclaim any pages remained after scanning 4
      pages so it ends up OOM happening.
      
      This patch makes isolate_lru_pages try to scan pages until it encounters
      eligible zones's pages.
      
      [akpm@linux-foundation.org: clean up mind-bending `for' statement.  Tweak comment text]
      Fixes: 3db65812 ("Revert "mm, vmscan: account for skipped pages as a partial scan"")
      Link: http://lkml.kernel.org/r/1494457232-27401-1-git-send-email-minchan@kernel.org
      
      
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      791b48b6
    • David Rientjes's avatar
      mm, thp: copying user pages must schedule on collapse · 338a16ba
      David Rientjes authored
      We have encountered need_resched warnings in __collapse_huge_page_copy()
      while doing {clear,copy}_user_highpage() over HPAGE_PMD_NR source pages.
      
      mm->mmap_sem is held for write, but the iteration is well bounded.
      
      Reschedule as needed.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705101426380.109808@chino.kir.corp.google.com
      
      
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      338a16ba
    • Jan Kara's avatar
      mm: fix data corruption due to stale mmap reads · cd656375
      Jan Kara authored
      Currently, we didn't invalidate page tables during invalidate_inode_pages2()
      for DAX.  That could result in e.g. 2MiB zero page being mapped into
      page tables while there were already underlying blocks allocated and
      thus data seen through mmap were different from data seen by read(2).
      The following sequence reproduces the problem:
      
       - open an mmap over a 2MiB hole
      
       - read from a 2MiB hole, faulting in a 2MiB zero page
      
       - write to the hole with write(3p). The write succeeds but we
         incorrectly leave the 2MiB zero page mapping intact.
      
       - via the mmap, read the data that was just written. Since the zero
         page mapping is still intact we read back zeroes instead of the new
         data.
      
      Fix the problem by unconditionally calling invalidate_inode_pages2_range()
      in dax_iomap_actor() for new block allocations and by properly
      invalidating page tables in invalidate_inode_pages2_range() for DAX
      mappings.
      
      Fixes: c6dcf52c
      Link: http://lkml.kernel.org/r/20170510085419.27601-3-jack@suse.cz
      
      
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cd656375
    • Ross Zwisler's avatar
      dax: prevent invalidation of mapped DAX entries · 4636e70b
      Ross Zwisler authored
      Patch series "mm,dax: Fix data corruption due to mmap inconsistency",
      v4.
      
      This series fixes data corruption that can happen for DAX mounts when
      page faults race with write(2) and as a result page tables get out of
      sync with block mappings in the filesystem and thus data seen through
      mmap is different from data seen through read(2).
      
      The series passes testing with t_mmap_stale test program from Ross and
      also other mmap related tests on DAX filesystem.
      
      This patch (of 4):
      
      dax_invalidate_mapping_entry() currently removes DAX exceptional entries
      only if they are clean and unlocked.  This is done via:
      
        invalidate_mapping_pages()
          invalidate_exceptional_entry()
            dax_invalidate_mapping_entry()
      
      However, for page cache pages removed in invalidate_mapping_pages()
      there is an additional criteria which is that the page must not be
      mapped.  This is noted in the comments above invalidate_mapping_pages()
      and is checked in invalidate_inode_page().
      
      For DAX entries this means that we can can end up in a situation where a
      DAX exceptional entry, either a huge zero page or a regular DAX entry,
      could end up mapped but without an associated radix tree entry.  This is
      inconsistent with the rest of the DAX code and with what happens in the
      page cache case.
      
      We aren't able to unmap the DAX exceptional entry because according to
      its comments invalidate_mapping_pages() isn't allowed to block, and
      unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem.
      
      Since we essentially never have unmapped DAX entries to evict from the
      radix tree, just remove dax_invalidate_mapping_entry().
      
      Fixes: c6dcf52c ("mm: Invalidate DAX radix tree entries only if appropriate")
      Link: http://lkml.kernel.org/r/20170510085419.27601-2-jack@suse.cz
      
      
      Signed-off-by: default avatarRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Reported-by: default avatarJan Kara <jack@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>    [4.10+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4636e70b
    • Michal Hocko's avatar
      mm, vmalloc: fix vmalloc users tracking properly · 8594a21c
      Michal Hocko authored
      Commit 1f5307b1 ("mm, vmalloc: properly track vmalloc users") has
      pulled asm/pgtable.h include dependency to linux/vmalloc.h and that
      turned out to be a bad idea for some architectures.  E.g.  m68k fails
      with
      
         In file included from arch/m68k/include/asm/pgtable_mm.h:145:0,
                          from arch/m68k/include/asm/pgtable.h:4,
                          from include/linux/vmalloc.h:9,
                          from arch/m68k/kernel/module.c:9:
         arch/m68k/include/asm/mcf_pgtable.h: In function 'nocache_page':
      >> arch/m68k/include/asm/mcf_pgtable.h:339:43: error: 'init_mm' undeclared (first use in this function)
          #define pgd_offset_k(address) pgd_offset(&init_mm, address)
      
      as spotted by kernel build bot. nios2 fails for other reason
      
        In file included from include/asm-generic/io.h:767:0,
                         from arch/nios2/include/asm/io.h:61,
                         from include/linux/io.h:25,
                         from arch/nios2/include/asm/pgtable.h:18,
                         from include/linux/mm.h:70,
                         from include/linux/pid_namespace.h:6,
                         from include/linux/ptrace.h:9,
                         from arch/nios2/include/uapi/asm/elf.h:23,
                         from arch/nios2/include/asm/elf.h:22,
                         from include/linux/elf.h:4,
                         from include/linux/module.h:15,
                         from init/main.c:16:
        include/linux/vmalloc.h: In function '__vmalloc_node_flags':
        include/linux/vmalloc.h:99:40: error: 'PAGE_KERNEL' undeclared (first use in this function); did you mean 'GFP_KERNEL'?
      
      which is due to the newly added #include <asm/pgtable.h>, which on nios2
      includes <linux/io.h> and thus <asm/io.h> and <asm-generic/io.h> which
      again includes <linux/vmalloc.h>.
      
      Tweaking that around just turns out a bigger headache than necessary.
      This patch reverts 1f5307b1 and reimplements the original fix in a
      different way.  __vmalloc_node_flags can stay static inline which will
      cover vmalloc* functions.  We only have one external user
      (kvmalloc_node) and we can export __vmalloc_node_flags_caller and
      provide the caller directly.  This is much simpler and it doesn't really
      need any games with header files.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [mhocko@kernel.org: revert old comment]
        Link: http://lkml.kernel.org/r/20170509211054.GB16325@dhcp22.suse.cz
      Fixes: 1f5307b1 ("mm, vmalloc: properly track vmalloc users")
      Link: http://lkml.kernel.org/r/20170509153702.GR6481@dhcp22.suse.cz
      
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Tobias Klauser <tklauser@distanz.ch>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8594a21c
    • SeongJae Park's avatar
      mm/khugepaged: add missed tracepoint for collapse_huge_page_swapin · 835152a2
      SeongJae Park authored
      One return case of `__collapse_huge_page_swapin()` does not invoke
      tracepoint while every other return case does.  This commit adds a
      tracepoint invocation for the case.
      
      Link: http://lkml.kernel.org/r/20170507101813.30187-1-sj38.park@gmail.com
      
      
      Signed-off-by: default avatarSeongJae Park <sj38.park@gmail.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      835152a2
    • Reza Arbab's avatar
      mm, vmstat: Remove spurious WARN() during zoneinfo print · 8d35bb31
      Reza Arbab authored
      After commit e2ecc8a7 ("mm, vmstat: print non-populated zones in
      zoneinfo"), /proc/zoneinfo will show unpopulated zones.
      
      A memoryless node, having no populated zones at all, was previously
      ignored, but will now trigger the WARN() in is_zone_first_populated().
      
      Remove this warning, as its only purpose was to warn of a situation that
      has since been enabled.
      
      Aside: The "per-node stats" are still printed under the first populated
      zone, but that's not necessarily the first stanza any more.  I'm not
      sure which criteria is more important with regard to not breaking
      parsers, but it looks a little weird to the eye.
      
      Fixes:  e2ecc8a7 ("mm, vmstat: print node-based stats in zoneinfo file")
      Link: http://lkml.kernel.org/r/1493854905-10918-1-git-send-email-arbab@linux.vnet.ibm.com
      
      
      Signed-off-by: default avatarReza Arbab <arbab@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8d35bb31
    • Michal Hocko's avatar
      hwpoison, memcg: forcibly uncharge LRU pages · 18365225
      Michal Hocko authored
      Laurent Dufour has noticed that hwpoinsoned pages are kept charged.  In
      his particular case he has hit a bad_page("page still charged to
      cgroup") when onlining a hwpoison page.  While this looks like something
      that shouldn't happen in the first place because onlining hwpages and
      returning them to the page allocator makes only little sense it shows a
      real problem.
      
      hwpoison pages do not get freed usually so we do not uncharge them (at
      least not since commit 0a31bc97 ("mm: memcontrol: rewrite uncharge
      API")).  Each charge pins memcg (since e8ea14cc ("mm: memcontrol:
      take a css reference for each charged page")) as well and so the
      mem_cgroup and the associated state will never go away.  Fix this leak
      by forcibly uncharging a LRU hwpoisoned page in delete_from_lru_cache().
      We also have to tweak uncharge_list because it cannot rely on zero ref
      count for these pages.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Fixes: 0a31bc97 ("mm: memcontrol: rewrite uncharge API")
      Link: http://lkml.kernel.org/r/20170502185507.GB19165@dhcp22.suse.cz
      
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarLaurent Dufour <ldufour@linux.vnet.ibm.com>
      Tested-by: default avatarLaurent Dufour <ldufour@linux.vnet.ibm.com>
      Reviewed-by: default avatarBalbir Singh <bsingharora@gmail.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      18365225
  3. May 11, 2017
  4. May 09, 2017
    • Vlastimil Babka's avatar
      mm: introduce memalloc_noreclaim_{save,restore} · 499118e9
      Vlastimil Babka authored
      The previous patch ("mm: prevent potential recursive reclaim due to
      clearing PF_MEMALLOC") has shown that simply setting and clearing
      PF_MEMALLOC in current->flags can result in wrongly clearing a
      pre-existing PF_MEMALLOC flag and potentially lead to recursive reclaim.
      Let's introduce helpers that support proper nesting by saving the
      previous stat of the flag, similar to the existing memalloc_noio_* and
      memalloc_nofs_* helpers.  Convert existing setting/clearing of
      PF_MEMALLOC within mm to the new helpers.
      
      There are no known issues with the converted code, but the change makes
      it more robust.
      
      Link: http://lkml.kernel.org/r/20170405074700.29871-3-vbabka@suse.cz
      
      
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Suggested-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Boris Brezillon <boris.brezillon@free-electrons.com>
      Cc: Chris Leech <cleech@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Lee Duncan <lduncan@suse.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Richard Weinberger <richard@nod.at>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      499118e9
    • Vlastimil Babka's avatar
      mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC · 62be1511
      Vlastimil Babka authored
      Patch series "more robust PF_MEMALLOC handling"
      
      This series aims to unify the setting and clearing of PF_MEMALLOC, which
      prevents recursive reclaim.  There are some places that clear the flag
      unconditionally from current->flags, which may result in clearing a
      pre-existing flag.  This already resulted in a bug report that Patch 1
      fixes (without the new helpers, to make backporting easier).  Patch 2
      introduces the new helpers, modelled after existing memalloc_noio_* and
      memalloc_nofs_* helpers, and converts mm core to use them.  Patches 3
      and 4 convert non-mm code.
      
      This patch (of 4):
      
      __alloc_pages_direct_compact() sets PF_MEMALLOC to prevent deadlock
      during page migration by lock_page() (see the comment in
      __unmap_and_move()).  Then it unconditionally clears the flag, which can
      clear a pre-existing PF_MEMALLOC flag and result in recursive reclaim.
      This was not a problem until commit a8161d1e ("mm, page_alloc:
      restructure direct compaction handling in slowpath"), because direct
      compation was called only after direct reclaim, which was skipped when
      PF_MEMALLOC flag was set.
      
      Even now it's only a theoretical issue, as the new callsite of
      __alloc_pages_direct_compact() is reached only for costly orders and
      when gfp_pfmemalloc_allowed() is true, which means either
      __GFP_NOMEMALLOC is in gfp_flags or in_interrupt() is true.  There is no
      such known context, but let's play it safe and make
      __alloc_pages_direct_compact() robust for cases where PF_MEMALLOC is
      already set.
      
      Fixes: a8161d1e ("mm, page_alloc: restructure direct compaction handling in slowpath")
      Link: http://lkml.kernel.org/r/20170405074700.29871-2-vbabka@suse.cz
      
      
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Reported-by: default avatarAndrey Ryabinin <aryabinin@virtuozzo.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Boris Brezillon <boris.brezillon@free-electrons.com>
      Cc: Chris Leech <cleech@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Lee Duncan <lduncan@suse.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      62be1511
    • Oliver O'Halloran's avatar
      mm/huge_memory.c: deposit a pgtable for DAX PMD faults when required · 3b6521f5
      Oliver O'Halloran authored
      Although all architectures use a deposited page table for THP on
      anonymous VMAs, some architectures (s390 and powerpc) require the
      deposited storage even for file backed VMAs due to quirks of their MMUs.
      
      This patch adds support for depositing a table in DAX PMD fault handling
      path for archs that require it.  Other architectures should see no
      functional changes.
      
      Link: http://lkml.kernel.org/r/20170411174233.21902-3-oohall@gmail.com
      
      
      Signed-off-by: default avatarOliver O'Halloran <oohall@gmail.com>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: linux-nvdimm@ml01.01.org
      Cc: Oliver O'Halloran <oohall@gmail.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3b6521f5
    • Oliver O'Halloran's avatar
      mm/huge_memory.c: use zap_deposited_table() more · c14a6eb4
      Oliver O'Halloran authored
      Depending on the flags of the PMD being zapped there may or may not be a
      deposited pgtable to be freed.  In two of the three cases this is open
      coded while the third uses the zap_deposited_table() helper.  This patch
      converts the others to use the helper to clean things up a bit.
      
      Link: http://lkml.kernel.org/r/20170411174233.21902-2-oohall@gmail.com
      
      
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: linux-nvdimm@ml01.01.org
      Cc: Oliver O'Halloran <oohall@gmail.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c14a6eb4
    • Tetsuo Handa's avatar
      fs: semove set but not checked AOP_FLAG_UNINTERRUPTIBLE flag · c718a975
      Tetsuo Handa authored
      Commit afddba49 ("fs: introduce write_begin, write_end, and
      perform_write aops") introduced AOP_FLAG_UNINTERRUPTIBLE flag which was
      checked in pagecache_write_begin(), but that check was removed by
      4e02ed4b ("fs: remove prepare_write/commit_write").
      
      Between these two commits, commit d9414774 ("cifs: Convert cifs to
      new aops.") added a check in cifs_write_begin(), but that check was soon
      removed by commit a98ee8c1 ("[CIFS] fix regression in
      cifs_write_begin/cifs_write_end").
      
      Therefore, AOP_FLAG_UNINTERRUPTIBLE flag is checked nowhere.  Let's
      remove this flag.  This patch has no functionality changes.
      
      Link: http://lkml.kernel.org/r/1489294781-53494-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
      
      
      Signed-off-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reviewed-by: default avatarJeff Layton <jlayton@redhat.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Cc: Nick Piggin <npiggin@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c718a975
    • Michal Hocko's avatar
      mm, vmalloc: use __GFP_HIGHMEM implicitly · 19809c2d
      Michal Hocko authored
      __vmalloc* allows users to provide gfp flags for the underlying
      allocation.  This API is quite popular
      
        $ git grep "=[[:space:]]__vmalloc\|return[[:space:]]*__vmalloc" | wc -l
        77
      
      The only problem is that many people are not aware that they really want
      to give __GFP_HIGHMEM along with other flags because there is really no
      reason to consume precious lowmemory on CONFIG_HIGHMEM systems for pages
      which are mapped to the kernel vmalloc space.  About half of users don't
      use this flag, though.  This signals that we make the API unnecessarily
      too complex.
      
      This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to
      be mapped to the vmalloc space.  Current users which add __GFP_HIGHMEM
      are simplified and drop the flag.
      
      Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.org
      
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarMatthew Wilcox <mawilcox@microsoft.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Cristopher Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      19809c2d
    • Huang Ying's avatar
      mm, swap: use kvzalloc to allocate some swap data structures · 54f180d3
      Huang Ying authored
      Now vzalloc() is used in swap code to allocate various data structures,
      such as swap cache, swap slots cache, cluster info, etc.  Because the
      size may be too large on some system, so that normal kzalloc() may fail.
      But using kzalloc() has some advantages, for example, less memory
      fragmentation, less TLB pressure, etc.  So change the data structure
      allocation in swap code to use kvzalloc() which will try kzalloc()
      firstly, and fallback to vzalloc() if kzalloc() failed.
      
      In general, although kmalloc() will reduce the number of high-order
      pages in short term, vmalloc() will cause more pain for memory
      fragmentation in the long term.  And the swap data structure allocation
      that is changed in this patch is expected to be long term allocation.
      
      From Dave Hansen:
       "for example, we have a two-page data structure. vmalloc() takes two
        effectively random order-0 pages, probably from two different 2M pages
        and pins them. That "kills" two 2M pages. kmalloc(), allocating two
        *contiguous* pages, will not cross a 2M boundary. That means it will
        only "kill" the possibility of a single 2M page. More 2M pages == less
        fragmentation.
      
      The allocation in this patch occurs during swap on time, which is
      usually done during system boot, so usually we have high opportunity to
      allocate the contiguous pages successfully.
      
      The allocation for swap_map[] in struct swap_info_struct is not changed,
      because that is usually quite large and vmalloc_to_page() is used for
      it.  That makes it a little harder to change.
      
      Link: http://lkml.kernel.org/r/20170407064911.25447-1-ying.huang@intel.com
      
      
      Signed-off-by: default avatarHuang Ying <ying.huang@intel.com>
      Acked-by: default avatarTim Chen <tim.c.chen@intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      54f180d3
    • Michal Hocko's avatar
      treewide: use kv[mz]alloc* rather than opencoded variants · 752ade68
      Michal Hocko authored
      There are many code paths opencoding kvmalloc.  Let's use the helper
      instead.  The main difference to kvmalloc is that those users are
      usually not considering all the aspects of the memory allocator.  E.g.
      allocation requests <= 32kB (with 4kB pages) are basically never failing
      and invoke OOM killer to satisfy the allocation.  This sounds too
      disruptive for something that has a reasonable fallback - the vmalloc.
      On the other hand those requests might fallback to vmalloc even when the
      memory allocator would succeed after several more reclaim/compaction
      attempts previously.  There is no guarantee something like that happens
      though.
      
      This patch converts many of those places to kv[mz]alloc* helpers because
      they are more conservative.
      
      Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.org
      
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
      Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
      Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
      Acked-by: David Sterba <dsterba@suse.com> # btrfs
      Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
      Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
      Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Anton Vorontsov <anton@enomsg.org>
      Cc: Colin Cross <ccross@android.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Santosh Raspatur <santosh@chelsio.com>
      Cc: Hariprasad S <hariprasad@chelsio.com>
      Cc: Yishai Hadas <yishaih@mellanox.com>
      Cc: Oleg Drokin <oleg.drokin@intel.com>
      Cc: "Yan, Zheng" <zyan@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      752ade68
    • Michal Hocko's avatar
      mm: support __GFP_REPEAT in kvmalloc_node for >32kB · 6c5ab651
      Michal Hocko authored
      vhost code uses __GFP_REPEAT when allocating vhost_virtqueue resp.
      vhost_vsock because it would really like to prefer kmalloc to the
      vmalloc fallback - see 23cc5a99 ("vhost-net: extend device
      allocation to vmalloc") for more context.  Michael Tsirkin has also
      noted:
      
       "__GFP_REPEAT overhead is during allocation time. Using vmalloc means
        all accesses are slowed down. Allocation is not on data path, accesses
        are."
      
      The similar applies to other vhost_kvzalloc users.
      
      Let's teach kvmalloc_node to handle __GFP_REPEAT properly.  There are
      two things to be careful about.  First we should prevent from the OOM
      killer and so have to involve __GFP_NORETRY by default and secondly
      override __GFP_REPEAT for !costly order requests as the __GFP_REPEAT is
      ignored for !costly orders.
      
      Supporting __GFP_REPEAT like semantic for !costly request is possible it
      would require changes in the page allocator.  This is out of scope of
      this patch.
      
      This patch shouldn't introduce any functional change.
      
      Link: http://lkml.kernel.org/r/20170306103032.2540-3-mhocko@kernel.org
      
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6c5ab651
    • Michal Hocko's avatar
      mm, vmalloc: properly track vmalloc users · 1f5307b1
      Michal Hocko authored
      __vmalloc_node_flags used to be static inline but this has changed by
      "mm: introduce kv[mz]alloc helpers" because kvmalloc_node needs to use
      it as well and the code is outside of the vmalloc proper.  I haven't
      realized that changing this will lead to a subtle bug though.  The
      function is responsible to track the caller as well.  This caller is
      then printed by /proc/vmallocinfo.  If __vmalloc_node_flags is not
      inline then we would get only direct users of __vmalloc_node_flags as
      callers (e.g.  v[mz]alloc) which reduces usefulness of this debugging
      feature considerably.  It simply doesn't help to see that the given
      range belongs to vmalloc as a caller:
      
        0xffffc90002c79000-0xffffc90002c7d000   16384 vmalloc+0x16/0x18 pages=3 vmalloc N0=3
        0xffffc90002c81000-0xffffc90002c85000   16384 vmalloc+0x16/0x18 pages=3 vmalloc N1=3
        0xffffc90002c8d000-0xffffc90002c91000   16384 vmalloc+0x16/0x18 pages=3 vmalloc N1=3
        0xffffc90002c95000-0xffffc90002c99000   16384 vmalloc+0x16/0x18 pages=3 vmalloc N1=3
      
      We really want to catch the _caller_ of the vmalloc function.  Fix this
      issue by making __vmalloc_node_flags static inline again.
      
      Link: http://lkml.kernel.org/r/20170502134657.12381-1-mhocko@kernel.org
      
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1f5307b1
    • Michal Hocko's avatar
      mm: introduce kv[mz]alloc helpers · a7c3e901
      Michal Hocko authored
      Patch series "kvmalloc", v5.
      
      There are many open coded kmalloc with vmalloc fallback instances in the
      tree.  Most of them are not careful enough or simply do not care about
      the underlying semantic of the kmalloc/page allocator which means that
      a) some vmalloc fallbacks are basically unreachable because the kmalloc
      part will keep retrying until it succeeds b) the page allocator can
      invoke a really disruptive steps like the OOM killer to move forward
      which doesn't sound appropriate when we consider that the vmalloc
      fallback is available.
      
      As it can be seen implementing kvmalloc requires quite an intimate
      knowledge if the page allocator and the memory reclaim internals which
      strongly suggests that a helper should be implemented in the memory
      subsystem proper.
      
      Most callers, I could find, have been converted to use the helper
      instead.  This is patch 6.  There are some more relying on __GFP_REPEAT
      in the networking stack which I have converted as well and Eric Dumazet
      was not opposed [2] to convert them as well.
      
      [1] http://lkml.kernel.org/r/20170130094940.13546-1-mhocko@kernel.org
      [2] http://lkml.kernel.org/r/1485273626.16328.301.camel@edumazet-glaptop3.roam.corp.google.com
      
      This patch (of 9):
      
      Using kmalloc with the vmalloc fallback for larger allocations is a
      common pattern in the kernel code.  Yet we do not have any common helper
      for that and so users have invented their own helpers.  Some of them are
      really creative when doing so.  Let's just add kv[mz]alloc and make sure
      it is implemented properly.  This implementation makes sure to not make
      a large memory pressure for > PAGE_SZE requests (__GFP_NORETRY) and also
      to not warn about allocation failures.  This also rules out the OOM
      killer as the vmalloc is a more approapriate fallback than a disruptive
      user visible action.
      
      This patch also changes some existing users and removes helpers which
      are specific for them.  In some cases this is not possible (e.g.
      ext4_kvmalloc, libcfs_kvzalloc) because those seems to be broken and
      require GFP_NO{FS,IO} context which is not vmalloc compatible in general
      (note that the page table allocation is GFP_KERNEL).  Those need to be
      fixed separately.
      
      While we are at it, document that __vmalloc{_node} about unsupported gfp
      mask because there seems to be a lot of confusion out there.
      kvmalloc_node will warn about GFP_KERNEL incompatible (which are not
      superset) flags to catch new abusers.  Existing ones would have to die
      slowly.
      
      [sfr@canb.auug.org.au: f2fs fixup]
        Link: http://lkml.kernel.org/r/20170320163735.332e64b7@canb.auug.org.au
      Link: http://lkml.kernel.org/r/20170306103032.2540-2-mhocko@kernel.org
      
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarStephen Rothwell <sfr@canb.auug.org.au>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>	[ext4 part]
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a7c3e901
    • Vlastimil Babka's avatar
      mm, compaction: finish whole pageblock to reduce fragmentation · baf6a9a1
      Vlastimil Babka authored
      The main goal of direct compaction is to form a high-order page for
      allocation, but it should also help against long-term fragmentation when
      possible.
      
      Most lower-than-pageblock-order compactions are for non-movable
      allocations, which means that if we compact in a movable pageblock and
      terminate as soon as we create the high-order page, it's unlikely that
      the fallback heuristics will claim the whole block.  Instead there might
      be a single unmovable page in a pageblock full of movable pages, and the
      next unmovable allocation might pick another pageblock and increase
      long-term fragmentation.
      
      To help against such scenarios, this patch changes the termination
      criteria for compaction so that the current pageblock is finished even
      though the high-order page already exists.  Note that it might be
      possible that the high-order page formed elsewhere in the zone due to
      parallel activity, but this patch doesn't try to detect that.
      
      This is only done with sync compaction, because async compaction is
      limited to pageblock of the same migratetype, where it cannot result in
      a migratetype fallback.  (Async compaction also eagerly skips
      order-aligned blocks where isolation fails, which is against the goal of
      migrating away as much of the pageblock as possible.)
      
      As a result of this patch, long-term memory fragmentation should be
      reduced.
      
      In testing based on 4.9 kernel with stress-highalloc from mmtests
      configured for order-4 GFP_KERNEL allocations, this patch has reduced
      the number of unmovable allocations falling back to movable pageblocks
      by 20%.  The number
      
      Link: http://lkml.kernel.org/r/20170307131545.28577-9-vbabka@suse.cz
      
      
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      baf6a9a1
    • Vlastimil Babka's avatar
      mm, compaction: restrict async compaction to pageblocks of same migratetype · 282722b0
      Vlastimil Babka authored
      The migrate scanner in async compaction is currently limited to
      MIGRATE_MOVABLE pageblocks.  This is a heuristic intended to reduce
      latency, based on the assumption that non-MOVABLE pageblocks are
      unlikely to contain movable pages.
      
      However, with the exception of THP's, most high-order allocations are
      not movable.  Should the async compaction succeed, this increases the
      chance that the non-MOVABLE allocations will fallback to a MOVABLE
      pageblock, making the long-term fragmentation worse.
      
      This patch attempts to help the situation by changing async direct
      compaction so that the migrate scanner only scans the pageblocks of the
      requested migratetype.  If it's a non-MOVABLE type and there are such
      pageblocks that do contain movable pages, chances are that the
      allocation can succeed within one of such pageblocks, removing the need
      for a fallback.  If that fails, the subsequent sync attempt will ignore
      this restriction.
      
      In testing based on 4.9 kernel with stress-highalloc from mmtests
      configured for order-4 GFP_KERNEL allocations, this patch has reduced
      the number of unmovable allocations falling back to movable pageblocks
      by 30%.  The number of movable allocations falling back is reduced by
      12%.
      
      Link: http://lkml.kernel.org/r/20170307131545.28577-8-vbabka@suse.cz
      
      
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      282722b0
    • Vlastimil Babka's avatar
      mm, compaction: add migratetype to compact_control · d39773a0
      Vlastimil Babka authored
      Preparation patch.  We are going to need migratetype at lower layers
      than compact_zone() and compact_finished().
      
      Link: http://lkml.kernel.org/r/20170307131545.28577-7-vbabka@suse.cz
      
      
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d39773a0
    • Vlastimil Babka's avatar
      mm, compaction: change migrate_async_suitable() to suitable_migration_source() · b682debd
      Vlastimil Babka authored
      Preparation for making the decisions more complex and depending on
      compact_control flags.  No functional change.
      
      Link: http://lkml.kernel.org/r/20170307131545.28577-6-vbabka@suse.cz
      
      
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b682debd
    • Vlastimil Babka's avatar
      mm, page_alloc: count movable pages when stealing from pageblock · 02aa0cdd
      Vlastimil Babka authored
      When stealing pages from pageblock of a different migratetype, we count
      how many free pages were stolen, and change the pageblock's migratetype
      if more than half of the pageblock was free.  This might be too
      conservative, as there might be other pages that are not free, but were
      allocated with the same migratetype as our allocation requested.
      
      While we cannot determine the migratetype of allocated pages precisely
      (at least without the page_owner functionality enabled), we can count
      pages that compaction would try to isolate for migration - those are
      either on LRU or __PageMovable().  The rest can be assumed to be
      MIGRATE_RECLAIMABLE or MIGRATE_UNMOVABLE, which we cannot easily
      distinguish.  This counting can be done as part of free page stealing
      with little additional overhead.
      
      The page stealing code is changed so that it considers free pages plus
      pages of the "good" migratetype for the decision whether to change
      pageblock's migratetype.
      
      The result should be more accurate migratetype of pageblocks wrt the
      actual pages in the pageblocks, when stealing from semi-occupied
      pageblocks.  This should help the efficiency of page grouping by
      mobility.
      
      In testing based on 4.9 kernel with stress-highalloc from mmtests
      configured for order-4 GFP_KERNEL allocations, this patch has reduced
      the number of unmovable allocations falling back to movable pageblocks
      by 47%.  The number of movable allocations falling back to other
      pageblocks are increased by 55%, but these events don't cause permanent
      fragmentation, so the tradeoff should be positive.  Later patches also
      offset the movable fallback increase to some extent.
      
      [akpm@linux-foundation.org: merge fix]
      Link: http://lkml.kernel.org/r/20170307131545.28577-5-vbabka@suse.cz
      
      
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      02aa0cdd
    • Vlastimil Babka's avatar
      mm, page_alloc: split smallest stolen page in fallback · 3bc48f96
      Vlastimil Babka authored
      The __rmqueue_fallback() function is called when there's no free page of
      requested migratetype, and we need to steal from a different one.
      
      There are various heuristics to make this event infrequent and reduce
      permanent fragmentation.  The main one is to try stealing from a
      pageblock that has the most free pages, and possibly steal them all at
      once and convert the whole pageblock.  Precise searching for such
      pageblock would be expensive, so instead the heuristics walks the free
      lists from MAX_ORDER down to requested order and assumes that the block
      with highest-order free page is likely to also have the most free pages
      in total.
      
      Chances are that together with the highest-order page, we steal also
      pages of lower orders from the same block.  But then we still split the
      highest order page.  This is wasteful and can contribute to
      fragmentation instead of avoiding it.
      
      This patch thus changes __rmqueue_fallback() to just steal the page(s)
      and put them on the freelist of the requested migratetype, and only
      report whether it was successful.  Then we pick (and eventually split)
      the smallest page with __rmqueue_smallest().  This all happens under
      zone lock, so nobody can steal it from us in the process.  This should
      reduce fragmentation due to fallbacks.  At worst we are only stealing a
      single highest-order page and waste some cycles by moving it between
      lists and then removing it, but fallback is not exactly hot path so that
      should not be a concern.  As a side benefit the patch removes some
      duplicate code by reusing __rmqueue_smallest().
      
      [vbabka@suse.cz: fix endless loop in the modified __rmqueue()]
        Link: http://lkml.kernel.org/r/59d71b35-d556-4fc9-ee2e-1574259282fd@suse.cz
      Link: http://lkml.kernel.org/r/20170307131545.28577-4-vbabka@suse.cz
      
      
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3bc48f96
    • Vlastimil Babka's avatar
      mm, compaction: remove redundant watermark check in compact_finished() · 228d7e33
      Vlastimil Babka authored
      When detecting whether compaction has succeeded in forming a high-order
      page, __compact_finished() employs a watermark check, followed by an own
      search for a suitable page in the freelists.  This is not ideal for two
      reasons:
      
       - The watermark check also searches high-order freelists, but has a
         less strict criteria wrt fallback. It's therefore redundant and waste
         of cycles. This was different in the past when high-order watermark
         check attempted to apply reserves to high-order pages.
      
       - The watermark check might actually fail due to lack of order-0 pages.
         Compaction can't help with that, so there's no point in continuing
         because of that. It's possible that high-order page still exists and
         it terminates.
      
      This patch therefore removes the watermark check.  This should save some
      cycles and terminate compaction sooner in some cases.
      
      Link: http://lkml.kernel.org/r/20170307131545.28577-3-vbabka@suse.cz
      
      
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      228d7e33
    • Vlastimil Babka's avatar
      mm, compaction: reorder fields in struct compact_control · f25ba6dc
      Vlastimil Babka authored
      Patch series "try to reduce fragmenting fallbacks", v3.
      
      Last year, Johannes Weiner has reported a regression in page mobility
      grouping [1] and while the exact cause was not found, I've come up with
      some ways to improve it by reducing the number of allocations falling
      back to different migratetype and causing permanent fragmentation.
      
      The series was tested with mmtests stress-highalloc modified to do
      GFP_KERNEL order-4 allocations, on 4.9 with "mm, vmscan: fix zone
      balance check in prepare_kswapd_sleep" (without that, kcompactd indeed
      wasn't woken up) on UMA machine with 4GB memory.  There were 5 repeats
      of each run, as the extfrag stats are quite volatile (note the stats
      below are sums, not averages, as it was less perl hacking for me).
      
      Success rate are the same, already high due to the low allocation order
      used, so I'm not including them.
      
      Compaction stats:
      (the patches are stacked, and I haven't measured the non-functional-changes
      patches separately)
      
                                           patch 1     patch 2     patch 3     patch 4     patch 7     patch 8
        Compaction stalls                    22449       24680       24846       19765       22059       17480
        Compaction success                   12971       14836       14608       10475       11632        8757
        Compaction failures                   9477        9843       10238        9290       10426        8722
        Page migrate success               3109022     3370438     3312164     1695105     1608435     2111379
        Page migrate failure                911588     1149065     1028264     1112675     1077251     1026367
        Compaction pages isolated          7242983     8015530     7782467     4629063     4402787     5377665
        Compaction migrate scanned       980838938   987367943   957690188   917647238   947155598  1018922197
        Compaction free scanned          557926893   598946443   602236894   594024490   541169699   763651731
        Compaction cost                      10243       10578       10304        8286        8398        9440
      
      Compaction stats are mostly within noise until patch 4, which decreases
      the number of compactions, and migrations.  Part of that could be due to
      more pageblocks marked as unmovable, and async compaction skipping
      those.  This changes a bit with patch 7, but not so much.  Patch 8
      increases free scanner stats and migrations, which comes from the
      changed termination criteria.  Interestingly number of compactions
      decreases - probably the fully compacted pageblock satisfies multiple
      subsequent allocations, so it amortizes.
      
      Next comes the extfrag tracepoint, where "fragmenting" means that an
      allocation had to fallback to a pageblock of another migratetype which
      wasn't fully free (which is almost all of the fallbacks).  I have
      locally added another tracepoint for "Page steal" into
      steal_suitable_fallback() which triggers in situations where we are
      allowed to do move_freepages_block().  If we decide to also do
      set_pageblock_migratetype(), it's "Pages steal with pageblock" with
      break down for which allocation migratetype we are stealing and from
      which fallback migratetype.  The last part "due to counting" comes from
      patch 4 and counts the events where the counting of movable pages
      allowed us to change pageblock's migratetype, while the number of free
      pages alone wouldn't be enough to cross the threshold.
      
                                                             patch 1     patch 2     patch 3     patch 4     patch 7     patch 8
        Page alloc extfrag event                            10155066     8522968    10164959    15622080    13727068    13140319
        Extfrag fragmenting                                 10149231     8517025    10159040    15616925    13721391    13134792
        Extfrag fragmenting for unmovable                     159504      168500      184177       97835       70625       56948
        Extfrag fragmenting unmovable placed with movable     153613      163549      172693       91740       64099       50917
        Extfrag fragmenting unmovable placed with reclaim.      5891        4951       11484        6095        6526        6031
        Extfrag fragmenting for reclaimable                     4738        4829        6345        4822        5640        5378
        Extfrag fragmenting reclaimable placed with movable     1836        1902        1851        1579        1739        1760
        Extfrag fragmenting reclaimable placed with unmov.      2902        2927        4494        3243        3901        3618
        Extfrag fragmenting for movable                      9984989     8343696     9968518    15514268    13645126    13072466
        Pages steal                                           179954      192291      210880      123254       94545       81486
        Pages steal with pageblock                             22153       18943       20154       33562       29969       33444
        Pages steal with pageblock for unmovable               14350       12858       13256       20660       19003       20852
        Pages steal with pageblock for unmovable from mov.     12812       11402       11683       19072       17467       19298
        Pages steal with pageblock for unmovable from recl.     1538        1456        1573        1588        1536        1554
        Pages steal with pageblock for movable                  7114        5489        5965       11787       10012       11493
        Pages steal with pageblock for movable from unmov.      6885        5291        5541       11179        9525       10885
        Pages steal with pageblock for movable from recl.        229         198         424         608         487         608
        Pages steal with pageblock for reclaimable               689         596         933        1115         954        1099
        Pages steal with pageblock for reclaimable from unmov.   273         219         537         658         547         667
        Pages steal with pageblock for reclaimable from mov.     416         377         396         457         407         432
        Pages steal with pageblock due to counting                                                 11834       10075        7530
        ... for unmovable                                                                           8993        7381        4616
        ... for movable                                                                             2792        2653        2851
        ... for reclaimable                                                                           49          41          63
      
      What we can see is that "Extfrag fragmenting for unmovable" and "...
      placed with movable" drops with almost each patch, which is good as we
      are polluting less movable pageblocks with unmovable pages.
      
      The most significant change is patch 4 with movable page counting.  On
      the other hand it increases "Extfrag fragmenting for movable" by 50%.
      "Pages steal" drops though, so these movable allocation fallbacks find
      only small free pages and are not allowed to steal whole pageblocks
      back.  "Pages steal with pageblock" raises, because the patch increases
      the chances of pageblock migratetype changes to happen.  This affects
      all migratetypes.
      
      The summary is that patch 4 is not a clear win wrt these stats, but I
      believe that the tradeoff it makes is a good one.  There's less
      pollution of movable pageblocks by unmovable allocations.  There's less
      stealing between pageblock, and those that remain have higher chance of
      changing migratetype also the pageblock itself, so it should more
      faithfully reflect the migratetype of the pages within the pageblock.
      The increase of movable allocations falling back to unmovable pageblock
      might look dramatic, but those allocations can be migrated by compaction
      when needed, and other patches in the series (7-9) improve that aspect.
      
      Patches 7 and 8 continue the trend of reduced unmovable fallbacks and
      also reduce the impact on movable fallbacks from patch 4.
      
      [1] https://www.spinics.net/lists/linux-mm/msg114237.html
      
      This patch (of 8):
      
      While currently there are (mostly by accident) no holes in struct
      compact_control (on x86_64), but we are going to add more bool flags, so
      place them all together to the end of the structure.  While at it, just
      order all fields from largest to smallest.
      
      Link: http://lkml.kernel.org/r/20170307131545.28577-2-vbabka@suse.cz
      
      
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f25ba6dc
  5. May 08, 2017
    • Al Viro's avatar
      fix braino in generic_file_read_iter() · 5b47d59a
      Al Viro authored
      
      
      Wrong sign of iov_iter_revert() argument.  Unfortunately, slipped through
      the testing, since most of the time we don't do anything to the iterator
      afterwards and potential oops on walking the iter->iov too far backwards
      is too infrequent to be easily triggered.
      
      Add a sanity check in iov_iter_revert() to catch bugs like this one;
      fortunately, the same braino hadn't happened in other callers, but we'd
      better have a warning if such thing crops up.
      
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      5b47d59a
  6. May 03, 2017
Loading