  1. Jul 18, 2017
    • cgroup: create dfl_root files on subsys registration · 7af608e4
      Tejun Heo authored
      
      
      On subsystem registration, css_populate_dir() is not called on the new
      root css, so the interface files for the subsystem on cgrp_dfl_root
      aren't created on registration.  This is a residue from the days when
      cgrp_dfl_root was used only as the parking spot for unused subsystems,
      which is no longer true now that it serves as the root for cgroup2.
      
      This is often fine as later operations tend to create them as a part
      of mount (cgroup1) or subtree_control operations (cgroup2); however,
      it's not difficult to mount cgroup2 with the controller interface
      files missing, as Waiman found out.
      
      Fix it by invoking css_populate_dir() on the root css on subsys
      registration.
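
      As a rough illustration, the shape of such a fix (a hedged sketch;
      the exact call site in cgroup_init() and the WARN_ON-style error
      handling are assumptions, not quoted from the patch):

        /* while registering each subsystem: also create its interface
         * files on the cgroup2 (cgrp_dfl_root) root css
         */
        WARN_ON(css_populate_dir(init_css_set.subsys[ss->id]));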
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Waiman Long <longman@redhat.com>
      Cc: stable@vger.kernel.org # v4.5+
      Signed-off-by: Tejun Heo <tj@kernel.org>
      7af608e4
  2. Jul 08, 2017
    • cgroup: don't call migration methods if there are no tasks to migrate · 61046727
      Tejun Heo authored
      
      
      Subsystem migration methods shouldn't be called for empty migrations.
      cgroup_migrate_execute() implements this guarantee by bailing early if
      there are no source css_sets.  This used to be correct before
      a79a908f ("cgroup: introduce cgroup namespaces"), but no longer
      since the commit because css_sets can stay pinned without tasks in
      them.
      
      This caused cgroup_migrate_execute() to call into cpuset migration
      methods with an empty cgroup_taskset.  cpuset migration methods
      correctly assume that cgroup_taskset_first() never returns NULL;
      however, due to the bug, it can, leading to the following oops.
      
        Unable to handle kernel paging request for data at address 0x00000960
        Faulting instruction address: 0xc0000000001d6868
        Oops: Kernel access of bad area, sig: 11 [#1]
        ...
        CPU: 14 PID: 16947 Comm: kworker/14:0 Tainted: G        W
        4.12.0-rc4-next-20170609 #2
        Workqueue: events cpuset_hotplug_workfn
        task: c00000000ca60580 task.stack: c00000000c728000
        NIP: c0000000001d6868 LR: c0000000001d6858 CTR: c0000000001d6810
        REGS: c00000000c72b720 TRAP: 0300   Tainted: GW (4.12.0-rc4-next-20170609)
        MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 44722422  XER: 20000000
        CFAR: c000000000008710 DAR: 0000000000000960 DSISR: 40000000 SOFTE: 1
        GPR00: c0000000001d6858 c00000000c72b9a0 c000000001536e00 0000000000000000
        GPR04: c00000000c72b9c0 0000000000000000 c00000000c72bad0 c000000766367678
        GPR08: c000000766366d10 c00000000c72b958 c000000001736e00 0000000000000000
        GPR12: c0000000001d6810 c00000000e749300 c000000000123ef8 c000000775af4180
        GPR16: 0000000000000000 0000000000000000 c00000075480e9c0 c00000075480e9e0
        GPR20: c00000075480e8c0 0000000000000001 0000000000000000 c00000000c72ba20
        GPR24: c00000000c72baa0 c00000000c72bac0 c000000001407248 c00000000c72ba20
        GPR28: c00000000141fc80 c00000000c72bac0 c00000000c6bc790 0000000000000000
        NIP [c0000000001d6868] cpuset_can_attach+0x58/0x1b0
        LR [c0000000001d6858] cpuset_can_attach+0x48/0x1b0
        Call Trace:
        [c00000000c72b9a0] [c0000000001d6858] cpuset_can_attach+0x48/0x1b0 (unreliable)
        [c00000000c72ba00] [c0000000001cbe80] cgroup_migrate_execute+0xb0/0x450
        [c00000000c72ba80] [c0000000001d3754] cgroup_transfer_tasks+0x1c4/0x360
        [c00000000c72bba0] [c0000000001d923c] cpuset_hotplug_workfn+0x86c/0xa20
        [c00000000c72bca0] [c00000000011aa44] process_one_work+0x1e4/0x580
        [c00000000c72bd30] [c00000000011ae78] worker_thread+0x98/0x5c0
        [c00000000c72bdc0] [c000000000124058] kthread+0x168/0x1b0
        [c00000000c72be30] [c00000000000b2e8] ret_from_kernel_thread+0x5c/0x74
        Instruction dump:
        f821ffa1 7c7d1b78 60000000 60000000 38810020 7fa3eb78 3f42ffed 4bff4c25
        60000000 3b5a0448 3d420020 eb610020 <e9230960> 7f43d378 e9290000 f92af200
        ---[ end trace dcaaf98fb36d9e64 ]---
      
      This patch fixes the bug by adding an explicit nr_tasks counter to
      cgroup_taskset and skipping calling the migration methods if the
      counter is zero.  While at it, remove the now spurious check on no
      source css_sets.
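
      A minimal sketch of that shape (field placement and the surrounding
      loop are assumptions, not quoted from the patch):

        struct cgroup_taskset {
                /* ... */
                int nr_tasks;   /* tasks actually queued for migration */
        };

        /* in cgroup_migrate_execute(): don't invoke methods when empty */
        if (tset->nr_tasks) {
                do_each_subsys_mask(ss, ssid, mgctx->ss_mask) {
                        if (ss->can_attach) {
                                tset->ssid = ssid;
                                ret = ss->can_attach(tset);
                                /* ... */
                        }
                } while_each_subsys_mask();
        }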
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: stable@vger.kernel.org # v4.6+
      Fixes: a79a908f ("cgroup: introduce cgroup namespaces")
      Link: http://lkml.kernel.org/r/1497266622.15415.39.camel@abdul.in.ibm.com
      61046727
  3. Jul 06, 2017
    • mm: memcontrol: use generic mod_memcg_page_state for kmem pages · ed52be7b
      Johannes Weiner authored
      The kmem-specific functions do the same thing.  Switch and drop.
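
      The conversion is mechanical; a hedged before/after sketch of one
      call site (the stat item and call-site details are assumptions):

        -       memcg_kmem_update_page_stat(page, MEMCG_KERNEL_STACK_KB,
        -                                   THREAD_SIZE / 1024);
        +       mod_memcg_page_state(page, MEMCG_KERNEL_STACK_KB,
        +                            THREAD_SIZE / 1024);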
      
      Link: http://lkml.kernel.org/r/20170530181724.27197-5-hannes@cmpxchg.org
      
      
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ed52be7b
    • mm, cpuset: always use seqlock when changing task's nodemask · 5f155f27
      Vlastimil Babka authored
      When updating task's mems_allowed and rebinding its mempolicy due to
      cpuset's mems being changed, we currently only take the seqlock for
      writing when either the task has a mempolicy, or the new mems has no
      intersection with the old mems.
      
      This should be enough to prevent a parallel allocation seeing no
      available nodes, but the optimization is IMHO unnecessary (cpuset
      updates should not be frequent), and we still potentially risk issues if
      the intersection of the new and old nodes has a limited amount of
      free/reclaimable memory.
      
      Let's just use the seqlock for all tasks.
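
      A hedged sketch of the resulting update path, modelled on
      cpuset_change_task_nodemask() (locking context abbreviated):

        task_lock(tsk);
        local_irq_disable();
        write_seqcount_begin(&tsk->mems_allowed_seq);

        nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
        mpol_rebind_task(tsk, newmems);
        tsk->mems_allowed = *newmems;

        write_seqcount_end(&tsk->mems_allowed_seq);
        local_irq_enable();
        task_unlock(tsk);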
      
      Link: http://lkml.kernel.org/r/20170517081140.30654-6-vbabka@suse.cz
      
      
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f155f27
    • mm, mempolicy: simplify rebinding mempolicies when updating cpusets · 213980c0
      Vlastimil Babka authored
      Commit c0ff7453 ("cpuset,mm: fix no node to alloc memory when
      changing cpuset's mems") has introduced a two-step protocol when
      rebinding task's mempolicy due to cpuset update, in order to avoid a
      parallel allocation seeing an empty effective nodemask and failing.
      
      Later, commit cc9a6c87 ("cpuset: mm: reduce large amounts of memory
      barrier related damage v3") introduced a seqlock protection and removed
      the synchronization point between the two update steps.  At that point
      (or perhaps later), the two-step rebinding became unnecessary.
      
      Currently it only makes sure that the update first adds new nodes in
      step 1 and then removes nodes in step 2.  Without memory barriers the
      effects are questionable, and even then this cannot prevent a parallel
      zonelist iteration, which checks the nodemask at each step, from
      observing all nodes as unusable for allocation.  We now fully rely on
      the seqlock to prevent premature OOMs and allocation failures.
      
      We can thus remove the two-step update parts and simplify the code.
      
      Link: http://lkml.kernel.org/r/20170517081140.30654-5-vbabka@suse.cz
      
      
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      213980c0
    • mm: update callers to use HASH_ZERO flag · 3d375d78
      Pavel Tatashin authored
      Update the dcache, inode, pid, mountpoint, and mount hash tables to use
      HASH_ZERO, and remove the initialization after the allocations.  In
      places where HASH_EARLY was used, such as __pv_init_lock_hash, the
      zeroed hash table was already assumed because memblock zeroes the
      memory (see the sketch after the timings below).
      
      CPU: SPARC M6, Memory: 7T
      Before fix:
        Dentry cache hash table entries: 1073741824
        Inode-cache hash table entries: 536870912
        Mount-cache hash table entries: 16777216
        Mountpoint-cache hash table entries: 16777216
        ftrace: allocating 20414 entries in 40 pages
        Total time: 11.798s
      
      After fix:
        Dentry cache hash table entries: 1073741824
        Inode-cache hash table entries: 536870912
        Mount-cache hash table entries: 16777216
        Mountpoint-cache hash table entries: 16777216
        ftrace: allocating 20414 entries in 40 pages
        Total time: 3.198s
      
      CPU: Intel Xeon E5-2630, Memory: 2.2T:
      Before fix:
        Dentry cache hash table entries: 536870912
        Inode-cache hash table entries: 268435456
        Mount-cache hash table entries: 8388608
        Mountpoint-cache hash table entries: 8388608
        CPU: Physical Processor ID: 0
        Total time: 3.245s
      
      After fix:
        Dentry cache hash table entries: 536870912
        Inode-cache hash table entries: 268435456
        Mount-cache hash table entries: 8388608
        Mountpoint-cache hash table entries: 8388608
        CPU: Physical Processor ID: 0
        Total time: 3.244s
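
      For illustration, the dcache call site roughly changes as follows (a
      hedged sketch; the argument list is abbreviated from fs/dcache.c):

        /* before: allocate, then zero every bucket by hand */
        dentry_hashtable = alloc_large_system_hash("Dentry cache",
                                sizeof(struct hlist_bl_head),
                                dhash_entries, 13, 0,
                                &d_hash_shift, &d_hash_mask, 0, 0);
        for (loop = 0; loop < (1U << d_hash_shift); loop++)
                INIT_HLIST_BL_HEAD(dentry_hashtable + loop);

        /* after: the allocator hands back zeroed memory */
        dentry_hashtable = alloc_large_system_hash("Dentry cache",
                                sizeof(struct hlist_bl_head),
                                dhash_entries, 13, HASH_ZERO,
                                &d_hash_shift, &d_hash_mask, 0, 0);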
      
      Link: http://lkml.kernel.org/r/1488432825-92126-4-git-send-email-pasha.tatashin@oracle.com
      
      
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: Babu Moger <babu.moger@oracle.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3d375d78
    • kernel/exit.c: don't include unused userfaultfd_k.h · 57ecbd38
      Mike Rapoport authored
      Commit dd0db88d ("userfaultfd: non-cooperative: rollback
      userfaultfd_exit") removed the userfaultfd callback from exit(), which
      makes the include of <linux/userfaultfd_k.h> unnecessary.
      
      Link: http://lkml.kernel.org/r/1494930907-3060-1-git-send-email-rppt@linux.vnet.ibm.com
      
      
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      57ecbd38
    • mm, memory_hotplug: replace for_device by want_memblock in arch_add_memory · 3d79a728
      Michal Hocko authored
      arch_add_memory gets a for_device argument which then controls whether
      we want to create memblocks for the created memory sections.  Simplify
      the logic by stating directly whether we want memblocks, rather than
      going through a pointless negation.  This also makes the API easier to
      understand: it is clear what we want, rather than a say-nothing
      for_device argument that can mean anything.
      
      This shouldn't introduce any functional change.
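
      A hedged sketch of the resulting interface (x86-flavoured; the
      per-arch mapping setup is elided):

        int arch_add_memory(int nid, u64 start, u64 size, bool want_memblock)
        {
                unsigned long start_pfn = start >> PAGE_SHIFT;
                unsigned long nr_pages = size >> PAGE_SHIFT;

                /* ... arch-specific mapping setup elided ... */

                /* callers state the intent directly, no more !for_device */
                return __add_pages(nid, start_pfn, nr_pages, want_memblock);
        }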
      
      Link: http://lkml.kernel.org/r/20170515085827.16474-13-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Tested-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3d79a728
    • mm, memory_hotplug: do not associate hotadded memory to zones until online · f1dd2cd1
      Michal Hocko authored
      The current memory hotplug implementation relies on having all the
      struct pages associated with a zone/node during the physical hotplug
      phase (arch_add_memory->__add_pages->__add_section->__add_zone).  In the
      vast majority of cases this means that they are added to ZONE_NORMAL.
      This has been so since 9d99aaa3 ("[PATCH] x86_64: Support memory
      hotadd without sparsemem") and it wasn't a big deal back then because
      movable onlining didn't exist yet.
      
      Much later memory hotplug wanted to (ab)use ZONE_MOVABLE for movable
      onlining in 511c2aba ("mm, memory-hotplug: dynamic configure movable
      memory and portion memory") and then things got more complicated.
      Rather than reconsidering the zone association, which was no longer
      needed (because memory hotplug already depended on SPARSEMEM), a
      convoluted semantic of zone shifting was developed.  Only the currently
      last memblock, or the one adjacent to ZONE_MOVABLE, can be onlined
      movable.  This essentially means that the online type changes as new
      memblocks are added.
      
      Let's simulate memory hot online manually
        $ echo 0x100000000 > /sys/devices/system/memory/probe
        $ grep . /sys/devices/system/memory/memory32/valid_zones
        Normal Movable
      
        $ echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe
        $ grep . /sys/devices/system/memory/memory3?/valid_zones
        /sys/devices/system/memory/memory32/valid_zones:Normal
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
      
        $ echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe
        $ grep . /sys/devices/system/memory/memory3?/valid_zones
        /sys/devices/system/memory/memory32/valid_zones:Normal
        /sys/devices/system/memory/memory33/valid_zones:Normal
        /sys/devices/system/memory/memory34/valid_zones:Normal Movable
      
        $ echo online_movable > /sys/devices/system/memory/memory34/state
        $ grep . /sys/devices/system/memory/memory3?/valid_zones
        /sys/devices/system/memory/memory32/valid_zones:Normal
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Movable Normal
      
      This is an awkward semantic because a udev event is sent as soon as the
      block is onlined and a udev handler might want to online it based on
      some policy (e.g.  association with a node) but it will inherently race
      with new blocks showing up.
      
      This patch changes the physical online phase to not associate pages with
      any zone at all.  All the pages are just marked reserved and wait for
      the onlining phase to be associated with the zone as per the online
      request.  There are only two requirements:
      
      	- existing ZONE_NORMAL and ZONE_MOVABLE cannot overlap
      
      	- ZONE_NORMAL precedes ZONE_MOVABLE in physical addresses
      
      The latter is not an inherent requirement and can be changed in the
      future; it is kept for now because it preserves the current behavior
      and keeps the code slightly simpler.
      
      This means that the same physical online steps as above will lead to
      the following state:

        Normal Movable

        /sys/devices/system/memory/memory32/valid_zones:Normal Movable
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
      
        /sys/devices/system/memory/memory32/valid_zones:Normal Movable
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Normal Movable
      
        /sys/devices/system/memory/memory32/valid_zones:Normal Movable
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Movable
      
      Implementation:
      The current move_pfn_range is reimplemented to check the above
      requirements (allow_online_pfn_range) and then updates the respective
      zone (move_pfn_range_to_zone), the pgdat and links all the pages in the
      pfn range with the zone/node.  __add_pages is updated to not require the
      zone and only initializes sections in the range.  This allowed
      simplifying the arch_add_memory code (s390 could get rid of quite a bit
      of code).
      
      devm_memremap_pages is the only user of arch_add_memory which relies on
      the zone association, because it hooks into memory hotplug only half
      way.  It uses the association to place the new memory in ZONE_DEVICE,
      but doesn't allow it to be {on,off}lined via sysfs.  This means that
      this particular code path has to call move_pfn_range_to_zone explicitly.
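
      A hedged sketch of that explicit call (names taken from the
      description above; the pfn/size derivation is an assumption):

        /* in devm_memremap_pages(): add pages without a zone, then
         * place them in ZONE_DEVICE by hand
         */
        ret = arch_add_memory(nid, align_start, align_size, false);
        if (!ret)
                move_pfn_range_to_zone(
                        &NODE_DATA(nid)->node_zones[ZONE_DEVICE],
                        align_start >> PAGE_SHIFT,
                        align_size >> PAGE_SHIFT);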
      
      The original zone shifting code is kept in place and will be removed in
      the follow up patch for an easier review.
      
      Please note that this patch also changes the original behavior:
      offlining a memory block adjacent to another zone (Normal vs.  Movable)
      used to allow changing its movable type.  This will be handled later.
      
      [richard.weiyang@gmail.com: simplify zone_intersects()]
        Link: http://lkml.kernel.org/r/20170616092335.5177-1-richard.weiyang@gmail.com
      [richard.weiyang@gmail.com: remove duplicate call for set_page_links]
        Link: http://lkml.kernel.org/r/20170616092335.5177-2-richard.weiyang@gmail.com
      [akpm@linux-foundation.org: remove unused local `i']
      Link: http://lkml.kernel.org/r/20170515085827.16474-12-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Tested-by: Dan Williams <dan.j.williams@intel.com>
      Tested-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # For s390 bits
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f1dd2cd1
    • kernel/module.c: use linux/set_memory.h · 563ec5cb
      Michael Ellerman authored
      This header always exists, so it doesn't require an ifdef around its
      inclusion.  When CONFIG_ARCH_HAS_SET_MEMORY=y it includes the asm
      header, otherwise it provides empty versions of the set_memory_xx()
      routines.
      
      The usages of set_memory_xx() are still guarded by
      CONFIG_STRICT_MODULE_RWX.
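
      A hedged sketch of the resulting pattern (call-site details from
      kernel/module.c abbreviated):

        #include <linux/set_memory.h>   /* always present; no ifdef */

        #ifdef CONFIG_STRICT_MODULE_RWX
                /* the usage stays guarded, only the include is not */
                set_memory_ro((unsigned long)base, text_pages);
                set_memory_nx((unsigned long)base, total_pages);
        #endif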
      
      Link: http://lkml.kernel.org/r/1498717781-29151-3-git-send-email-mpe@ellerman.id.au
      
      
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Acked-by: Kees Cook <keescook@chromium.org>
      Acked-by: Laura Abbott <labbott@redhat.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      563ec5cb
    • kernel/power/snapshot.c: use linux/set_memory.h · 61f6d09a
      Michael Ellerman authored
      This header always exists, so it doesn't require an ifdef around its
      inclusion.  When CONFIG_ARCH_HAS_SET_MEMORY=y it includes the asm
      header, otherwise it provides empty versions of the set_memory_xx()
      routines.
      
      Link: http://lkml.kernel.org/r/1498717781-29151-2-git-send-email-mpe@ellerman.id.au
      
      
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Acked-by: Kees Cook <keescook@chromium.org>
      Acked-by: Laura Abbott <labbott@redhat.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      61f6d09a
    • kernel/extable.c: mark core_kernel_text notrace · c0d80dda
      Marcin Nowakowski authored
      core_kernel_text is used by MIPS in its function graph trace processing,
      so having this function traced leads to an infinite set of recursive calls
      such as:
      
        Call Trace:
           ftrace_return_to_handler+0x50/0x128
           core_kernel_text+0x10/0x1b8
           prepare_ftrace_return+0x6c/0x114
           ftrace_graph_caller+0x20/0x44
           return_to_handler+0x10/0x30
           return_to_handler+0x0/0x30
           return_to_handler+0x0/0x30
           ftrace_ops_no_ops+0x114/0x1bc
           core_kernel_text+0x10/0x1b8
           core_kernel_text+0x10/0x1b8
           core_kernel_text+0x10/0x1b8
           ftrace_ops_no_ops+0x114/0x1bc
           core_kernel_text+0x10/0x1b8
           prepare_ftrace_return+0x6c/0x114
           ftrace_graph_caller+0x20/0x44
           (...)
      
      Mark the function notrace to avoid it being traced.
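
      The fix is a one-word annotation; a sketch of the resulting
      signature (function body abbreviated from kernel/extable.c):

        int notrace core_kernel_text(unsigned long addr)
        {
                if (addr >= (unsigned long)_stext &&
                    addr < (unsigned long)_etext)
                        return 1;
                /* ... init-text checks elided ... */
                return 0;
        }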
      
      Link: http://lkml.kernel.org/r/1498028607-6765-1-git-send-email-marcin.nowakowski@imgtec.com
      
      
      Signed-off-by: Marcin Nowakowski <marcin.nowakowski@imgtec.com>
      Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Meyer <thomas@m3y3r.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c0d80dda
  4. Jul 05, 2017
  5. Jul 03, 2017
    • bpf, verifier: add additional patterns to evaluate_reg_imm_alu · 43188702
      John Fastabend authored
      
      
      Currently the verifier does not track imm across alu operations when
      the source register is of unknown type. This adds additional pattern
      matching to catch this and track imm. We've seen LLVM generating this
      pattern while working on cilium.
      
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      43188702
    • bpf: extend bpf_trace_printk to support %i · 7bda4b40
      John Fastabend authored
      
      
      Currently, bpf_trace_printk does not support the common formatting
      symbol '%i'; however, vsprintf does, and it is what eventually gets
      called by the bpf helper.  If users are used to '%i' and currently
      make use of it, then bpf_trace_printk will just return with an error
      without dumping anything to the trace pipe, so just add support for
      '%i' to the helper.
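
      A hedged sketch of the validation change (the surrounding format
      string loop in kernel/trace/bpf_trace.c is abbreviated):

        /* '%i' is now accepted wherever '%d' already was */
        if (fmt[i] != 'i' && fmt[i] != 'd' &&
            fmt[i] != 'u' && fmt[i] != 'x')
                return -EINVAL;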
      
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7bda4b40
    • bpf: export whether tail call has jited owner · 9780c0ab
      Daniel Borkmann authored
      
      
      We already export through fdinfo whether a prog is JITed or not.  Given
      that a program load can fail when the prog and the tail call map don't
      match in their JITed property (both must either be JITed or not JITed),
      exporting owner_jited of the tail call map facilitates error reporting
      in loaders like iproute2.  We already export owner_prog_type through
      this facility, so a parser can pick up both for comparison.
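
      Illustrative fdinfo output after this change (the values are made
      up; only the owner_* field names come from the description):

        # cat /proc/<pid>/fdinfo/<map-fd>
        map_type:       3
        key_size:       4
        value_size:     4
        max_entries:    2
        owner_prog_type:        3
        owner_jited:    1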
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9780c0ab
    • bpf: simplify narrower ctx access · f96da094
      Daniel Borkmann authored
      
      
      This work tries to make the semantics and code around the
      narrower ctx access a bit easier to follow. Right now
      everything is done inside the .is_valid_access(). Offset
      matching is done differently for read/write types, meaning
      writes don't support narrower access and thus matching only
      on offsetof(struct foo, bar) is enough, whereas for the read
      case that supports narrower access we must check the whole
      range from offsetof(struct foo, bar) to offsetof(struct foo,
      bar) + sizeof(<bar>) - 1 for each of the cases. For read cases of
      individual members that don't support narrower access (like
      packet pointers or skb->cb[] case which has its own narrow
      access logic), we check as usual only offsetof(struct foo,
      bar) like in write case. Then, for the case where narrower
      access is allowed, we also need to set the aux info for the
      access. Meaning, ctx_field_size and converted_op_size have
      to be set. First is the original field size e.g. sizeof(<bar>)
      as in above example from the user facing ctx, and latter
      one is the target size after actual rewrite happened, thus
      for the kernel facing ctx. Also here we need the range match,
      and we need to keep convert_ctx_access() and converted_op_size
      consistent with is_valid_access(), even though the two live in
      different locations.
      
      We can simplify the code a bit: check_ctx_access() becomes
      simpler in that we only store ctx_field_size as meta data, and
      later in convert_ctx_accesses() we fetch the target_size right
      from the location where we do the conversion. Should the
      verifier be misconfigured, we reject BPF_WRITE cases or a
      target_size that is not provided. For the subsystems, we always
      work on ranges in is_valid_access() and add small helpers for
      ranges and narrow access; convert_ctx_accesses() sets
      target_size for the relevant instruction.
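
      A hedged sketch of what such a range helper and its use could look
      like (the macro shape is an assumption, not quoted from the patch):

        #define bpf_ctx_range(TYPE, MEMBER)                             \
                offsetof(TYPE, MEMBER) ... offsetofend(TYPE, MEMBER) - 1

        /* in a subsystem's is_valid_access(): */
        switch (off) {
        case bpf_ctx_range(struct __sk_buff, len):
                /* narrow access allowed; record the full field size */
                info->ctx_field_size = sizeof(__u32);
                break;
        }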
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Cc: Yonghong Song <yhs@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f96da094
  6. Jul 01, 2017
    • bpf: BPF support for sock_ops · 40304b2a
      Lawrence Brakmo authored
      
      
      Created a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a corresponding
      struct that allows BPF programs of this type to access some of the
      socket's fields (such as IP addresses, ports, etc.). It uses the
      existing bpf cgroups infrastructure so the programs can be attached per
      cgroup with full inheritance support. The program will be called at
      appropriate times to set relevant connection parameters such as buffer
      sizes, SYN and SYN-ACK RTOs, etc., based on connection information such
      as IP addresses, port numbers, etc.
      
      Although there are already 3 mechanisms to set parameters (sysctls,
      route metrics and setsockopts), this new mechanism provides some
      distinct advantages. Unlike sysctls, it can set parameters per
      connection. In contrast to route metrics, it can also use port numbers
      and information provided by a user level program. In addition, it could
      set parameters probabilistically for evaluation purposes (i.e. do
      something different on 10% of the flows and compare results with the
      other 90% of the flows). Also, in cases where IPv6 addresses contain
      geographic information, the rules to make changes based on the distance
      (or RTT) between the hosts are much easier than route metric rules and
      can be global. Finally, unlike setsockopt, it does not require
      application changes and it can be updated easily at any time.
      
      Although the bpf cgroup framework already contains a sock related
      program type (BPF_PROG_TYPE_CGROUP_SOCK), I created the new type
      (BPF_PROG_TYPE_SOCK_OPS) because the existing type expects to be called
      only once during the connection's lifetime.  In contrast, the new
      program type will be called multiple times from different places in the
      network stack code.  For example, before sending SYN and SYN-ACKs to set
      an appropriate timeout, when the connection is established to set
      congestion control, etc.  As a result it has an "op" field to specify
      the type of operation requested.
      
      The purpose of this new program type is to simplify setting connection
      parameters, such as buffer sizes, TCP's SYN RTO, etc. For example, it is
      easy to use facebook's internal IPv6 addresses to determine if both hosts
      of a connection are in the same datacenter. Therefore, it is easy to
      write a BPF program to choose a small SYN RTO value when both hosts are
      in the same datacenter.
      
      This patch only contains the framework to support the new BPF program
      type, following patches add the functionality to set various connection
      parameters.
      
      This patch defines the new BPF program type BPF_PROG_TYPE_SOCK_OPS
      and a new bpf syscall command to load a new program of this type:
      BPF_PROG_LOAD_SOCKET_OPS.
      
      Two new corresponding structs (one for the kernel, one for the
      user/BPF program):
      
      /* kernel version */
      struct bpf_sock_ops_kern {
              struct sock *sk;
              __u32  op;
              union {
                      __u32 reply;
                      __u32 replylong[4];
              };
      };
      
      /* user version
       * Some fields are in network byte order reflecting the sock struct
       * Use the bpf_ntohl helper macro in samples/bpf/bpf_endian.h to
       * convert them to host byte order.
       */
      struct bpf_sock_ops {
              __u32 op;
              union {
                      __u32 reply;
                      __u32 replylong[4];
              };
              __u32 family;
              __u32 remote_ip4;     /* In network byte order */
              __u32 local_ip4;      /* In network byte order */
              __u32 remote_ip6[4];  /* In network byte order */
              __u32 local_ip6[4];   /* In network byte order */
              __u32 remote_port;    /* In network byte order */
        __u32 local_port;     /* In host byte order */
      };
      
      Currently there are two types of ops. The first type expects the BPF
      program to return a value which is then used by the caller (or a
      negative value to indicate the operation is not supported). The second
      type expects state changes to be done by the BPF program, for example
      through a setsockopt BPF helper function, and ignores the return
      value.
      
      The reply fields of the bpf_sock_ops struct are there in case a bpf
      program needs to return a value larger than an integer.
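
      For illustration, a minimal sock_ops program could look as follows
      (a hedged sketch: the SEC() helper and the specific op constant come
      from samples and follow-up patches, not from this patch):

        SEC("sockops")
        int set_syn_rto(struct bpf_sock_ops *skops)
        {
                int rv = -1;            /* default: op not supported */

                switch (skops->op) {
                case BPF_SOCK_OPS_TIMEOUT_INIT: /* added later in series */
                        rv = 10;        /* example: short SYN RTO */
                        break;
                }
                skops->reply = rv;
                return 1;
        }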
      
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      40304b2a
  7. Jun 30, 2017
  8. Jun 29, 2017
    • PM: hibernate: constify attribute_group structures. · 59494fe2
      Arvind Yadav authored
      
      
      attribute_groups are not supposed to change at runtime. All functions
      working with attribute_groups provided by <linux/sysfs.h> work with const
      attribute_group. So mark the non-const structs as const.
      
      File size before:
         text	   data	    bss	    dec	    hex	filename
         6332	    488	    308	   7128	   1bd8	kernel/power/hibernate.o
      
      File size after adding 'const':
         text	   data	    bss	    dec	    hex	filename
         6396	    424	    308	   7128	   1bd8	kernel/power/hibernate.o
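
      The change itself is a single qualifier per structure (sketch):

        static const struct attribute_group attr_group = {
                .attrs = g,
        };

      Marking the object const moves it from .data to .rodata, which is
      exactly the text/data shift the size output above shows.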
      
      Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com>
      Acked-by: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      59494fe2
    • bpf: prevent leaking pointer via xadd on unpriviledged · 6bdf6abc
      Daniel Borkmann authored
      Leaking kernel addresses to unprivileged users is generally
      disallowed; for example, the verifier rejects the following:
      
        0: (b7) r0 = 0
        1: (18) r2 = 0xffff897e82304400
        3: (7b) *(u64 *)(r1 +48) = r2
        R2 leaks addr into ctx
      
      Doing pointer arithmetic on them is also forbidden, so that they
      don't turn into unknown value and then get leaked out. However,
      there's xadd as a special case, where we don't check the src reg
      for being a pointer register, e.g. the following will pass:
      
        0: (b7) r0 = 0
        1: (7b) *(u64 *)(r1 +48) = r0
        2: (18) r2 = 0xffff897e82304400 ; map
        4: (db) lock *(u64 *)(r1 +48) += r2
        5: (95) exit
      
      We could store the pointer into skb->cb, lose the type context, and
      then read it out from there again to eventually leak it out of a map
      value. Or more easily in a different variant, too:
      
         0: (bf) r6 = r1
         1: (7a) *(u64 *)(r10 -8) = 0
         2: (bf) r2 = r10
         3: (07) r2 += -8
         4: (18) r1 = 0x0
         6: (85) call bpf_map_lookup_elem#1
         7: (15) if r0 == 0x0 goto pc+3
         R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R6=ctx R10=fp
         8: (b7) r3 = 0
         9: (7b) *(u64 *)(r0 +0) = r3
        10: (db) lock *(u64 *)(r0 +0) += r6
        11: (b7) r0 = 0
        12: (95) exit
      
        from 7 to 11: R0=inv,min_value=0,max_value=0 R6=ctx R10=fp
        11: (b7) r0 = 0
        12: (95) exit
      
      Prevent this by checking xadd src reg for pointer types. Also
      add a couple of test cases related to this.
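
      The added check plausibly mirrors the existing pointer-leak checks
      (a hedged sketch of the xadd handling in the verifier):

        /* the source of an xadd must not be a pointer */
        if (is_pointer_value(env, insn->src_reg)) {
                verbose("R%d leaks addr into mem\n", insn->src_reg);
                return -EACCES;
        }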
      
      Fixes: 1be7f75d1668 ("bpf: enable non-root eBPF programs")
      Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6bdf6abc
    • bpf: Fix out-of-bound access on interpreters[] · 8007e40a
      Martin KaFai Lau authored
      
      
      The index is off-by-one when fp->aux->stack_depth has already been
      rounded up to 32.  In particular, if stack_depth is 512, the index
      will be 16.

      The fix is to round_up and then subtract 1, instead of using
      round_down.
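
      In code, the fixed index computation (a sketch; the surrounding
      assignment in kernel/bpf/core.c is assumed):

        u32 stack_depth = max_t(u32, fp->aux->stack_depth, 1);

        /* 512 must select the last slot (index 15), not one past it */
        fp->bpf_func = interpreters[(round_up(stack_depth, 32) / 32) - 1];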
      
      [   22.318680] ==================================================================
      [   22.319745] BUG: KASAN: global-out-of-bounds in bpf_prog_select_runtime+0x48a/0x670
      [   22.320737] Read of size 8 at addr ffffffff82aadae0 by task sockex3/1946
      [   22.321646]
      [   22.321858] CPU: 1 PID: 1946 Comm: sockex3 Tainted: G        W       4.12.0-rc6-01680-g2ee87db3a287 #22
      [   22.323061] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.el7.centos 04/01/2014
      [   22.324260] Call Trace:
      [   22.324612]  dump_stack+0x67/0x99
      [   22.325081]  print_address_description+0x1e8/0x290
      [   22.325734]  ? bpf_prog_select_runtime+0x48a/0x670
      [   22.326360]  kasan_report+0x265/0x350
      [   22.326860]  __asan_report_load8_noabort+0x19/0x20
      [   22.327484]  bpf_prog_select_runtime+0x48a/0x670
      [   22.328109]  bpf_prog_load+0x626/0xd40
      [   22.328637]  ? __bpf_prog_charge+0xc0/0xc0
      [   22.329222]  ? check_nnp_nosuid.isra.61+0x100/0x100
      [   22.329890]  ? __might_fault+0xf6/0x1b0
      [   22.330446]  ? lock_acquire+0x360/0x360
      [   22.331013]  SyS_bpf+0x67c/0x24d0
      [   22.331491]  ? trace_hardirqs_on+0xd/0x10
      [   22.332049]  ? __getnstimeofday64+0xaf/0x1c0
      [   22.332635]  ? bpf_prog_get+0x20/0x20
      [   22.333135]  ? __audit_syscall_entry+0x300/0x600
      [   22.333770]  ? syscall_trace_enter+0x540/0xdd0
      [   22.334339]  ? exit_to_usermode_loop+0xe0/0xe0
      [   22.334950]  ? do_syscall_64+0x48/0x410
      [   22.335446]  ? bpf_prog_get+0x20/0x20
      [   22.335954]  do_syscall_64+0x181/0x410
      [   22.336454]  entry_SYSCALL64_slow_path+0x25/0x25
      [   22.337121] RIP: 0033:0x7f263fe81f19
      [   22.337618] RSP: 002b:00007ffd9a3440c8 EFLAGS: 00000202 ORIG_RAX: 0000000000000141
      [   22.338619] RAX: ffffffffffffffda RBX: 0000000000aac5fb RCX: 00007f263fe81f19
      [   22.339600] RDX: 0000000000000030 RSI: 00007ffd9a3440d0 RDI: 0000000000000005
      [   22.340470] RBP: 0000000000a9a1e0 R08: 0000000000a9a1e0 R09: 0000009d00000001
      [   22.341430] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000010000
      [   22.342411] R13: 0000000000a9a023 R14: 0000000000000001 R15: 0000000000000003
      [   22.343369]
      [   22.343593] The buggy address belongs to the variable:
      [   22.344241]  interpreters+0x80/0x980
      [   22.344708]
      [   22.344908] Memory state around the buggy address:
      [   22.345556]  ffffffff82aad980: 00 00 00 04 fa fa fa fa 04 fa fa fa fa fa fa fa
      [   22.346449]  ffffffff82aada00: 00 00 00 00 00 fa fa fa fa fa fa fa 00 00 00 00
      [   22.347361] >ffffffff82aada80: 00 00 00 00 00 00 00 00 00 00 00 00 fa fa fa fa
      [   22.348301]                                                        ^
      [   22.349142]  ffffffff82aadb00: 00 01 fa fa fa fa fa fa 00 00 00 00 00 00 00 00
      [   22.350058]  ffffffff82aadb80: 00 00 07 fa fa fa fa fa 00 00 05 fa fa fa fa fa
      [   22.350984] ==================================================================
      
      Fixes: b870aa90 ("bpf: use different interpreter depending on required stack size")
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8007e40a
    • bpf: Add syscall lookup support for fd array and htab · 14dc6f04
      Martin KaFai Lau authored
      
      
      This patch allows userspace to do BPF_MAP_LOOKUP_ELEM on
      BPF_MAP_TYPE_PROG_ARRAY,
      BPF_MAP_TYPE_ARRAY_OF_MAPS and
      BPF_MAP_TYPE_HASH_OF_MAPS.
      
      The lookup returns a prog-id or map-id to userspace.  Userspace can
      then use BPF_PROG_GET_FD_BY_ID or BPF_MAP_GET_FD_BY_ID to get a fd
      (see the sketch below).
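
      A hedged userspace sketch of the two-step dance (raw bpf(2) syscall
      for clarity; error handling kept minimal):

        #include <linux/bpf.h>
        #include <string.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        /* read a prog-id out of a PROG_ARRAY slot and turn it into a fd */
        static int prog_fd_from_array(int map_fd, __u32 key)
        {
                union bpf_attr attr;
                __u32 prog_id;

                memset(&attr, 0, sizeof(attr));
                attr.map_fd = map_fd;
                attr.key = (__u64)(unsigned long)&key;
                attr.value = (__u64)(unsigned long)&prog_id;
                if (syscall(__NR_bpf, BPF_MAP_LOOKUP_ELEM, &attr,
                            sizeof(attr)))
                        return -1;

                memset(&attr, 0, sizeof(attr));
                attr.prog_id = prog_id;
                return syscall(__NR_bpf, BPF_PROG_GET_FD_BY_ID, &attr,
                               sizeof(attr));
        }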
      
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      14dc6f04
    • ftrace: Fix regression with module command in stack_trace_filter · 0f179765
      Steven Rostedt (VMware) authored
      
      
      When doing the following command:
      
       # echo ":mod:kvm_intel" > /sys/kernel/tracing/stack_trace_filter
      
      it triggered a crash.
      
      This happened with the cleanup of probes, which required all callers
      of the regex function (doing ftrace filtering) to have ops->private be
      a pointer to a trace_array.  But for the stack tracer, that is not the
      case.
      
      Allow for the ops->private to be NULL, and change the function command
      callbacks to handle the trace_array pointer being NULL as well.
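
      Roughly, the callbacks now guard the pointer first (a hedged
      sketch; the exact bail-out behavior is an assumption):

        struct trace_array *tr = ops->private;

        /* stack_trace_filter's ops has no trace_array behind it */
        if (!tr)
                return -ENODEV; /* assumed: skip instance-specific work */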
      
      Fixes: d2afd57a ("tracing/ftrace: Allow instances to have their own function probes")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      0f179765
    • sched/numa: Hide numa_wake_affine() from UP build · ff801b71
      Thomas Gleixner authored
      
      
      Stephen reported the following build warning in UP:
      
        kernel/sched/fair.c:2657:9: warning: 'struct sched_domain' declared inside parameter list
        kernel/sched/fair.c:2657:9: warning: its scope is only this definition or declaration, which is probably not what you want
      
      Hide the numa_wake_affine() inline stub on UP builds to get rid of it.
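
      Schematically, the stub moves under the SMP guard, since struct
      sched_domain only exists there (a hedged sketch; the signature is
      inferred from 3fed382b and may differ):

        #ifdef CONFIG_SMP
        static inline bool numa_wake_affine(struct sched_domain *sd,
                                            struct task_struct *p,
                                            int this_cpu, int prev_cpu,
                                            int sync)
        {
                return true;
        }
        #endif /* on UP there is no stub at all */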
      
      Fixes: 3fed382b ("sched/numa: Implement NUMA node level wake_affine()")
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      ff801b71