  1. Aug 20, 2017
    • bpf: inline map in map lookup functions for array and htab · 7b0c2a05
      Daniel Borkmann authored
      
      
      Avoid two successive function calls for the map in map lookup: the first
      is the bpf_map_lookup_elem() helper call, and the second the callback via
      map->ops->map_lookup_elem() to get to the map in map implementation. The
      implementation inlines the array and htab flavors for map in map lookups.
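
      To make the shape of such a lookup concrete, here is a small BPF C sketch
      of a map in map access in the style of samples/bpf (map name, sizes and
      the bpf_helpers.h include are assumptions, and the inner map is wired up
      from userspace at load time):

          #include <uapi/linux/bpf.h>
          #include "bpf_helpers.h"   /* samples/bpf helper declarations */

          struct bpf_map_def SEC("maps") outer_array_of_maps = {
                  .type        = BPF_MAP_TYPE_ARRAY_OF_MAPS,
                  .key_size    = sizeof(int),
                  .value_size  = sizeof(int),  /* inner map fd/id */
                  .max_entries = 8,
          };

          SEC("socket")
          int map_in_map_lookup(struct __sk_buff *skb)
          {
                  int outer_key = 0, inner_key = 0;
                  void *inner_map;
                  long *value;

                  /* Outer lookup on the ARRAY_OF_MAPS: this is the flavor the
                   * patch above inlines.
                   */
                  inner_map = bpf_map_lookup_elem(&outer_array_of_maps, &outer_key);
                  if (!inner_map)
                          return 0;

                  /* Separate lookup on the returned inner map. */
                  value = bpf_map_lookup_elem(inner_map, &inner_key);
                  if (value)
                          __sync_fetch_and_add(value, 1);

                  return 0;
          }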
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7b0c2a05
    • bpf: make htab inlining more robust wrt assumptions · 89c63074
      Daniel Borkmann authored
      
      
      Commit 9015d2f5 ("bpf: inline htab_map_lookup_elem()") made the
      assumption that a direct call emission to the function
      __htab_map_lookup_elem() would always work out for JITs.
      
      This is currently true since all JITs we have are for 64 bit archs,
      but in the case of 32 bit JITs like the upcoming arm32, we get a NULL
      pointer dereference when executing the call to __htab_map_lookup_elem()
      since the passed arguments are of a different size (due to pointer args)
      than what we pass out of BPF. Guard and thus limit this for now to the
      current 64 bit JITs only.
      
      Reported-by: Shubham Bansal <illusionist.neo@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      89c63074
    • bpf: Allow selecting numa node during map creation · 96eabe7a
      Martin KaFai Lau authored
      
      
      The current map creation API does not allow providing the numa-node
      preference.  The memory usually comes from the node where the
      map-creation process is running.  The performance is not ideal if the
      bpf_prog is known to always run on a numa node different from that of
      the map-creation process.
      
      One of the use cases is sharding per CPU into different LRU maps (i.e.
      an array of LRU maps).  Here is the test result of map_perf_test on
      the INNER_LRU_HASH_PREALLOC test if we force the lru map used by
      CPU0 to be allocated from a remote numa node:
      
      [ The machine has 20 cores. CPU0-9 at node 0. CPU10-19 at node 1 ]
      
      # taskset -c 10 ./map_perf_test 512 8 1260000 8000000
      5:inner_lru_hash_map_perf pre-alloc 1628380 events per sec
      4:inner_lru_hash_map_perf pre-alloc 1626396 events per sec
      3:inner_lru_hash_map_perf pre-alloc 1626144 events per sec
      6:inner_lru_hash_map_perf pre-alloc 1621657 events per sec
      2:inner_lru_hash_map_perf pre-alloc 1621534 events per sec
      1:inner_lru_hash_map_perf pre-alloc 1620292 events per sec
      7:inner_lru_hash_map_perf pre-alloc 1613305 events per sec
      0:inner_lru_hash_map_perf pre-alloc 1239150 events per sec  #<<<
      
      After specifying numa node:
      # taskset -c 10 ./map_perf_test 512 8 1260000 8000000
      5:inner_lru_hash_map_perf pre-alloc 1629627 events per sec
      3:inner_lru_hash_map_perf pre-alloc 1628057 events per sec
      1:inner_lru_hash_map_perf pre-alloc 1623054 events per sec
      6:inner_lru_hash_map_perf pre-alloc 1616033 events per sec
      2:inner_lru_hash_map_perf pre-alloc 1614630 events per sec
      4:inner_lru_hash_map_perf pre-alloc 1612651 events per sec
      7:inner_lru_hash_map_perf pre-alloc 1609337 events per sec
      0:inner_lru_hash_map_perf pre-alloc 1619340 events per sec #<<<
      
      This patch adds one field, numa_node, to the bpf_attr.  Since numa node 0
      is a valid node, a new flag BPF_F_NUMA_NODE is also added.  The numa_node
      field is honored if and only if the BPF_F_NUMA_NODE flag is set.
      
      Numa node selection is not supported for percpu map.
      
      This patch does not change all of the kmalloc calls.  F.e.
      'htab = kzalloc()' is not changed since the object
      is small enough to stay in the cache.
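
      For illustration, a minimal userspace sketch of creating a map on a
      chosen node with the new attribute (map type, sizes and error handling
      are arbitrary; numa_node and BPF_F_NUMA_NODE are the uapi additions
      described above):

          #include <linux/bpf.h>
          #include <string.h>
          #include <sys/syscall.h>
          #include <unistd.h>

          static int create_lru_map_on_node(int node)
          {
                  union bpf_attr attr;

                  memset(&attr, 0, sizeof(attr));
                  attr.map_type    = BPF_MAP_TYPE_LRU_HASH;
                  attr.key_size    = 4;
                  attr.value_size  = 8;
                  attr.max_entries = 4096;
                  /* numa_node is honored if and only if this flag is set,
                   * since node 0 is itself a valid node.
                   */
                  attr.map_flags   = BPF_F_NUMA_NODE;
                  attr.numa_node   = node;

                  return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
          }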
      
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      96eabe7a
  2. Aug 18, 2017
  3. Aug 17, 2017
  4. Aug 16, 2017
    • bpf: sock_map fixes for !CONFIG_BPF_SYSCALL and !STREAM_PARSER · 6bdc9c4c
      John Fastabend authored
      
      
      Resolve issues with !CONFIG_BPF_SYSCALL and !STREAM_PARSER
      
      net/core/filter.c: In function ‘do_sk_redirect_map’:
      net/core/filter.c:1881:3: error: implicit declaration of function ‘__sock_map_lookup_elem’ [-Werror=implicit-function-declaration]
         sk = __sock_map_lookup_elem(ri->map, ri->ifindex);
         ^
      net/core/filter.c:1881:6: warning: assignment makes pointer from integer without a cast [enabled by default]
         sk = __sock_map_lookup_elem(ri->map, ri->ifindex);
      
      Fixes: 174a79ff ("bpf: sockmap with sk redirect support")
      Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6bdc9c4c
    • bpf: sockmap state change warning fix · cf56e3b9
      John Fastabend authored
      
      
      psock will be uninitialized in the default case, so we need to do the same
      psock lookup and check as in the other branch. This fixes the compile
      warning below.
      
      kernel/bpf/sockmap.c: In function ‘smap_state_change’:
      kernel/bpf/sockmap.c:156:21: warning: ‘psock’ may be used uninitialized in this function [-Wmaybe-uninitialized]
        struct smap_psock *psock;
      
      Fixes: 174a79ff ("bpf: sockmap with sk redirect support")
      Reported-by: David Miller <davem@davemloft.net>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cf56e3b9
    • bpf: devmap: remove unnecessary value size check · cf9d0140
      John Fastabend authored
      
      
      In the devmap alloc map logic we check to ensure that the size of the
      values is not greater than KMALLOC_MAX_SIZE. But, in the dev map case
      we ensure the value size is 4 bytes earlier in the function because all
      values should be netdev ifindex values.
      
      The second check is harmless but is not needed so remove it.
      
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cf9d0140
    • bpf: sockmap with sk redirect support · 174a79ff
      John Fastabend authored
      
      
      Recently we added a new map type called dev map used to forward XDP
      packets between ports (6093ec2d). This patch introduces a
      similar notion for sockets.
      
      A sockmap allows users to add participating sockets to a map. When
      sockets are added to the map, enough context is stored with the
      map entry to use the entry with a new helper:
      
        bpf_sk_redirect_map(map, key, flags)
      
      This helper (analogous to bpf_redirect_map in XDP) is given the map
      and an entry in the map. When called from a sockmap program, discussed
      below, the skb will be sent on the socket using skb_send_sock().
      
      With the above, we need a bpf program to call the helper from, which
      will then implement the send logic. The initial site implemented in this
      series is the recv_sock hook. For this to work we implemented a map
      attach command to add attributes to a map. In sockmap we add two
      programs: a parse program and a verdict program. The parse program
      uses strparser to build messages and pass them to the verdict program.
      The parse programs use the normal strparser semantics. The verdict
      program is of type SK_SKB.
      
      The verdict program returns a verdict, SK_DROP or SK_REDIRECT for
      now. Additional actions may be added later. When SK_REDIRECT is
      returned, as expected when the bpf program uses bpf_sk_redirect_map(),
      the sockmap logic will consult per cpu variables set by the helper
      routine and pull the sock entry out of the sock map. This pattern
      follows the existing redirect logic in cls and xdp programs.
      
      This gives the flow,
      
       recv_sock -> str_parser (parse_prog) -> verdict_prog -> skb_send_sock
                                                           \
                                                            -> kfree_skb
      
      As an example use case a message based load balancer may use specific
      logic in the verdict program to select the sock to send on.
      
      Sample programs are provided in future patches that hopefully illustrate
      the user interfaces. Also selftests are in follow-on patches.
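
      As a rough illustration (not the sample code from this series), a
      verdict program that redirects every parsed message to the socket stored
      at key 0 of a sockmap might look as follows; the map definition
      convention, section name and bpf_helpers.h include are assumptions
      borrowed from samples/bpf:

          #include <uapi/linux/bpf.h>
          #include "bpf_helpers.h"   /* samples/bpf helper declarations */

          struct bpf_map_def SEC("maps") sock_map = {
                  .type        = BPF_MAP_TYPE_SOCKMAP,
                  .key_size    = sizeof(int),
                  .value_size  = sizeof(int),
                  .max_entries = 16,
          };

          SEC("sk_skb_verdict")
          int prog_verdict(struct __sk_buff *skb)
          {
                  int key = 0;

                  /* Per the description above, using the helper signals
                   * SK_REDIRECT to the sockmap logic, which then sends the
                   * skb on the selected socket via skb_send_sock().
                   */
                  return bpf_sk_redirect_map(&sock_map, key, 0);
          }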
      
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      174a79ff
    • bpf: export bpf_prog_inc_not_zero · a6f6df69
      John Fastabend authored
      
      
      bpf_prog_inc_not_zero will be used by upcoming sockmap patches; this
      patch simply exports it so we can pull it in.
      
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a6f6df69
    • bpf: fix bpf_trace_printk on 32 bit archs · 88a5c690
      Daniel Borkmann authored
      
      
      James reported that on MIPS32 bpf_trace_printk() is currently
      broken while MIPS64 works fine:
      
        bpf_trace_printk() uses conditional operators to attempt to
        pass different types to __trace_printk() depending on the
        format operators. This doesn't work as intended on 32-bit
        architectures where u32 and long are passed differently to
        u64, since the result of C conditional operators follows the
        "usual arithmetic conversions" rules, such that the values
        passed to __trace_printk() will always be u64 [causing issues
        later in the va_list handling for vscnprintf()].
      
        For example the samples/bpf/tracex5 test printed lines like
        below on MIPS32, where the fd and buf have come from the u64
        fd argument, and the size from the buf argument:
      
          [...] 1180.941542: 0x00000001: write(fd=1, buf=  (null), size=6258688)
      
        Instead of this:
      
          [...] 1625.616026: 0x00000001: write(fd=1, buf=009e4000, size=512)
      
      One way to get it working is to expand the various combinations
      of argument types into 8 different combinations for 32 bit
      and 64 bit kernels. The fix was tested by James on MIPS32 and MIPS64,
      confirming that it resolves the issue.
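
      As a stand-alone illustration of the C rule at play (not kernel code):
      the result type of a ?: expression follows the usual arithmetic
      conversions, so mixing a 32 bit and a 64 bit operand always yields a
      64 bit value, even when the 32 bit branch is the one selected:

          #include <stdint.h>
          #include <stdio.h>

          int main(void)
          {
                  uint32_t narrow = 42;
                  uint64_t wide = 42;
                  int pick_narrow = 1;

                  /* Prints 8, not 4: the uint32_t operand is converted to
                   * uint64_t before the conditional expression yields a value,
                   * so a variadic callee would also see 8 bytes pushed.
                   */
                  printf("%zu\n", sizeof(pick_narrow ? narrow : wide));
                  return 0;
          }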
      
      Fixes: 9c959c86 ("tracing: Allow BPF programs to call bpf_trace_printk()")
      Reported-by: James Hogan <james.hogan@imgtec.com>
      Tested-by: James Hogan <james.hogan@imgtec.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      88a5c690
  5. Aug 15, 2017
    • bpf/verifier: track liveness for pruning · dc503a8a
      Edward Cree authored
      
      
      State of a register doesn't matter if it wasn't read in reaching an exit;
       a write screens off all reads downstream of it from all explored_states
       upstream of it.
      This allows us to prune many more branches; here are some processed insn
       counts for some Cilium programs:
      Program                  before  after
      bpf_lb_opt_-DLB_L3.o       6515   3361
      bpf_lb_opt_-DLB_L4.o       8976   5176
      bpf_lb_opt_-DUNKNOWN.o     2960   1137
      bpf_lxc_opt_-DDROP_ALL.o  95412  48537
      bpf_lxc_opt_-DUNKNOWN.o  141706  78718
      bpf_netdev.o              24251  17995
      bpf_overlay.o             10999   9385
      
      The runtime is also improved; here are 'time' results in ms:
      Program                  before  after
      bpf_lb_opt_-DLB_L3.o         24      6
      bpf_lb_opt_-DLB_L4.o         26     11
      bpf_lb_opt_-DUNKNOWN.o       11      2
      bpf_lxc_opt_-DDROP_ALL.o   1288    139
      bpf_lxc_opt_-DUNKNOWN.o    1768    234
      bpf_netdev.o                 62     31
      bpf_overlay.o                15     13
      
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dc503a8a
  6. Aug 10, 2017
    • mm: migrate: prevent racy access to tlb_flush_pending · 16af97dc
      Nadav Amit authored
      Patch series "fixes of TLB batching races", v6.
      
      It turns out that the Linux TLB batching mechanism suffers from various
      races.  Races that are caused by batching during reclamation were
      recently handled by Mel and this patch-set deals with others.  The more
      fundamental issue is that concurrent updates of the page-tables allow
      for TLB flushes to be batched on one core, while another core changes
      the page-tables.  This other core may assume a PTE change does not
      require a flush based on the updated PTE value, while it is unaware that
      TLB flushes are still pending.
      
      This behavior affects KSM (which may result in memory corruption) and
      MADV_FREE and MADV_DONTNEED (which may result in incorrect behavior).  A
      proof-of-concept can easily produce the wrong behavior of MADV_DONTNEED.
      Memory corruption in KSM is harder to produce in practice, but was
      observed by hacking the kernel and adding a delay before flushing and
      replacing the KSM page.
      
      Finally, there is also one memory barrier missing, which may affect
      architectures with a weak memory model.
      
      This patch (of 7):
      
      Setting and clearing mm->tlb_flush_pending can be performed by multiple
      threads, since mmap_sem may only be acquired for read in
      task_numa_work().  If this happens, tlb_flush_pending might be cleared
      while one of the threads still changes PTEs and batches TLB flushes.
      
      This can lead to the same race between migration and
      change_protection_range() that led to the introduction of
      tlb_flush_pending.  The result of this race was data corruption, which
      means that this patch also addresses a theoretically possible data
      corruption.
      
      An actual data corruption was not observed, yet the race was
      confirmed by adding an assertion to check that tlb_flush_pending is not
      set by two threads, adding artificial latency in change_protection_range()
      and using sysctl to reduce kernel.numa_balancing_scan_delay_ms.
      
      Link: http://lkml.kernel.org/r/20170802000818.4760-2-namit@vmware.com
      
      
      Fixes: 20841405 ("mm: fix TLB flush race between migration, and change_protection_range")
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      16af97dc
    • mm: fix global NR_SLAB_.*CLAIMABLE counter reads · d507e2eb
      Johannes Weiner authored
      As Tetsuo points out:
       "Commit 385386cf ("mm: vmstat: move slab statistics from zone to
        node counters") broke "Slab:" field of /proc/meminfo . It shows nearly
        0kB"
      
      In addition to /proc/meminfo, this problem also affects the slab
      counters OOM/allocation failure info dumps, can cause early -ENOMEM from
      overcommit protection, and miscalculate image size requirements during
      suspend-to-disk.
      
      This is because the patch in question switched the slab counters from
      the zone level to the node level, but forgot to update the global
      accessor functions to read the aggregate node data instead of the
      aggregate zone data.
      
      Use global_node_page_state() to access the global slab counters.
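
      A sketch of the kind of change involved (simplified; the exact call
      sites are in the patch itself): read the node-level counters the slab
      statistics now live under, instead of summing zone-level data:

          /* e.g. in a meminfo-style accounting path */
          unsigned long slab_reclaimable =
                  global_node_page_state(NR_SLAB_RECLAIMABLE);
          unsigned long slab_unreclaimable =
                  global_node_page_state(NR_SLAB_UNRECLAIMABLE);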
      
      Fixes: 385386cf ("mm: vmstat: move slab statistics from zone to node counters")
      Link: http://lkml.kernel.org/r/20170801134256.5400-1-hannes@cmpxchg.org
      
      
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Stefan Agner <stefan@agner.ch>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d507e2eb
  7. Aug 09, 2017
    • bpf: enable BPF_J{LT, LE, SLT, SLE} opcodes in verifier · b4e432f1
      Daniel Borkmann authored
      
      
      Enable the newly added jump opcodes, main parts are in two
      different areas, namely direct packet access and dynamic map
      value access. For the direct packet access, we now allow for
      the following two new patterns to match in order to trigger
      markings with find_good_pkt_pointers():
      
      Variant 1 (access ok when taking the branch):
      
        0: (61) r2 = *(u32 *)(r1 +76)
        1: (61) r3 = *(u32 *)(r1 +80)
        2: (bf) r0 = r2
        3: (07) r0 += 8
        4: (ad) if r0 < r3 goto pc+2
        R0=pkt(id=0,off=8,r=0) R1=ctx R2=pkt(id=0,off=0,r=0)
        R3=pkt_end R10=fp
        5: (b7) r0 = 0
        6: (95) exit
      
        from 4 to 7: R0=pkt(id=0,off=8,r=8) R1=ctx
                     R2=pkt(id=0,off=0,r=8) R3=pkt_end R10=fp
        7: (71) r0 = *(u8 *)(r2 +0)
        8: (05) goto pc-4
        5: (b7) r0 = 0
        6: (95) exit
        processed 11 insns, stack depth 0
      
      Variant 2 (access ok on fall-through):
      
        0: (61) r2 = *(u32 *)(r1 +76)
        1: (61) r3 = *(u32 *)(r1 +80)
        2: (bf) r0 = r2
        3: (07) r0 += 8
        4: (bd) if r3 <= r0 goto pc+1
        R0=pkt(id=0,off=8,r=8) R1=ctx R2=pkt(id=0,off=0,r=8)
        R3=pkt_end R10=fp
        5: (71) r0 = *(u8 *)(r2 +0)
        6: (b7) r0 = 1
        7: (95) exit
      
        from 4 to 6: R0=pkt(id=0,off=8,r=0) R1=ctx
                     R2=pkt(id=0,off=0,r=0) R3=pkt_end R10=fp
        6: (b7) r0 = 1
        7: (95) exit
        processed 10 insns, stack depth 0
      
      The above two basically just swap the branches where we need
      to handle an exception and allow packet access compared to the
      two already existing variants for find_good_pkt_pointers().
      
      For the dynamic map value access, we add the new instructions
      to reg_set_min_max() and reg_set_min_max_inv() in order to
      learn bounds. Verifier test cases for both are added in a
      follow-up patch.
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b4e432f1
    • bpf: add BPF_J{LT,LE,SLT,SLE} instructions · 92b31a9a
      Daniel Borkmann authored
      Currently, eBPF only understands BPF_JGT (>), BPF_JGE (>=),
      BPF_JSGT (s>), BPF_JSGE (s>=) instructions. This means that in
      particular the *JLT/*JLE counterparts involving immediates need
      to be rewritten from e.g. X < [IMM] by swapping arguments into
      [IMM] > X, meaning the immediate first needs to be loaded
      into a register Y := [IMM], so that we can then compare with
      Y > X. Note that the destination operand is always required to
      be a register.
      
      This has the downside of unnecessarily increased register
      pressure, meaning complex programs would need to spill other
      registers temporarily to stack in order to obtain an unused
      register for the [IMM]. Loading into registers will thus also
      affect state pruning since we need to account for that register
      use and potentially those registers that had to be spilled/filled
      again. As a consequence slightly more stack space might be
      used due to spilling, and BPF programs are a bit longer
      due to the extra code involving the register load and potentially
      required spill/fills.
      
      Thus, add BPF_JLT (<), BPF_JLE (<=), BPF_JSLT (s<), BPF_JSLE (s<=)
      counterparts to the eBPF instruction set. Modifying LLVM to
      remove the NegateCC() workaround in a PoC patch at [1] and
      allowing it to also emit the new instructions resulted in
      cilium's BPF programs that are injected into the fast-path to
      have a reduced program length in the range of 2-3% (e.g.
      accumulated main and tail call sections from one of the object
      files reduced from 4864 to 4729 insns), reduced complexity in
      the range of 10-30% (e.g. accumulated sections reduced in one
      of the cases from 116432 to 88428 insns), and reduced stack
      usage in the range of 1-5% (e.g. accumulated sections from one
      of the object files reduced from 824 to 784b).
      
      The modification for LLVM will be incorporated in a backwards
      compatible way. The plan is for LLVM to have i) a target specific
      option to offer the possibility to explicitly enable the extension
      by the user (as we have with -m target specific extensions today
      for various CPU insns), and ii) to have the kernel probed for
      presence of the extensions and enable them transparently when
      the user is selecting more aggressive options such as -march=native
      in a bpf target context. (Other frontends generating BPF byte
      code, e.g. ply, can probe the kernel directly for its code
      generation.)
      
        [1] https://github.com/borkmann/llvm/tree/bpf-insns
      
      
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      92b31a9a
    • futex: Remove unnecessary warning from get_futex_key · 48fb6f4d
      Mel Gorman authored
      
      
      Commit 65d8fc77 ("futex: Remove requirement for lock_page() in
      get_futex_key()") removed an unnecessary lock_page() with the
      side-effect that page->mapping needed to be treated very carefully.
      
      Two defensive warnings were added in case any assumption was missed and
      the first warning assumed a correct application would not alter a
      mapping backing a futex key.  Since merging, it has not triggered for
      any unexpected case but Mark Rutland reported the following bug
      triggering due to the first warning.
      
        kernel BUG at kernel/futex.c:679!
        Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
        Modules linked in:
        CPU: 0 PID: 3695 Comm: syz-executor1 Not tainted 4.13.0-rc3-00020-g307fec773ba3 #3
        Hardware name: linux,dummy-virt (DT)
        task: ffff80001e271780 task.stack: ffff000010908000
        PC is at get_futex_key+0x6a4/0xcf0 kernel/futex.c:679
        LR is at get_futex_key+0x6a4/0xcf0 kernel/futex.c:679
        pc : [<ffff00000821ac14>] lr : [<ffff00000821ac14>] pstate: 80000145
      
      The fact that it's a bug instead of a warning was due to an unrelated
      arm64 problem, but the warning itself triggered because the underlying
      mapping changed.
      
      This is an application issue but from a kernel perspective it's a
      recoverable situation and the warning is unnecessary so this patch
      removes the warning.  The warning may potentially be triggered with the
      following test program from Mark although it may be necessary to adjust
      NR_FUTEX_THREADS to be a value smaller than the number of CPUs in the
      system.
      
          #include <linux/futex.h>
          #include <pthread.h>
          #include <stdio.h>
          #include <stdlib.h>
          #include <sys/mman.h>
          #include <sys/syscall.h>
          #include <sys/time.h>
          #include <unistd.h>
      
          #define NR_FUTEX_THREADS 16
          pthread_t threads[NR_FUTEX_THREADS];
      
          void *mem;
      
          #define MEM_PROT  (PROT_READ | PROT_WRITE)
          #define MEM_SIZE  65536
      
          static int futex_wrapper(int *uaddr, int op, int val,
                                   const struct timespec *timeout,
                                   int *uaddr2, int val3)
          {
              return syscall(SYS_futex, uaddr, op, val, timeout, uaddr2, val3);
          }
      
          void *poll_futex(void *unused)
          {
              for (;;) {
                  futex_wrapper(mem, FUTEX_CMP_REQUEUE_PI, 1, NULL, mem + 4, 1);
              }
          }
      
          int main(int argc, char *argv[])
          {
              int i;
      
              mem = mmap(NULL, MEM_SIZE, MEM_PROT,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
      
              printf("Mapping @ %p\n", mem);
      
              printf("Creating futex threads...\n");
      
              for (i = 0; i < NR_FUTEX_THREADS; i++)
                  pthread_create(&threads[i], NULL, poll_futex, NULL);
      
              printf("Flipping mapping...\n");
              for (;;) {
                  mmap(mem, MEM_SIZE, MEM_PROT,
                       MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, -1, 0);
              }
      
              return 0;
          }
      
      Reported-and-tested-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: stable@vger.kernel.org # 4.7+
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      48fb6f4d
    • bpf: Extend check_uarg_tail_zero() checks · 752ba56f
      Mickaël Salaün authored
      
      
      The function check_uarg_tail_zero() was created from bpf(2) for
      BPF_OBJ_GET_INFO_BY_FD without the access_ok() nor the PAGE_SIZE
      checks. Make these checks more generally available while unlikely to be
      triggered, extend the memory range check and add an explanation of
      why the ToCToU should not be a security concern.
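
      Conceptually, the check verifies that any bytes a newer userspace passed
      beyond the size the kernel knows about are zero, so the structure can be
      extended later. A simplified sketch of the idea (not the kernel's actual
      implementation, which reads the user buffer directly):

          /* uaddr_copy: kernel copy of the user-supplied struct,
           * actual_size: size userspace claimed,
           * expected_size: size the kernel understands.
           * Reject if the unknown tail is non-zero.
           */
          static int tail_is_zero(const unsigned char *uaddr_copy,
                                  size_t actual_size, size_t expected_size)
          {
                  size_t i;

                  if (actual_size <= expected_size)
                          return 0;
                  for (i = expected_size; i < actual_size; i++)
                          if (uaddr_copy[i] != 0)
                                  return -E2BIG;
                  return 0;
          }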
      
      Signed-off-by: Mickaël Salaün <mic@digikod.net>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Link: https://lkml.kernel.org/r/CAGXu5j+vRGFvJZmjtAcT8Hi8B+Wz0e1b6VKYZHfQP_=DXzC4CQ@mail.gmail.com
      
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
      752ba56f
    • bpf: Move check_uarg_tail_zero() upward · 58291a74
      Mickaël Salaün authored
      
      
      The function check_uarg_tail_zero() may be useful for other parts of the
      code in the syscall.c file. Move this function to the beginning of the
      file.
      
      Signed-off-by: Mickaël Salaün <mic@digikod.net>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      58291a74
    • bpf/verifier: increase complexity limit to 128k · 8e17c1b1
      Edward Cree authored
      
      
      The more detailed value tracking can reduce the effectiveness of pruning
       for some programs.  So, to avoid rejecting previously valid programs, up
       the limit to 128k insns.  Hopefully we will be able to bring this back
       down later by improving pruning performance.
      
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8e17c1b1
    • bpf/verifier: track signed and unsigned min/max values · b03c9f9f
      Edward Cree authored
      
      
      Allows us to, sometimes, combine information from a signed check of one
       bound and an unsigned check of the other; for example, a signed check
       that a value is s>= 0 combined with an unsigned check that it is <= 100
       pins it to [0, 100] under both interpretations.
      We now track the full range of possible values, rather than restricting
       ourselves to [0, 1<<30) and considering anything beyond that as
       unknown.  While this is probably not necessary, it makes the code more
       straightforward and symmetrical between signed and unsigned bounds.
      
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b03c9f9f
    • bpf/verifier: rework value tracking · f1174f77
      Edward Cree authored
      
      
      Unifies adjusted and unadjusted register value types (e.g. FRAME_POINTER is
       now just a PTR_TO_STACK with zero offset).
      Tracks value alignment by means of tracking known & unknown bits.  This
       also replaces the 'reg->imm' (leading zero bits) calculations for (what
       were) UNKNOWN_VALUEs.
      If pointer leaks are allowed, and adjust_ptr_min_max_vals returns -EACCES,
       treat the pointer as an unknown scalar and try again, because we might be
       able to conclude something about the result (e.g. pointer & 0x40 is either
       0 or 0x40).
      Verifier hooks in the netronome/nfp driver were changed to match the new
       data structures.
      
      Signed-off-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f1174f77
  8. Aug 07, 2017
    • bpf: devmap fix mutex in rcu critical section · 4cc7b954
      John Fastabend authored
      
      
      Originally we used a mutex to protect concurrent devmap update
      and delete operations from racing with netdev unregister notifier
      callbacks.
      
      The notifier hook is needed because we increment the netdev ref
      count when a dev is added to the devmap. This ensures the netdev
      reference is valid in the datapath. However, we don't want to block
      unregister events, hence the initial mutex and notifier handler.
      
      The concern was in the notifier hook we search the map for dev
      entries that hold a refcnt on the net device being torn down. But,
      in order to do this we require two steps,
      
        (i) dereference the netdev:  dev = rcu_dereference(map[i])
       (ii) test ifindex:   dev->ifindex == removing_ifindex
      
      and then finally we can swap in the NULL dev in the map via an
      xchg operation,
      
        xchg(map[i], NULL)
      
      The danger here is that a concurrent update could run its own
      xchg op, leading us to replace the new dev with a
      NULL dev incorrectly.
      
            CPU 1                        CPU 2
      
         notifier hook                   bpf devmap update
      
         dev = rcu_dereference(map[i])
                                         dev = rcu_dereference(map[i])
                                         xchg(map[i], new_dev);
                                         rcu_call(dev,...)
         xchg(map[i], NULL)
      
      The above flow would create the incorrect state with the dev
      reference in the update path being lost. To resolve this the
      original code used a mutex around the above block. However,
      updates, deletes, and lookups occur inside rcu critical sections
      so we can't use a mutex in this context safely.
      
      Fortunately, by writing slightly better code we can avoid the
      mutex altogether. If CPU 1 in the above example uses a cmpxchg
      and _only_ replaces the dev reference in the map when it is in
      fact the expected dev, the race is removed completely. The two
      cases are illustrated here, first the race condition:
      
            CPU 1                          CPU 2
      
         notifier hook                     bpf devmap update
      
         dev = rcu_dereference(map[i])
                                           dev = rcu_dereference(map[i])
                                           xchg(map[i], new_dev);
                                           rcu_call(dev,...)
         odev = cmpxchg(map[i], dev, NULL)
      
      Now we can test the cmpxchg return value, detect odev != dev and
      abort. Or in the good case,
      
            CPU 1                          CPU 2
      
         notifier hook                     bpf devmap update
         dev = rcu_dereference(map[i])
         odev = cmpxchg(map[i], dev, NULL)
                                           [...]
      
      Now 'odev == dev' and we can do proper cleanup.
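
      A simplified sketch of the lock-free removal described above (entry
      types and the cleanup call are placeholders rather than the exact
      devmap code):

          for (i = 0; i < map_size; i++) {
                  struct net_device *dev, *odev;

                  dev = rcu_dereference(map[i]);
                  if (!dev || dev->ifindex != removing_ifindex)
                          continue;

                  /* Clear the slot only if it still holds the entry we looked
                   * up; a concurrent update that already installed a new dev
                   * makes the cmpxchg fail and is left untouched.
                   */
                  odev = cmpxchg(&map[i], dev, NULL);
                  if (odev == dev)
                          dev_put(dev);   /* our removal won, drop the map's ref */
          }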
      
      And voila, the original race we tried to solve with a mutex is
      corrected and the trace noted by Sasha below is resolved due
      to the removal of the mutex.
      
      Note: When walking the devmap and removing dev references as needed
      we depend on the core to fail any calls to dev_get_by_index() using
      the ifindex of the device being removed. This way we do not race with
      the user while searching the devmap.
      
      Additionally, the mutex was also protecting list add/del/read on
      the list of maps in-use. This patch converts this to an RCU list
      and spinlock implementation. This protects the list from concurrent
      alloc/free operations. The notifier hook walks this list so it uses
      RCU read semantics.
      
      BUG: sleeping function called from invalid context at kernel/locking/mutex.c:747
      in_atomic(): 1, irqs_disabled(): 0, pid: 16315, name: syz-executor1
      1 lock held by syz-executor1/16315:
       #0:  (rcu_read_lock){......}, at: [<ffffffff8c363bc2>] map_delete_elem kernel/bpf/syscall.c:577 [inline]
       #0:  (rcu_read_lock){......}, at: [<ffffffff8c363bc2>] SYSC_bpf kernel/bpf/syscall.c:1427 [inline]
       #0:  (rcu_read_lock){......}, at: [<ffffffff8c363bc2>] SyS_bpf+0x1d32/0x4ba0 kernel/bpf/syscall.c:1388
      
      Fixes: 2ddf71e2 ("net: add notifier hooks for devmap bpf map")
      Reported-by: Sasha Levin <alexander.levin@verizon.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4cc7b954
    • bpf: add support for sys_enter_* and sys_exit_* tracepoints · cf5f5cea
      Yonghong Song authored
      Currently, bpf programs cannot be attached to sys_enter_* and sys_exit_*
      style tracepoints. The iovisor/bcc issue #748
      (https://github.com/iovisor/bcc/issues/748) documents this issue.
      For example, if you try to attach a bpf program to tracepoints
      syscalls/sys_enter_newfstat, you will get the following error:
         # ./tools/trace.py t:syscalls:sys_enter_newfstat
         Ioctl(PERF_EVENT_IOC_SET_BPF): Invalid argument
         Failed to attach BPF to tracepoint
      
      The main reason is that syscalls/sys_enter_* and syscalls/sys_exit_*
      tracepoints are treated differently from other tracepoints and there
      is no bpf hook for them.
      
      This patch adds bpf support for these syscalls tracepoints by
        . permitting bpf attachment in ioctl PERF_EVENT_IOC_SET_BPF
        . calling bpf programs in perf_syscall_enter and perf_syscall_exit
      
      The legality of bpf program ctx access is also checked.
      The function trace_event_get_offsets() returns the correct max offset for
      each specific syscall tracepoint, which is compared against the maximum
      offset accessed in the bpf program.
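
      For reference, a userspace sketch of the attach path this enables
      (illustrative only, error handling trimmed; tracepoint_id would be read
      from /sys/kernel/debug/tracing/events/syscalls/sys_enter_newfstat/id and
      prog_fd comes from a prior BPF_PROG_LOAD):

          #include <linux/perf_event.h>
          #include <string.h>
          #include <sys/ioctl.h>
          #include <sys/syscall.h>
          #include <unistd.h>

          static int attach_bpf_to_tracepoint(unsigned long long tracepoint_id,
                                              int prog_fd)
          {
                  struct perf_event_attr attr;
                  int efd;

                  memset(&attr, 0, sizeof(attr));
                  attr.type        = PERF_TYPE_TRACEPOINT;
                  attr.size        = sizeof(attr);
                  attr.config      = tracepoint_id;
                  attr.sample_type = PERF_SAMPLE_RAW;

                  efd = syscall(__NR_perf_event_open, &attr, -1 /* pid */,
                                0 /* cpu */, -1 /* group_fd */, 0);
                  if (efd < 0)
                          return -1;

                  /* This ioctl is what used to fail with -EINVAL for
                   * syscalls/sys_enter_* tracepoints before this patch.
                   */
                  ioctl(efd, PERF_EVENT_IOC_SET_BPF, prog_fd);
                  ioctl(efd, PERF_EVENT_IOC_ENABLE, 0);
                  return efd;
          }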
      
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cf5f5cea
  9. Aug 06, 2017
  10. Aug 03, 2017
    • cpuset: fix a deadlock due to incomplete patching of cpusets_enabled() · 89affbf5
      Dima Zavin authored
      In codepaths that use the begin/retry interface for reading
      mems_allowed_seq with irqs disabled, there exists a race condition that
      stalls the patch process after only modifying a subset of the
      static_branch call sites.
      
      This problem manifested itself as a deadlock in the slub allocator,
      inside get_any_partial.  The loop reads the mems_allowed_seq value (via
      read_mems_allowed_begin), performs the defrag operation, and then
      verifies the consistency of mems_allowed via read_mems_allowed_retry
      and the cookie returned by xxx_begin.
      
      The issue here is that both begin and retry first check if cpusets are
      enabled via cpusets_enabled() static branch.  This branch can be
      rewritten dynamically (via cpuset_inc) if a new cpuset is created.  The
      x86 jump label code fully synchronizes across all CPUs for every entry
      it rewrites.  If it rewrites only one of the callsites (specifically the
      one in read_mems_allowed_retry) and then waits for the
      smp_call_function(do_sync_core) to complete while a CPU is inside the
      begin/retry section with IRQs off and the mems_allowed value is changed,
      we can hang.
      
      This is because begin() will always return 0 (since it wasn't patched
      yet) while retry() will test the 0 against the actual value of the seq
      counter.
      
      The fix is to use two different static keys: one for begin
      (pre_enable_key) and one for retry (enable_key).  In cpuset_inc(), we
      first bump the pre_enable key to ensure that cpuset_mems_allowed_begin()
      always returns a valid seqcount if we are enabling cpusets.  Similarly, when
      disabling cpusets via cpuset_dec(), we first ensure that callers of
      cpuset_mems_allowed_retry() will start ignoring the seqcount value
      before we let cpuset_mems_allowed_begin() return 0.
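
      Roughly, the begin/retry pair then looks like the following sketch
      (simplified from the description above, not the exact kernel code):

          static inline unsigned int read_mems_allowed_begin(void)
          {
                  /* keyed off the pre_enable key: patched first on enable */
                  if (!static_branch_unlikely(&cpusets_pre_enable_key))
                          return 0;
                  return read_seqcount_begin(&current->mems_allowed_seq);
          }

          static inline bool read_mems_allowed_retry(unsigned int seq)
          {
                  /* keyed off the enable key: patched second on enable */
                  if (!static_branch_unlikely(&cpusets_enabled_key))
                          return false;
                  return read_seqcount_retry(&current->mems_allowed_seq, seq);
          }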
      
      The relevant stack traces of the two stuck threads:
      
        CPU: 1 PID: 1415 Comm: mkdir Tainted: G L  4.9.36-00104-g540c51286237 #4
        Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
        task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
        RIP: smp_call_function_many+0x1f9/0x260
        Call Trace:
          smp_call_function+0x3b/0x70
          on_each_cpu+0x2f/0x90
          text_poke_bp+0x87/0xd0
          arch_jump_label_transform+0x93/0x100
          __jump_label_update+0x77/0x90
          jump_label_update+0xaa/0xc0
          static_key_slow_inc+0x9e/0xb0
          cpuset_css_online+0x70/0x2e0
          online_css+0x2c/0xa0
          cgroup_apply_control_enable+0x27f/0x3d0
          cgroup_mkdir+0x2b7/0x420
          kernfs_iop_mkdir+0x5a/0x80
          vfs_mkdir+0xf6/0x1a0
          SyS_mkdir+0xb7/0xe0
          entry_SYSCALL_64_fastpath+0x18/0xad
      
        ...
      
        CPU: 2 PID: 1 Comm: init Tainted: G L  4.9.36-00104-g540c51286237 #4
        Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
        task: ffff8818087c0000 task.stack: ffffc90000030000
        RIP: int3+0x39/0x70
        Call Trace:
          <#DB> ? ___slab_alloc+0x28b/0x5a0
          <EOE> ? copy_process.part.40+0xf7/0x1de0
          __slab_alloc.isra.80+0x54/0x90
          copy_process.part.40+0xf7/0x1de0
          copy_process.part.40+0xf7/0x1de0
          kmem_cache_alloc_node+0x8a/0x280
          copy_process.part.40+0xf7/0x1de0
          _do_fork+0xe7/0x6c0
          _raw_spin_unlock_irq+0x2d/0x60
          trace_hardirqs_on_caller+0x136/0x1d0
          entry_SYSCALL_64_fastpath+0x5/0xad
          do_syscall_64+0x27/0x350
          SyS_clone+0x19/0x20
          do_syscall_64+0x60/0x350
          entry_SYSCALL64_slow_path+0x25/0x25
      
      Link: http://lkml.kernel.org/r/20170731040113.14197-1-dmitriyz@waymo.com
      
      
      Fixes: 46e700ab ("mm, page_alloc: remove unnecessary taking of a seqlock when cpusets are disabled")
      Signed-off-by: Dima Zavin <dmitriyz@waymo.com>
      Reported-by: Cliff Spradlin <cspradlin@waymo.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      89affbf5
  11. Aug 02, 2017
  12. Aug 01, 2017
  13. Jul 30, 2017
  14. Jul 29, 2017
  15. Jul 28, 2017
    • workqueue: Work around edge cases for calc of pool's cpumask · 1ad0f0a7
      Michael Bringmann authored
      
      
      There is an underlying assumption/trade-off in many layers of the Linux
      system that CPU <-> node mapping is static.  This is despite the presence
      of features like NUMA and 'hotplug' that support the dynamic addition/
      removal of fundamental system resources like CPUs and memory.  PowerPC
      systems, however, do provide extensive features for the dynamic change
      of resources available to a system.
      
      Currently, there is little or no synchronization protection around the
      updating of the CPU <-> node mapping, and the export/update of this
      information for other layers / modules.  In systems which can change
      this mapping during 'hotplug', like PowerPC, the information is changing
      underneath all layers that might reference it.
      
      This patch attempts to ensure that a valid, usable cpumask attribute
      is used by the workqueue infrastructure when setting up new resource
      pools.  It prevents a crash that has been observed when an 'empty'
      cpumask is passed along to the worker/task scheduling code.  It is
      intended as a temporary workaround until a more fundamental review and
      correction of the issue can be done.
      
      [With additions to the patch provided by Tejun Heo <tj@kernel.org>]
      
      Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      1ad0f0a7
  16. Jul 27, 2017
    • genirq/cpuhotplug: Revert "Set force affinity flag on hotplug migration" · 83979133
      Thomas Gleixner authored
      
      
      That commit was part of the changes moving x86 to the generic CPU hotplug
      interrupt migration code. The force flag was required on x86 before the
      hierarchical irqdomain rework, but invoking set_affinity() with force=true
      stayed and had no side effects.
      
      At some point in the past, the force flag got repurposed to support the
      exynos timer interrupt affinity setting to a not yet online CPU, so the
      interrupt controller callback does not verify the supplied affinity mask
      against cpu_online_mask.
      
      Setting the flag in the CPU hotplug code causes the cpu online masking to
      be blocked on these irq controllers and results in potentially affining an
      interrupt to the CPU which is unplugged, i.e. instead of moving it away,
      it's just reassigned to it.
      
      As the force flag is no longer needed on x86, it's safe to revert that
      patch so the ARM irqchips which use the force flag work again.
      
      Add comments to that effect, so this won't happen again.
      
      Note: The online mask handling should be done in the generic code and the
      force flag and the masking in the irq chips removed all together, but
      that's not a change possible for 4.13. 
      
      Fixes: 77f85e66 ("genirq/cpuhotplug: Set force affinity flag on hotplug migration")
      Reported-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: LAK <linux-arm-kernel@lists.infradead.org>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1707271217590.3109@nanos
      
      
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      83979133
  17. Jul 25, 2017
    • workqueue: implicit ordered attribute should be overridable · 0a94efb5
      Tejun Heo authored
      
      
      5c0338c6 ("workqueue: restore WQ_UNBOUND/max_active==1 to be
      ordered") automatically enabled ordered attribute for unbound
      workqueues w/ max_active == 1.  Because ordered workqueues reject
      max_active and some attribute changes, this implicit ordered mode
      broke cases where the user creates an unbound workqueue w/ max_active
      == 1 and later explicitly changes the related attributes.
      
      This patch distinguishes between explicit and implicit ordered settings
      and allows attribute changes to override the ordered mode when it was
      set implicitly.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: 5c0338c6 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")
      0a94efb5
    • sched/core: Fix some documentation build warnings · bf50f0e8
      Jonathan Corbet authored
      
      
      The kerneldoc comments for try_to_wake_up_local() were out of date, leading
      to these documentation build warnings:
      
        ./kernel/sched/core.c:2080: warning: No description found for parameter 'rf'
        ./kernel/sched/core.c:2080: warning: Excess function parameter 'cookie' description in 'try_to_wake_up_local'
      
      Update the comment to reflect current reality and give us some peace and
      quiet.
      
      Signed-off-by: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-doc@vger.kernel.org
      Link: http://lkml.kernel.org/r/20170724135628.695cecfc@lwn.net
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      bf50f0e8
  18. Jul 24, 2017