  5. Aug 13, 2020
    • net: openvswitch: introduce common code for flushing flows · 1f3a090b
      Tonghao Zhang authored
      
      
      To avoid issues such as an RCU usage warning and a double free, we
      should flush the flows under ovs_lock. This patch refactors
      table_instance_destroy and introduces table_instance_flow_flush,
      which can be invoked by __dp_destroy or ovs_flow_tbl_flush.
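
      A minimal sketch of the refactoring pattern described above
      (illustrative names and structure only, not the actual openvswitch
      code; the real table is hashed, a flat list keeps the sketch short):

        struct flowent {
                struct list_head node;
                /* ... flow key and actions ... */
        };

        struct flowtbl {
                struct list_head flows;
        };

        /* One flow-flush helper shared by both teardown paths; the caller
         * holds ovs_lock so RCU readers and the deferred-free machinery see
         * a consistent view and each flow is freed exactly once. */
        static void flow_table_flush(struct flowtbl *tbl)
        {
                struct flowent *flow, *next;

                list_for_each_entry_safe(flow, next, &tbl->flows, node) {
                        list_del_rcu(&flow->node);
                        flow_free(flow);        /* illustrative free helper */
                }
        }

        static void dp_destroy(struct datapath_sketch *dp)
        {
                flow_table_flush(&dp->table);      /* under ovs_lock */
                table_instance_free(&dp->table);   /* now frees only the table */
        }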
      
      Fixes: 50b0e61b ("net: openvswitch: fix possible memleak on destroy flow-table")
      Reported-by: Johan Knöös <jknoos@google.com>
      Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2020-August/050489.html
      
      
      Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • af_packet: TPACKET_V3: fix fill status rwlock imbalance · 88fd1cb8
      John Ogness authored
      
      
      After @blk_fill_in_prog_lock is acquired, there is an early-out vnet
      path that can occur. In that case, the rwlock needs to be released.
      
      Also, since @blk_fill_in_prog_lock is only acquired when @tp_version
      is exactly TPACKET_V3, only release it on that exact condition as
      well.
      
      And finally, add sparse annotations so that it is clearer that
      prb_fill_curr_block() and prb_clear_blk_fill_status() are acquiring
      and releasing @blk_fill_in_prog_lock, respectively. sparse is still
      unable to understand the balance, but the warnings are now at a
      higher level that makes more sense.
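
      A minimal sketch of the two points above, with made-up names (this is
      not the af_packet.c code itself), showing the early-out unlock and the
      __acquires()/__releases() sparse annotations:

        static void start_fill(rwlock_t *fill_lock) __acquires(fill_lock)
        {
                read_lock(fill_lock);
        }

        static void end_fill(rwlock_t *fill_lock) __releases(fill_lock)
        {
                read_unlock(fill_lock);
        }

        static int fill_block(rwlock_t *fill_lock, bool is_v3, bool vnet_err)
        {
                if (is_v3)
                        start_fill(fill_lock);

                if (vnet_err) {
                        /* the fixed bug: do not return with the lock held */
                        if (is_v3)
                                end_fill(fill_lock);
                        return -EINVAL;
                }

                /* ... copy the frame into the current block ... */

                if (is_v3)
                        end_fill(fill_lock);
                return 0;
        }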
      
      Fixes: 632ca50f ("af_packet: TPACKET_V3: replace busy-wait loop")
      Signed-off-by: John Ogness <john.ogness@linutronix.de>
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: sock_ops sk access may stomp registers when dst_reg = src_reg · 84f44df6
      John Fastabend authored
      
      
      Similar to the patch ("bpf: sock_ops ctx access may stomp registers"),
      if src_reg == dst_reg when reading the sk field of a sock_ops struct,
      we generate the following xlated code:
      
        53: (61) r9 = *(u32 *)(r9 +28)
        54: (15) if r9 == 0x0 goto pc+3
        56: (79) r9 = *(u64 *)(r9 +0)
      
      This stomps on the r9 register to do the sk_fullsock check, so when
      reading the skops->sk field we get the sk_fullsock value instead of
      the sk pointer. To fix this, use the pattern noted in the previous fix
      and use the temp field to save/restore the register used for the
      sk_fullsock check.
      
      After the fix the generated xlated code reads,
      
        52: (7b) *(u64 *)(r9 +32) = r8
        53: (61) r8 = *(u32 *)(r9 +28)
        54: (15) if r8 == 0x0 goto pc+3
        55: (79) r8 = *(u64 *)(r9 +32)
        56: (79) r9 = *(u64 *)(r9 +0)
        57: (05) goto pc+1
        58: (79) r8 = *(u64 *)(r9 +32)
      
      Here the r9 register was in use, so r8 is chosen as the temporary
      register. At line 52 r8 is saved in the temp variable, and at line 55
      it is restored in case fullsock != 0. Finally, we handle the
      fullsock == 0 case by restoring at line 58.
      
      This adds a new macro, SOCK_OPS_GET_SK. It is almost possible to merge
      this with SOCK_OPS_GET_FIELD, but I found the extra branch logic a bit
      more confusing than just adding a new macro, despite some duplicated
      code.
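
      A hedged sketch of the save/check/restore pattern the new macro emits
      for the dst_reg == src_reg case (not the exact SOCK_OPS_GET_SK code;
      TEMP_OFF, FULLSOCK_OFF and SK_OFF stand in for the real struct
      bpf_sock_ops_kern offsets, and the real macro guards more register
      collision cases):

        int scratch = BPF_REG_9;

        if (si->dst_reg == scratch || si->src_reg == scratch)
                scratch = BPF_REG_8;    /* pick a register not in use */

        if (si->dst_reg == si->src_reg) {
                /* spill scratch so it can hold the is_fullsock value */
                *insn++ = BPF_STX_MEM(BPF_DW, si->src_reg, scratch, TEMP_OFF);
                *insn++ = BPF_LDX_MEM(BPF_W, scratch, si->src_reg, FULLSOCK_OFF);
                *insn++ = BPF_JMP_IMM(BPF_JEQ, scratch, 0, 3);
                /* fullsock: restore scratch, then load skops->sk into dst */
                *insn++ = BPF_LDX_MEM(BPF_DW, scratch, si->src_reg, TEMP_OFF);
                *insn++ = BPF_LDX_MEM(BPF_DW, si->dst_reg, si->src_reg, SK_OFF);
                *insn++ = BPF_JMP_A(1);
                /* !fullsock: src_reg still holds skops, restore scratch */
                *insn++ = BPF_LDX_MEM(BPF_DW, scratch, si->src_reg, TEMP_OFF);
        }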
      
      Fixes: 1314ef56 ("bpf: export bpf_sock for BPF_PROG_TYPE_SOCK_OPS prog type")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Song Liu <songliubraving@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/159718349653.4728.6559437186853473612.stgit@john-Precision-5820-Tower
    • bpf: sock_ops ctx access may stomp registers in corner case · fd09af01
      John Fastabend authored
      I had a sockmap program that after doing some refactoring started spewing
      this splat at me:
      
      [18610.807284] BUG: unable to handle kernel NULL pointer dereference at 0000000000000001
      [...]
      [18610.807359] Call Trace:
      [18610.807370]  ? 0xffffffffc114d0d5
      [18610.807382]  __cgroup_bpf_run_filter_sock_ops+0x7d/0xb0
      [18610.807391]  tcp_connect+0x895/0xd50
      [18610.807400]  tcp_v4_connect+0x465/0x4e0
      [18610.807407]  __inet_stream_connect+0xd6/0x3a0
      [18610.807412]  ? __inet_stream_connect+0x5/0x3a0
      [18610.807417]  inet_stream_connect+0x3b/0x60
      [18610.807425]  __sys_connect+0xed/0x120
      
      After some debugging I was able to build this simple reproducer,
      
       __section("sockops/reproducer_bad")
       int bpf_reproducer_bad(struct bpf_sock_ops *skops)
       {
              volatile __maybe_unused __u32 i = skops->snd_ssthresh;
              return 0;
       }
      
      And along the way I noticed that the program below ran without the splat,
      
      __section("sockops/reproducer_good")
      int bpf_reproducer_good(struct bpf_sock_ops *skops)
      {
              volatile __maybe_unused __u32 i = skops->snd_ssthresh;
              volatile __maybe_unused __u32 family;
      
              compiler_barrier();
      
              family = skops->family;
              return 0;
      }
      
      So I decided to check out the code we generate for the above two
      programs and noticed each generates the BPF code you would expect,
      
      0000000000000000 <bpf_reproducer_bad>:
      ;       volatile __maybe_unused __u32 i = skops->snd_ssthresh;
             0:       r1 = *(u32 *)(r1 + 96)
             1:       *(u32 *)(r10 - 4) = r1
      ;       return 0;
             2:       r0 = 0
             3:       exit
      
      0000000000000000 <bpf_reproducer_good>:
      ;       volatile __maybe_unused __u32 i = skops->snd_ssthresh;
             0:       r2 = *(u32 *)(r1 + 96)
             1:       *(u32 *)(r10 - 4) = r2
      ;       family = skops->family;
             2:       r1 = *(u32 *)(r1 + 20)
             3:       *(u32 *)(r10 - 8) = r1
      ;       return 0;
             4:       r0 = 0
             5:       exit
      
      So we get reasonable assembly, but still something was causing the null
      pointer dereference. So we load the programs and dump the xlated
      version, observing that line 0 above, 'r* = *(u32 *)(r1 +96)', is going
      to be translated by the skops access helpers.
      
      int bpf_reproducer_bad(struct bpf_sock_ops * skops):
      ; volatile __maybe_unused __u32 i = skops->snd_ssthresh;
         0: (61) r1 = *(u32 *)(r1 +28)
         1: (15) if r1 == 0x0 goto pc+2
         2: (79) r1 = *(u64 *)(r1 +0)
         3: (61) r1 = *(u32 *)(r1 +2340)
      ; volatile __maybe_unused __u32 i = skops->snd_ssthresh;
         4: (63) *(u32 *)(r10 -4) = r1
      ; return 0;
         5: (b7) r0 = 0
         6: (95) exit
      
      int bpf_reproducer_good(struct bpf_sock_ops * skops):
      ; volatile __maybe_unused __u32 i = skops->snd_ssthresh;
         0: (61) r2 = *(u32 *)(r1 +28)
         1: (15) if r2 == 0x0 goto pc+2
         2: (79) r2 = *(u64 *)(r1 +0)
         3: (61) r2 = *(u32 *)(r2 +2340)
      ; volatile __maybe_unused __u32 i = skops->snd_ssthresh;
         4: (63) *(u32 *)(r10 -4) = r2
      ; family = skops->family;
         5: (79) r1 = *(u64 *)(r1 +0)
         6: (69) r1 = *(u16 *)(r1 +16)
      ; family = skops->family;
         7: (63) *(u32 *)(r10 -8) = r1
      ; return 0;
         8: (b7) r0 = 0
         9: (95) exit
      
      Then we look at lines 0 and 2 above. In the good case we do the zero
      check in r2 and then load 'r1 + 0' at line 2. A quick cross-check
      against the bpf_sock_ops conversion confirms that this is the 'struct
      sock *sk' pointer field. But in the bad case,
      
         0: (61) r1 = *(u32 *)(r1 +28)
         1: (15) if r1 == 0x0 goto pc+2
         2: (79) r1 = *(u64 *)(r1 +0)
      
      Oh no, we read 'r1 +28' into r1; this is skops->fullsock, and then in
      line 2 we read 'r1 +0' as a pointer. Now, jumping back to our splat,
      
      [18610.807284] BUG: unable to handle kernel NULL pointer dereference at 0000000000000001
      
      The 0x01 makes sense because that is exactly the fullsock value. And
      it's not a valid dereference, so we splat.
      
      To fix this, we need to guard the case when a program is doing a
      sock_ops field access with src_reg == dst_reg. This is already handled
      in the load case, where the ctx_access handler uses a tmp register,
      being careful to store the old value and restore it. To fix the get
      case, test if src_reg == dst_reg and, in that case, do the is_fullsock
      test in the temporary register, remembering to restore the temporary
      register before writing to either dst_reg or src_reg to avoid smashing
      the pointer into the struct holding the tmp variable.
      
      With the fix, this inline code added to test_tcpbpf_kern will now be
      generated correctly. From,
      
        9: r2 = *(u32 *)(r2 + 96)
      
      to xlated code,
      
        12: (7b) *(u64 *)(r2 +32) = r9
        13: (61) r9 = *(u32 *)(r2 +28)
        14: (15) if r9 == 0x0 goto pc+4
        15: (79) r9 = *(u64 *)(r2 +32)
        16: (79) r2 = *(u64 *)(r2 +0)
        17: (61) r2 = *(u32 *)(r2 +2348)
        18: (05) goto pc+1
        19: (79) r9 = *(u64 *)(r2 +32)
      
      And in the normal case we keep the original code, because really this
      is an edge case. From this,
      
        9: r2 = *(u32 *)(r6 + 96)
      
      to xlated code,
      
        22: (61) r2 = *(u32 *)(r6 +28)
        23: (15) if r2 == 0x0 goto pc+2
        24: (79) r2 = *(u64 *)(r6 +0)
        25: (61) r2 = *(u32 *)(r2 +2348)
      
      So that is three additional instructions if dst == src register, but I
      scanned my current code base and did not see this pattern anywhere, so
      it should not be a big deal. Further, it seems no one else has hit
      this, or at least reported it, so it must be a fairly rare pattern.
      
      Fixes: 9b1f3d6e ("bpf: Refactor sock_ops_convert_ctx_access")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Song Liu <songliubraving@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/159718347772.4728.2781381670567919577.stgit@john-Precision-5820-Tower
    • netfilter: nf_tables: free chain context when BINDING flag is missing · 59136aa3
      Florian Westphal authored
      
      
      syzbot found a memory leak in nf_tables_addchain() because the chain
      object is not freed correctly on error.
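
      A minimal sketch of the error-unwind shape involved (illustrative
      names, not the actual nf_tables_addchain() code):

        static int add_chain(void)
        {
                struct chain_ctx *chain;
                int err;

                chain = kzalloc(sizeof(*chain), GFP_KERNEL);
                if (!chain)
                        return -ENOMEM;

                err = register_chain_hooks(chain);      /* illustrative step */
                if (err < 0)
                        goto err_free_chain;    /* without this, chain leaks */

                return 0;

        err_free_chain:
                kfree(chain);
                return err;
        }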
      
      Fixes: d0e2c7de ("netfilter: nf_tables: add NFT_CHAIN_BINDING")
      Reported-by: <syzbot+c99868fde67014f7e9f5@syzkaller.appspotmail.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: avoid ipv6 -> nf_defrag_ipv6 module dependency · 2404b73c
      Florian Westphal authored
      
      
      nf_ct_frag6_gather is part of nf_defrag_ipv6.ko, not ipv6 core.
      
      The current use of the netfilter ipv6 stub indirections causes a module
      dependency between ipv6 and nf_defrag_ipv6.

      This prevents the nf_defrag_ipv6 module from being removed, because
      ipv6 can't be unloaded.
      
      Remove the indirection and always use a direct call. This creates a
      dependency from nf_conntrack_bridge to nf_defrag_ipv6 instead:
      
      modinfo nf_conntrack_bridge
      depends:        nf_conntrack,nf_defrag_ipv6,bridge
      
      .. and nf_conntrack already depends on nf_defrag_ipv6 anyway.
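
      A hedged sketch of the change in call structure (the stub type and
      wrapper names are illustrative, not the actual netfilter code; only
      nf_ct_frag6_gather() is a real symbol):

        /* before: an indirection owned by the ipv6 module */
        struct ipv6_defrag_stub {
                int (*frag_gather)(struct net *net, struct sk_buff *skb,
                                   u32 user);
        };
        extern const struct ipv6_defrag_stub *defrag_stub;  /* set by ipv6 */

        static int defrag_via_stub(struct net *net, struct sk_buff *skb,
                                   u32 user)
        {
                /* pins nf_defrag_ipv6 behind the unremovable ipv6 module */
                return defrag_stub ? defrag_stub->frag_gather(net, skb, user)
                                   : 0;
        }

        /* after: call the exported nf_defrag_ipv6 symbol directly, so the
         * module dependency sits on the real caller instead */
        static int defrag_direct(struct net *net, struct sk_buff *skb,
                                 u32 user)
        {
                return nf_ct_frag6_gather(net, skb, user);
        }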
      
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • bpf: Fix XDP FD-based attach/detach logic around XDP_FLAGS_UPDATE_IF_NOEXIST · 068d9d1e
      Andrii Nakryiko authored
      
      
      Enforce XDP_FLAGS_UPDATE_IF_NOEXIST only if the new BPF program to be
      attached is non-NULL (i.e., we are not detaching a BPF program).
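
      A minimal sketch of that guard (illustrative function name, not the
      exact dev_xdp_attach() code):

        static int xdp_noexist_check(struct bpf_prog *new_prog,
                                     struct bpf_prog *cur_prog, u32 flags)
        {
                /* only refuse replacement when actually attaching;
                 * a NULL new_prog means we are detaching */
                if (new_prog && (flags & XDP_FLAGS_UPDATE_IF_NOEXIST) &&
                    cur_prog)
                        return -EBUSY;
                return 0;
        }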
      
      Fixes: d4baa936 ("bpf, xdp: Extract common XDP program attachment logic")
      Reported-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Tested-by: Stanislav Fomichev <sdf@google.com>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20200812022923.1217922-1-andriin@fb.com
  8. Aug 10, 2020
    • tcp: correct read of TFO keys on big endian systems · f19008e6
      Jason Baron authored
      
      
      When TFO keys are read back on big endian systems either via the global
      sysctl interface or via getsockopt() using TCP_FASTOPEN_KEY, the values
      don't match what was written.
      
      For example, on s390x:
      
      # echo "1-2-3-4" > /proc/sys/net/ipv4/tcp_fastopen_key
      # cat /proc/sys/net/ipv4/tcp_fastopen_key
      02000000-01000000-04000000-03000000
      
      Instead of:
      
      # cat /proc/sys/net/ipv4/tcp_fastopen_key
      00000001-00000002-00000003-00000004
      
      Fix this by converting to the correct endianness on read. This was
      reported by Colin Ian King when running the 'tcp_fastopen_backup_key' net
      selftest on s390x, which depends on the read value matching what was
      written. I've confirmed that the test now passes on big and little endian
      systems.
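
      A hedged sketch of the kind of conversion involved (illustrative
      helper; how the key words are actually stored and read back is
      simplified here): if the write path stored the words via cpu_to_le32(),
      the read path has to undo that before formatting, otherwise big-endian
      hosts print the byte-swapped values shown above.

        static void format_tfo_key(char *buf, size_t len,
                                   const __le32 stored[4])
        {
                snprintf(buf, len, "%08x-%08x-%08x-%08x",
                         le32_to_cpu(stored[0]), le32_to_cpu(stored[1]),
                         le32_to_cpu(stored[2]), le32_to_cpu(stored[3]));
        }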
      
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Fixes: 438ac880 ("net: fastopen: robustness and endianness fixes for SipHash")
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Reported-and-tested-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Revert "net: optimize the sockptr_t for unified kernel/user address spaces" · 519a8a6c
      Christoph Hellwig authored
      
      
      This reverts commits 6d04fe15 and
      a31edb20.
      
      It turns out the idea to share a single pointer for both kernel and
      user space addresses causes various kinds of problems.  So use the
      slightly less optimal version that uses an extra bit, but which is
      guaranteed to be safe everywhere.
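
      For reference, a sketch of the "extra bit" representation being
      reverted to (close in spirit to include/linux/sockptr.h, but shown
      here only as an illustration):

        typedef struct {
                union {
                        void            *kernel;
                        void __user     *user;
                };
                bool                    is_kernel : 1;
        } sockptr_sketch_t;

        static inline bool sockptr_is_kernel(sockptr_sketch_t p)
        {
                /* the explicit flag replaces guessing from the pointer value */
                return p.is_kernel;
        }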
      
      Fixes: 6d04fe15 ("net: optimize the sockptr_t for unified kernel/user address spaces")
      Reported-by: Eric Dumazet <edumazet@google.com>
      Reported-by: John Stultz <john.stultz@linaro.org>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netfilter: nft_compat: remove flush counter optimization · 2f941622
      Florian Westphal authored
      
      
      WARNING: CPU: 1 PID: 16059 at lib/refcount.c:31 refcount_warn_saturate+0xdf/0xf
      [..]
       __nft_mt_tg_destroy+0x42/0x50 [nft_compat]
       nft_target_destroy+0x63/0x80 [nft_compat]
       nf_tables_expr_destroy+0x1b/0x30 [nf_tables]
       nf_tables_rule_destroy+0x3a/0x70 [nf_tables]
       nf_tables_exit_net+0x186/0x3d0 [nf_tables]
      
      This happens when a compat expr is destroyed from the abort path.
      There is no functional impact; after this change, the work queue is
      flushed unconditionally if it is pending.
      
      This removes the waitcount optimization.  A test of repeated
      iptables-restore of a ~60k kubernetes ruleset doesn't indicate a
      slowdown.  In case the counter is needed after all for some workloads,
      we can revert this and increment the refcount for the
      != NFT_PREPARE_TRANS case to avoid the increment/decrement imbalance.
      
      While at it, also flush for the match case; this was an oversight
      in the original patch.
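
      A hedged sketch of the simplification (names are illustrative, not the
      nft_compat symbols): the destructor work is flushed unconditionally
      instead of being gated on a waiter count that the abort path could
      leave unbalanced.

        static void run_destructors(struct work_struct *work)
        {
                /* ... run the deferred xt destructors ... */
        }
        static DECLARE_WORK(destroy_work, run_destructors);

        static void expr_destroy_deferred(void)
        {
                schedule_work(&destroy_work);
        }

        static void wait_for_destructors(void)
        {
                /* before: only flushed when a waiter count said someone was
                 * pending; after: always flush (a no-op if nothing queued). */
                flush_work(&destroy_work);
        }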
      
      Fixes: ffe8923f ("netfilter: nft_compat: make sure xtables destructors have run")
      Reported-by: kernel test robot <rong.a.chen@intel.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nf_tables: nft_exthdr: the presence return value should be little-endian · b4283366
      Stephen Suryaputra authored
      
      
      On big-endian machines, the returned register data when the exthdr is
      present is not compared correctly, because little-endian is assumed:
      the function nft_cmp_fast_mask(), called by nft_cmp_fast_eval()
      and nft_cmp_fast_init(), calls cpu_to_le32().
      
      The following dump also shows that little endian is assumed:
      
      $ nft --debug=netlink add rule ip recordroute forward ip option rr exists counter
      ip
        [ exthdr load ipv4 1b @ 7 + 0 present => reg 1 ]
        [ cmp eq reg 1 0x01000000 ]
        [ counter pkts 0 bytes 0 ]
      
      Lastly, a debug print in nft_cmp_fast_init() and nft_cmp_fast_eval()
      when the RR option exists in the packet shows that the comparison
      fails because of this assumption:
      
      nft_cmp_fast_init:189 priv->sreg=4 desc.len=8 mask=0xff000000 data.data[0]=0x10003e0
      nft_cmp_fast_eval:57 regs->data[priv->sreg=4]=0x1 mask=0xff000000 priv->data=0x1000000
      
      v2: use nft_reg_store8() instead (Florian Westphal). This also avoids
          the warnings reported by the kernel test robot.
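
      A minimal sketch of the v2 approach, assuming the upstream
      nft_reg_store8() helper (the surrounding function and variable names
      are illustrative, not the exact nft_exthdr code):

        static void store_presence(u32 *dest, bool present)
        {
                /* nft_reg_store8() zeroes the 32-bit register and writes the
                 * byte at its lowest address, so the comparison data matches
                 * on both big- and little-endian machines. */
                nft_reg_store8(dest, present);  /* instead of *dest = present */
        }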
      
      Fixes: dbb5281a ("netfilter: nf_tables: add support for matching IPv4 options")
      Fixes: c078ca3b ("netfilter: nft_exthdr: Add support for existence check")
      Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>