  1. Aug 05, 2019
  2. Aug 03, 2019
    • hv_sock: Fix hang when a connection is closed · 685703b4
      Dexuan Cui authored
      
      
      There is a race condition for an established connection that is being closed
      by the guest: the refcnt is 4 at the end of hvs_release() (Note: here the
      'remove_sock' is false):
      
      1 for the initial value;
      1 for the sk being in the bound list;
      1 for the sk being in the connected list;
      1 for the delayed close_work.
      
      After hvs_release() finishes, __vsock_release() -> sock_put(sk) *may*
      decrease the refcnt to 3.
      
      Concurrently, hvs_close_connection() runs in another thread:
        calls vsock_remove_sock() to decrease the refcnt by 2;
        calls sock_put() to decrease the refcnt to 0, and free the sk;
        next, the "release_sock(sk)" may hang due to use-after-free.
      
      In the above, after hvs_release() finishes, if hvs_close_connection() runs
      faster than "__vsock_release() -> sock_put(sk)", then there is not any issue,
      because at the beginning of hvs_close_connection(), the refcnt is still 4.
      
      The issue can be resolved if an extra reference is taken when the
      connection is established.
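
      The interleaving above can be replayed with a toy reference counter.
      This is a minimal userspace sketch, not the kernel code: the Sock class
      and close_after_release() are hypothetical names, and the sketch assumes
      the extra reference is dropped only after release_sock() has returned.

```python
class Sock:
    """Toy stand-in for a kernel socket with a reference count."""

    def __init__(self, refcnt):
        self.refcnt = refcnt
        self.freed = False

    def put(self, n=1):
        # Drop n references; the object is freed when the count hits zero.
        self.refcnt -= n
        if self.refcnt == 0:
            self.freed = True


def close_after_release(extra_ref):
    """Replay the bad interleaving.

    Returns (freed before release_sock would run, freed at the end).
    """
    # 4 references at the end of hvs_release(): initial value, bound list,
    # connected list, delayed close_work. The fix adds one more (assumed
    # here to be taken when the connection is established).
    sk = Sock(4 + (1 if extra_ref else 0))

    sk.put()      # __vsock_release() -> sock_put(sk) wins the race
    sk.put(2)     # hvs_close_connection(): vsock_remove_sock()
    sk.put()      # hvs_close_connection(): sock_put()

    freed_too_early = sk.freed   # release_sock(sk) would run at this point
    if extra_ref:
        sk.put()  # drop the connection's reference last (assumption)
    return freed_too_early, sk.freed
```

      Without the extra reference the count reaches 0 before release_sock()
      would run; with it, the socket stays alive until the final put.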
      
      Fixes: a9eeb998 ("hv_sock: Add support for delayed close")
      Signed-off-by: Dexuan Cui <decui@microsoft.com>
      Reviewed-by: Sunil Muthuswamy <sunilmut@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. Aug 01, 2019
    • tipc: compat: allow tipc commands without arguments · 4da5f001
      Taras Kondratiuk authored
      
      
      Commit 2753ca5d ("tipc: fix uninit-value in tipc_nl_compat_doit")
      broke older tipc tools that use compat interface (e.g. tipc-config from
      tipcutils package):
      
      % tipc-config -p
      operation not supported
      
      The commit started to reject TIPC netlink compat messages that do not
      have attributes. It is too restrictive because some of such messages are
      valid (they don't need any arguments):
      
      % grep 'tx none' include/uapi/linux/tipc_config.h
      #define  TIPC_CMD_NOOP              0x0000    /* tx none, rx none */
      #define  TIPC_CMD_GET_MEDIA_NAMES   0x0002    /* tx none, rx media_name(s) */
      #define  TIPC_CMD_GET_BEARER_NAMES  0x0003    /* tx none, rx bearer_name(s) */
      #define  TIPC_CMD_SHOW_PORTS        0x0006    /* tx none, rx ultra_string */
      #define  TIPC_CMD_GET_REMOTE_MNG    0x4003    /* tx none, rx unsigned */
      #define  TIPC_CMD_GET_MAX_PORTS     0x4004    /* tx none, rx unsigned */
      #define  TIPC_CMD_GET_NETID         0x400B    /* tx none, rx unsigned */
      #define  TIPC_CMD_NOT_NET_ADMIN     0xC001    /* tx none, rx none */
      
      This patch relaxes the original fix and rejects messages without
      arguments only if such arguments are expected by a command (req_type is
      non-zero).
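
      The relaxed rule can be sketched as a small predicate. This is an
      illustrative model only: check_compat_request() and its parameters are
      hypothetical, and the TLV type numbers used with it are made up rather
      than taken from the TIPC headers.

```python
EINVAL = 22


def check_compat_request(expected_req_type, req_size, req_tlv_type=None):
    """Return 0 if the compat request is acceptable, -EINVAL otherwise.

    expected_req_type: TLV type the command expects, 0 if it takes none.
    req_size:          size of the attribute payload sent by the tool.
    req_tlv_type:      TLV type found in the payload, if any.
    """
    if expected_req_type == 0:
        # "tx none" commands: an empty payload is perfectly valid.
        return 0
    if req_size == 0 or req_tlv_type != expected_req_type:
        # The command needs an argument that is missing or mistyped.
        return -EINVAL
    return 0
```

      A "tx none" command such as TIPC_CMD_SHOW_PORTS would map to
      check_compat_request(0, 0) and pass, while a command that expects an
      argument still fails when the payload is empty.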
      
      Fixes: 2753ca5d ("tipc: fix uninit-value in tipc_nl_compat_doit")
      Cc: stable@vger.kernel.org
      Signed-off-by: Taras Kondratiuk <takondra@cisco.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. Jul 31, 2019
    • net: bridge: mcast: don't delete permanent entries when fast leave is enabled · 5c725b6b
      Nikolay Aleksandrov authored
      
      
      When permanent entries were introduced by the commit below, they were
      exempt from timing out, so an IGMP leave did not affect them unless
      fast leave was enabled on the port. Fast leave was added before
      permanent entries existed. Whether fast leave is enabled should not
      matter: if the user added a permanent entry, it must not be deleted
      on IGMP leave.
      
      Before:
      $ echo 1 > /sys/class/net/eth4/brport/multicast_fast_leave
      $ bridge mdb add dev br0 port eth4 grp 229.1.1.1 permanent
      $ bridge mdb show
      dev br0 port eth4 grp 229.1.1.1 permanent
      
      < join and leave 229.1.1.1 on eth4 >
      
      $ bridge mdb show
      $
      
      After:
      $ echo 1 > /sys/class/net/eth4/brport/multicast_fast_leave
      $ bridge mdb add dev br0 port eth4 grp 229.1.1.1 permanent
      $ bridge mdb show
      dev br0 port eth4 grp 229.1.1.1 permanent
      
      < join and leave 229.1.1.1 on eth4 >
      
      $ bridge mdb show
      dev br0 port eth4 grp 229.1.1.1 permanent
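
      The intended behaviour can be modeled in a few lines. This is a hedged
      sketch rather than the bridge implementation: Port, Mdb and igmp_leave()
      are hypothetical names, and the timer and query handling used when fast
      leave is off is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Port:
    name: str
    fast_leave: bool = False


@dataclass
class Mdb:
    # Maps (port name, group address) to the entry's "permanent" flag.
    entries: dict = field(default_factory=dict)

    def add(self, port, group, permanent=False):
        self.entries[(port.name, group)] = permanent

    def igmp_leave(self, port, group):
        key = (port.name, group)
        if key not in self.entries:
            return
        if self.entries[key]:
            # Permanent entries are user-configured: a leave never
            # removes them, with or without fast leave.
            return
        if port.fast_leave:
            # Fast leave drops a temporary entry immediately.
            del self.entries[key]
```

      With fast leave enabled on eth4, a leave for a permanent group keeps the
      entry, while a leave for a temporary group removes it.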
      
      Fixes: ccb1c31a ("bridge: add flags to distinguish permanent mdb entires")
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. Jul 30, 2019
    • compat_ioctl: pppoe: fix PPPOEIOCSFWD handling · 055d8824
      Arnd Bergmann authored
      
      
      Support for handling the PPPOEIOCSFWD ioctl in compat mode was added in
      linux-2.5.69 along with hundreds of other commands, but was always broken
      since only the structure is compatible, but the command number is not,
      due to the size being sizeof(size_t), or at first sizeof(sizeof(struct
      sockaddr_pppox)), which is different on 64-bit architectures.
      
      Guillaume Nault adds:
      
        And the implementation was broken until 2016 (see 29e73269 ("pppoe:
        fix reference counting in PPPoE proxy")), and nobody ever noticed. I
        should probably have removed this ioctl entirely instead of fixing it.
        Clearly, it has never been used.
      
      Fix it by adding a compat_ioctl handler for all pppoe variants that
      translates the command number and then calls the regular ioctl function.
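
      Why the command number differs can be shown by rebuilding it by hand.
      The sketch below assumes the generic Linux ioctl encoding (8 bits of
      number, 8 of type, 14 of size, 2 of direction, write = 1), which holds
      on x86 and arm but not on every architecture, and it assumes
      PPPOEIOCSFWD is _IOW(0xB1, 0, size_t). PPPOEIOCSFWD32 and
      compat_translate() are illustrative names for the 32-bit command and
      the translation step.

```python
# Generic ioctl number layout (assumption: asm-generic encoding).
IOC_NRBITS, IOC_TYPEBITS, IOC_SIZEBITS = 8, 8, 14
IOC_NRSHIFT = 0
IOC_TYPESHIFT = IOC_NRSHIFT + IOC_NRBITS
IOC_SIZESHIFT = IOC_TYPESHIFT + IOC_TYPEBITS
IOC_DIRSHIFT = IOC_SIZESHIFT + IOC_SIZEBITS
IOC_WRITE = 1


def iow(type_, nr, size):
    """Equivalent of the _IOW() macro: the argument size is part of the number."""
    return ((IOC_WRITE << IOC_DIRSHIFT) | (size << IOC_SIZESHIFT)
            | (type_ << IOC_TYPESHIFT) | (nr << IOC_NRSHIFT))


PPPOEIOCSFWD = iow(0xB1, 0, 8)     # sizeof(size_t) on a 64-bit kernel
PPPOEIOCSFWD32 = iow(0xB1, 0, 4)   # sizeof(size_t) as seen by 32-bit userspace


def compat_translate(cmd):
    """Map the 32-bit command number to the native one; pass others through."""
    return PPPOEIOCSFWD if cmd == PPPOEIOCSFWD32 else cmd
```

      Only the size field differs between the two numbers, which is why a
      handler that translates the number and then calls the regular ioctl
      function is enough.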
      
      All other ioctl commands handled by pppoe are compatible between 32-bit
      and 64-bit, and require compat_ptr() conversion.
      
      This should apply to all stable kernels.
      
      Acked-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: fix unitilized skb list crash · 2948a1fc
      Jon Maloy authored
      
      
      Our test suite sometimes provokes the following crash:
      
      Description of problem:
      [ 1092.597234] BUG: unable to handle kernel NULL pointer dereference at 00000000000000e8
      [ 1092.605072] PGD 0 P4D 0
      [ 1092.607620] Oops: 0000 [#1] SMP PTI
      [ 1092.611118] CPU: 37 PID: 0 Comm: swapper/37 Kdump: loaded Not tainted 4.18.0-122.el8.x86_64 #1
      [ 1092.619724] Hardware name: Dell Inc. PowerEdge R740/08D89F, BIOS 1.3.7 02/08/2018
      [ 1092.627215] RIP: 0010:tipc_mcast_filter_msg+0x93/0x2d0 [tipc]
      [ 1092.632955] Code: 0f 84 aa 01 00 00 89 cf 4d 01 ca 4c 8b 26 c1 ef 19 83 e7 0f 83 ff 0c 4d 0f 45 d1 41 8b 6a 10 0f cd 4c 39 e6 0f 84 81 01 00 00 <4d> 8b 9c 24 e8 00 00 00 45 8b 13 41 0f ca 44 89 d7 c1 ef 13 83 e7
      [ 1092.651703] RSP: 0018:ffff929e5fa83a18 EFLAGS: 00010282
      [ 1092.656927] RAX: ffff929e3fb38100 RBX: 00000000069f29ee RCX: 00000000416c0045
      [ 1092.664058] RDX: ffff929e5fa83a88 RSI: ffff929e31a28420 RDI: 0000000000000000
      [ 1092.671209] RBP: 0000000029b11821 R08: 0000000000000000 R09: ffff929e39b4407a
      [ 1092.678343] R10: ffff929e39b4407a R11: 0000000000000007 R12: 0000000000000000
      [ 1092.685475] R13: 0000000000000001 R14: ffff929e3fb38100 R15: ffff929e39b4407a
      [ 1092.692614] FS:  0000000000000000(0000) GS:ffff929e5fa80000(0000) knlGS:0000000000000000
      [ 1092.700702] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1092.706447] CR2: 00000000000000e8 CR3: 000000031300a004 CR4: 00000000007606e0
      [ 1092.713579] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 1092.720712] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 1092.727843] PKRU: 55555554
      [ 1092.730556] Call Trace:
      [ 1092.733010]  <IRQ>
      [ 1092.735034]  tipc_sk_filter_rcv+0x7ca/0xb80 [tipc]
      [ 1092.739828]  ? __kmalloc_node_track_caller+0x1cb/0x290
      [ 1092.744974]  ? dev_hard_start_xmit+0xa5/0x210
      [ 1092.749332]  tipc_sk_rcv+0x389/0x640 [tipc]
      [ 1092.753519]  tipc_sk_mcast_rcv+0x23c/0x3a0 [tipc]
      [ 1092.758224]  tipc_rcv+0x57a/0xf20 [tipc]
      [ 1092.762154]  ? ktime_get_real_ts64+0x40/0xe0
      [ 1092.766432]  ? tpacket_rcv+0x50/0x9f0
      [ 1092.770098]  tipc_l2_rcv_msg+0x4a/0x70 [tipc]
      [ 1092.774452]  __netif_receive_skb_core+0xb62/0xbd0
      [ 1092.779164]  ? enqueue_entity+0xf6/0x630
      [ 1092.783084]  ? kmem_cache_alloc+0x158/0x1c0
      [ 1092.787272]  ? __build_skb+0x25/0xd0
      [ 1092.790849]  netif_receive_skb_internal+0x42/0xf0
      [ 1092.795557]  napi_gro_receive+0xba/0xe0
      [ 1092.799417]  mlx5e_handle_rx_cqe+0x83/0xd0 [mlx5_core]
      [ 1092.804564]  mlx5e_poll_rx_cq+0xd5/0x920 [mlx5_core]
      [ 1092.809536]  mlx5e_napi_poll+0xb2/0xce0 [mlx5_core]
      [ 1092.814415]  ? __wake_up_common_lock+0x89/0xc0
      [ 1092.818861]  net_rx_action+0x149/0x3b0
      [ 1092.822616]  __do_softirq+0xe3/0x30a
      [ 1092.826193]  irq_exit+0x100/0x110
      [ 1092.829512]  do_IRQ+0x85/0xd0
      [ 1092.832483]  common_interrupt+0xf/0xf
      [ 1092.836147]  </IRQ>
      [ 1092.838255] RIP: 0010:cpuidle_enter_state+0xb7/0x2a0
      [ 1092.843221] Code: e8 3e 79 a5 ff 80 7c 24 03 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 d7 01 00 00 31 ff e8 a0 6b ab ff fb 66 0f 1f 44 00 00 <48> b8 ff ff ff ff f3 01 00 00 4c 29 f3 ba ff ff ff 7f 48 39 c3 7f
      [ 1092.861967] RSP: 0018:ffffaa5ec6533e98 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdd
      [ 1092.869530] RAX: ffff929e5faa3100 RBX: 000000fe63dd2092 RCX: 000000000000001f
      [ 1092.876665] RDX: 000000fe63dd2092 RSI: 000000003a518aaa RDI: 0000000000000000
      [ 1092.883795] RBP: 0000000000000003 R08: 0000000000000004 R09: 0000000000022940
      [ 1092.890929] R10: 0000040cb0666b56 R11: ffff929e5faa20a8 R12: ffff929e5faade78
      [ 1092.898060] R13: ffffffffb59258f8 R14: 000000fe60f3228d R15: 0000000000000000
      [ 1092.905196]  ? cpuidle_enter_state+0x92/0x2a0
      [ 1092.909555]  do_idle+0x236/0x280
      [ 1092.912785]  cpu_startup_entry+0x6f/0x80
      [ 1092.916715]  start_secondary+0x1a7/0x200
      [ 1092.920642]  secondary_startup_64+0xb7/0xc0
      [...]
      
      The reason is that the skb list tipc_socket::mc_method.deferredq is
      initialized only for connectionless sockets, while nothing stops arriving
      multicast messages from being filtered by connection-oriented sockets,
      with subsequent access to the said list.

      We fix this by initializing the list unconditionally at socket creation.
      This eliminates the crash, while the message is still dropped further
      down in tipc_sk_filter_rcv() as it should be.
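
      The failure mode can be illustrated with a toy circular list. This
      sketch assumes the crash comes from walking a list head that was never
      initialized; SkbQueueHead, queue_walk() and walk_deferredq() are
      hypothetical stand-ins, not the kernel's sk_buff helpers.

```python
class SkbQueueHead:
    """A zeroed list head: both links start out as NULL (None)."""

    def __init__(self):
        self.next = None
        self.prev = None


def queue_head_init(head):
    # An empty circular queue points back at its own head.
    head.next = head.prev = head


def queue_walk(head):
    """Iterate until the walk returns to the head, as a circular list walk does."""
    node = head.next
    while node is not head:
        if node is None:
            # The kernel would dereference this NULL pointer and oops.
            raise RuntimeError("NULL dereference")
        yield node
        node = node.next


def walk_deferredq(connectionless, always_init):
    """Create a socket's deferred queue and walk it, reporting what happens."""
    deferredq = SkbQueueHead()
    if always_init or connectionless:
        queue_head_init(deferredq)
    try:
        return list(queue_walk(deferredq))
    except RuntimeError as err:
        return str(err)
```

      Initializing the head unconditionally turns the walk on a
      connection-oriented socket into a walk over an empty list.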
      
      Reported-by: Li Shuang <shuali@redhat.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rxrpc: Fix the lack of notification when sendmsg() fails on a DATA packet · c69565ee
      David Howells authored
      
      
      Fix the fact that a notification isn't sent to the recvmsg side to indicate
      a call failed when sendmsg() fails to transmit a DATA packet with the error
      ENETUNREACH, EHOSTUNREACH or ECONNREFUSED.
      
      Without this notification, the afs client just sits there waiting for the
      call to complete in some manner (which it's not now going to do), which
      also pins the rxrpc call in place.
      
      This can be seen if the client has a scope-level IPv6 address, but not a
      global-level IPv6 address, and we try and transmit an operation to a
      server's IPv6 address.
      
      Looking in /proc/net/rxrpc/calls shows completed calls just sat there with
      an abort code of RX_USER_ABORT and an error code of -ENETUNREACH.
      
      Fixes: c54e43d7 ("rxrpc: Fix missing start of call timeout")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
      Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
    • rxrpc: Fix potential deadlock · 60034d3d
      David Howells authored
      
      
      There is a potential deadlock in rxrpc_peer_keepalive_dispatch() whereby
      rxrpc_put_peer() is called with the peer_hash_lock held, but if it reduces
      the peer's refcount to 0, rxrpc_put_peer() calls __rxrpc_put_peer(), which
      then tries to take the already held lock.
      
      Fix this by providing a version of rxrpc_put_peer() that can be called in
      situations where the lock is already held.
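
      The pattern can be reproduced in userspace with a non-reentrant lock.
      This is a sketch under stated assumptions: Peer, put_peer() and
      put_peer_locked() are hypothetical analogues of the rxrpc functions, and
      the model raises an error where a real spinlock would spin forever.

```python
import threading


class Peer:
    def __init__(self, refs):
        self.refs = refs
        self.hashed = True


# threading.Lock is not reentrant, much like a spinlock.
peer_hash_lock = threading.Lock()


def put_peer(peer):
    """Drop a reference; the final put takes peer_hash_lock itself."""
    peer.refs -= 1
    if peer.refs == 0:
        # Fail loudly instead of hanging when the caller holds the lock.
        if not peer_hash_lock.acquire(blocking=False):
            raise RuntimeError("recursive locking of peer_hash_lock")
        try:
            peer.hashed = False
        finally:
            peer_hash_lock.release()


def put_peer_locked(peer):
    """Variant for callers that already hold peer_hash_lock."""
    assert peer_hash_lock.locked()
    peer.refs -= 1
    if peer.refs == 0:
        peer.hashed = False


def dispatch_with_lock_held(put):
    """Drop the last reference on a peer while holding the lock."""
    peer = Peer(refs=1)
    with peer_hash_lock:
        try:
            put(peer)
        except RuntimeError:
            return "deadlock"
    return "ok"
```

      Calling dispatch_with_lock_held(put_peer) reports the self-deadlock,
      while the locked variant completes because it never re-acquires the
      lock.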
      
      The bug may produce the following lockdep report:
      
      ============================================
      WARNING: possible recursive locking detected
      5.2.0-next-20190718 #41 Not tainted
      --------------------------------------------
      kworker/0:3/21678 is trying to acquire lock:
      00000000aa5eecdf (&(&rxnet->peer_hash_lock)->rlock){+.-.}, at: spin_lock_bh
      /./include/linux/spinlock.h:343 [inline]
      00000000aa5eecdf (&(&rxnet->peer_hash_lock)->rlock){+.-.}, at:
      __rxrpc_put_peer /net/rxrpc/peer_object.c:415 [inline]
      00000000aa5eecdf (&(&rxnet->peer_hash_lock)->rlock){+.-.}, at:
      rxrpc_put_peer+0x2d3/0x6a0 /net/rxrpc/peer_object.c:435
      
      but task is already holding lock:
      00000000aa5eecdf (&(&rxnet->peer_hash_lock)->rlock){+.-.}, at: spin_lock_bh
      /./include/linux/spinlock.h:343 [inline]
      00000000aa5eecdf (&(&rxnet->peer_hash_lock)->rlock){+.-.}, at:
      rxrpc_peer_keepalive_dispatch /net/rxrpc/peer_event.c:378 [inline]
      00000000aa5eecdf (&(&rxnet->peer_hash_lock)->rlock){+.-.}, at:
      rxrpc_peer_keepalive_worker+0x6b3/0xd02 /net/rxrpc/peer_event.c:430
      
      Fixes: 330bdcfa ("rxrpc: Fix the keepalive generator [ver #2]")
      Reported-by: <syzbot+72af434e4b3417318f84@syzkaller.appspotmail.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
      Reviewed-by: Jeffrey Altman <jaltman@auristor.com>
    • Revert "mac80211: set NETIF_F_LLTX when using intermediate tx queues" · eef347f8
      Johannes Berg authored
      
      
      Revert this for now; it has been reported multiple times that it
      completely breaks connectivity on various devices.
      
      Cc: stable@vger.kernel.org
      Fixes: 8dbb000e ("mac80211: set NETIF_F_LLTX when using intermediate tx queues")
      Reported-by: Jean Delvare <jdelvare@suse.de>
      Reported-by: Peter Lebbing <peter@digitalbrains.com>
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
    • netfilter: ebtables: also count base chain policies · 3b48300d
      Florian Westphal authored
      
      
      ebtables doesn't include the base chain policies in the rule count,
      so we need to add them manually when we call into the x_tables core
      to allocate space for the compat offset table.

      This led syzbot to trigger:
      WARNING: CPU: 1 PID: 9012 at net/netfilter/x_tables.c:649
      xt_compat_add_offset.cold+0x11/0x36 net/netfilter/x_tables.c:649
      
      Reported-by: <syzbot+276ddebab3382bbf72db@syzkaller.appspotmail.com>
      Fixes: 2035f3ff ("netfilter: ebtables: compat: un-break 32bit setsockopt when no rules are present")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  6. Jul 29, 2019
    • net: sctp: drop unneeded likely() call around IS_ERR() · d4e575ba
      Enrico Weigelt authored
      
      
      IS_ERR() already calls unlikely(), so this extra unlikely() call
      around IS_ERR() is not needed.
      
      Signed-off-by: Enrico Weigelt <info@metux.net>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netfilter: ipset: Fix rename concurrency with listing · 6c1f7e2c
      Jozsef Kadlecsik authored
      
      
      Shijie Luo reported that when stress-testing ipset with multiple concurrent
      create, rename, flush, list, destroy commands, it can result in

      ipset <version>: Broken LIST kernel message: missing DATA part!

      error messages and broken list results. The problem was that the rename
      operation was not properly handled with respect to listing. The patch
      fixes the issue.
      
      Reported-by: Shijie Luo <luoshijie1@huawei.com>
      Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
    • netfilter: ipset: Copy the right MAC address in bitmap:ip,mac and hash:ip,mac sets · 1b4a7510
      Stefano Brivio authored
      
      
      In commit 8cc4ccf5 ("ipset: Allow matching on destination MAC address
      for mac and ipmac sets"), ipset.git commit 1543514c46a7, I added to the
      KADT functions for sets matching on MAC addresses the copy of source or
      destination MAC address depending on the configured match.
      
      This was done correctly for hash:mac, but for hash:ip,mac and
      bitmap:ip,mac, copying and pasting the same code block presents an
      obvious problem: in these two set types, the MAC address is the second
      dimension, not the first one, and we are actually selecting the MAC
      address depending on whether the first dimension (IP address) specifies
      source or destination.
      
      Fix this by checking for the IPSET_DIM_TWO_SRC flag in option flags.
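
      The difference between the two checks can be sketched as follows. The
      flag values are assumptions written from memory of the ipset headers
      (one "source" bit per dimension), and pick_mac() is a hypothetical
      helper, not the KADT code.

```python
# Assumed flag layout: bit N set means "dimension N matches on source".
IPSET_DIM_ONE_SRC = 1 << 1
IPSET_DIM_TWO_SRC = 1 << 2


def pick_mac(flags, src_mac, dst_mac, fixed=True):
    """Choose which MAC address an ip,mac set should compare against.

    In ip,mac sets the MAC is the second dimension, so the fixed code
    consults the second dimension's flag. The buggy code looked at the
    first dimension's flag, which describes the IP address instead.
    """
    bit = IPSET_DIM_TWO_SRC if fixed else IPSET_DIM_ONE_SRC
    return src_mac if flags & bit else dst_mac
```

      For a "src,dst" match only the first dimension's source bit is set, so
      the fixed check returns the destination MAC while the old check
      returned the source MAC.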
      
      This way, mixing source and destination matches for the two dimensions
      of ip,mac set types works as expected. With this setup:
      
        ip netns add A
        ip link add veth1 type veth peer name veth2 netns A
        ip addr add 192.0.2.1/24 dev veth1
        ip -net A addr add 192.0.2.2/24 dev veth2
        ip link set veth1 up
        ip -net A link set veth2 up
      
        dst=$(ip netns exec A cat /sys/class/net/veth2/address)
      
        ip netns exec A ipset create test_bitmap bitmap:ip,mac range 192.0.0.0/16
        ip netns exec A ipset add test_bitmap 192.0.2.1,${dst}
        ip netns exec A iptables -A INPUT -m set ! --match-set test_bitmap src,dst -j DROP
      
        ip netns exec A ipset create test_hash hash:ip,mac
        ip netns exec A ipset add test_hash 192.0.2.1,${dst}
        ip netns exec A iptables -A INPUT -m set ! --match-set test_hash src,dst -j DROP
      
      ipset correctly matches a test packet:
      
        # ping -c1 192.0.2.2 >/dev/null
        # echo $?
        0
      
      Reported-by: Chen Yi <yiche@redhat.com>
      Fixes: 8cc4ccf5 ("ipset: Allow matching on destination MAC address for mac and ipmac sets")
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
    • netfilter: ipset: Actually allow destination MAC address for hash:ip,mac sets too · b89d1548
      Stefano Brivio authored
      
      
      In commit 8cc4ccf5 ("ipset: Allow matching on destination MAC address
      for mac and ipmac sets"), ipset.git commit 1543514c46a7, I removed the
      KADT check that prevents matching on destination MAC addresses for
      hash:mac sets, but forgot to remove the same check for hash:ip,mac set.
      
      Drop this check: functionality is now commented in man pages and there's
      no reason to restrict to source MAC address matching anymore.
      
      Reported-by: Chen Yi <yiche@redhat.com>
      Fixes: 8cc4ccf5 ("ipset: Allow matching on destination MAC address for mac and ipmac sets")
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
    • net: fix ifindex collision during namespace removal · 55b40dbf
      Jiri Pirko authored
      
      
      Commit aca51397 ("netns: Fix arbitrary net_device-s corruptions
      on net_ns stop.") made it possible to hit a BUG when a device is
      returned to init_net and the following two conditions are met:
      1) dev->ifindex value is used in a name of another "dev%d"
         device in init_net.
      2) dev->name is used by another device in init_net.
      
      This is hard to hit under real-life circumstances, so the bug has
      gone unnoticed for over 10 years. To reproduce:
      
      $ ip a
      1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
          link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
          inet 127.0.0.1/8 scope host lo
             valid_lft forever preferred_lft forever
          inet6 ::1/128 scope host
             valid_lft forever preferred_lft forever
      2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
          link/ether 86:89:3f:86:61:29 brd ff:ff:ff:ff:ff:ff
      3: enp0s2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
          link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
      $ ip netns add ns1
      $ ip -n ns1 link add dummy1ns1 type dummy
      $ ip -n ns1 link add dummy2ns1 type dummy
      $ ip link set enp0s2 netns ns1
      $ ip -n ns1 link set enp0s2 name dummy0
      [  100.858894] virtio_net virtio0 dummy0: renamed from enp0s2
      $ ip link add dev4 type dummy
      $ ip -n ns1 a
      1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
          link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
      2: dummy1ns1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
          link/ether 16:63:4c:38:3e:ff brd ff:ff:ff:ff:ff:ff
      3: dummy2ns1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
          link/ether aa:9e:86:dd:6b:5d brd ff:ff:ff:ff:ff:ff
      4: dummy0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
          link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
      $ ip a
      1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
          link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
          inet 127.0.0.1/8 scope host lo
             valid_lft forever preferred_lft forever
          inet6 ::1/128 scope host
             valid_lft forever preferred_lft forever
      2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
          link/ether 86:89:3f:86:61:29 brd ff:ff:ff:ff:ff:ff
      4: dev4: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
          link/ether 5a:e1:4a:b6:ec:f8 brd ff:ff:ff:ff:ff:ff
      $ ip netns del ns1
      [  158.717795] default_device_exit: failed to move dummy0 to init_net: -17
      [  158.719316] ------------[ cut here ]------------
      [  158.720591] kernel BUG at net/core/dev.c:9824!
      [  158.722260] invalid opcode: 0000 [#1] SMP KASAN PTI
      [  158.723728] CPU: 0 PID: 56 Comm: kworker/u2:1 Not tainted 5.3.0-rc1+ #18
      [  158.725422] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
      [  158.727508] Workqueue: netns cleanup_net
      [  158.728915] RIP: 0010:default_device_exit.cold+0x1d/0x1f
      [  158.730683] Code: 84 e8 18 c9 3e fe 0f 0b e9 70 90 ff ff e8 36 e4 52 fe 89 d9 4c 89 e2 48 c7 c6 80 d6 25 84 48 c7 c7 20 c0 25 84 e8 f4 c8 3e
      [  158.736854] RSP: 0018:ffff8880347e7b90 EFLAGS: 00010282
      [  158.738752] RAX: 000000000000003b RBX: 00000000ffffffef RCX: 0000000000000000
      [  158.741369] RDX: 0000000000000000 RSI: ffffffff8128013d RDI: ffffed10068fcf64
      [  158.743418] RBP: ffff888033550170 R08: 000000000000003b R09: fffffbfff0b94b9c
      [  158.745626] R10: fffffbfff0b94b9b R11: ffffffff85ca5cdf R12: ffff888032f28000
      [  158.748405] R13: dffffc0000000000 R14: ffff8880335501b8 R15: 1ffff110068fcf72
      [  158.750638] FS:  0000000000000000(0000) GS:ffff888036000000(0000) knlGS:0000000000000000
      [  158.752944] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  158.755245] CR2: 00007fe8b45d21d0 CR3: 00000000340b4005 CR4: 0000000000360ef0
      [  158.757654] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  158.760012] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  158.762758] Call Trace:
      [  158.763882]  ? dev_change_net_namespace+0xbb0/0xbb0
      [  158.766148]  ? devlink_nl_cmd_set_doit+0x520/0x520
      [  158.768034]  ? dev_change_net_namespace+0xbb0/0xbb0
      [  158.769870]  ops_exit_list.isra.0+0xa8/0x150
      [  158.771544]  cleanup_net+0x446/0x8f0
      [  158.772945]  ? unregister_pernet_operations+0x4a0/0x4a0
      [  158.775294]  process_one_work+0xa1a/0x1740
      [  158.776896]  ? pwq_dec_nr_in_flight+0x310/0x310
      [  158.779143]  ? do_raw_spin_lock+0x11b/0x280
      [  158.780848]  worker_thread+0x9e/0x1060
      [  158.782500]  ? process_one_work+0x1740/0x1740
      [  158.784454]  kthread+0x31b/0x420
      [  158.786082]  ? __kthread_create_on_node+0x3f0/0x3f0
      [  158.788286]  ret_from_fork+0x3a/0x50
      [  158.789871] ---[ end trace defd6c657c71f936 ]---
      [  158.792273] RIP: 0010:default_device_exit.cold+0x1d/0x1f
      [  158.795478] Code: 84 e8 18 c9 3e fe 0f 0b e9 70 90 ff ff e8 36 e4 52 fe 89 d9 4c 89 e2 48 c7 c6 80 d6 25 84 48 c7 c7 20 c0 25 84 e8 f4 c8 3e
      [  158.804854] RSP: 0018:ffff8880347e7b90 EFLAGS: 00010282
      [  158.807865] RAX: 000000000000003b RBX: 00000000ffffffef RCX: 0000000000000000
      [  158.811794] RDX: 0000000000000000 RSI: ffffffff8128013d RDI: ffffed10068fcf64
      [  158.816652] RBP: ffff888033550170 R08: 000000000000003b R09: fffffbfff0b94b9c
      [  158.820930] R10: fffffbfff0b94b9b R11: ffffffff85ca5cdf R12: ffff888032f28000
      [  158.825113] R13: dffffc0000000000 R14: ffff8880335501b8 R15: 1ffff110068fcf72
      [  158.829899] FS:  0000000000000000(0000) GS:ffff888036000000(0000) knlGS:0000000000000000
      [  158.834923] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  158.838164] CR2: 00007fe8b45d21d0 CR3: 00000000340b4005 CR4: 0000000000360ef0
      [  158.841917] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  158.845149] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      
      Fix this by checking if a device with the same name exists in init_net
      and fall back to the original code - dev%d to allocate name - in case it
      does.
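
      The naming fallback can be sketched like this. It is a hedged model of
      the logic described, not the kernel function: fallback_name() is a
      hypothetical helper, and it assumes that a "dev%d" pattern resolves to
      the lowest free number.

```python
def fallback_name(ifindex, init_net_names):
    """Pick the name a device gets when it is moved back to init_net."""
    candidate = f"dev{ifindex}"
    if candidate not in init_net_names:
        # Common case: "dev<ifindex>" is free, so use it directly.
        return candidate
    # The name is taken: let the "dev%d" pattern allocate a free one
    # (assumed here to be the lowest unused index).
    n = 0
    while f"dev{n}" in init_net_names:
        n += 1
    return f"dev{n}"
```

      In the reproducer, the returning device has ifindex 4 while dev4
      already exists in init_net, so the sketch picks another free devN
      instead of failing with -17 (EEXIST).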
      
      This was found using syzkaller.
      
      Fixes: aca51397 ("netns: Fix arbitrary net_device-s corruptions on net_ns stop.")
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/af_iucv: mark expected switch fall-throughs · 05bba1ed
      Gustavo A. R. Silva authored
      
      
      Mark switch cases where we are expecting to fall through.
      
      This patch fixes the following warnings:
      
      net/iucv/af_iucv.c: warning: this statement may fall
      through [-Wimplicit-fallthrough=]:  => 537:3, 519:6, 2246:6, 510:6
      
      Notice that, in this particular case, the code comment is
      modified in accordance with what GCC is expecting to find.
      
      Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: delete local fdb on device init failure · d7bae09f
      Nikolay Aleksandrov authored
      
      
      On initialization failure we have to delete the local fdb which was
      inserted due to the default pvid creation. This problem has been present
      since the inception of default_pvid. Note that currently there are 2 cases:
      1) in br_dev_init() when br_multicast_init() fails
      2) if register_netdevice() fails after calling ndo_init()
      
      This patch takes care of both since br_vlan_flush() is called on both
      occasions. Also the new fdb delete would be a no-op on normal bridge
      device destruction since the local fdb would've been already flushed by
      br_dev_delete(). This is not an issue for ports since nbp_vlan_init() is
      called last when adding a port thus nothing can fail after it.
      
      Reported-by: <syzbot+88533dc8b582309bf3ee@syzkaller.appspotmail.com>
      Fixes: 5be5a2df ("bridge: Add filtering support for default_pvid")
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: Fix a possible null-pointer dereference in dequeue_func() · 051c7b39
      Jia-Ju Bai authored
      
      
      In dequeue_func(), there is an if statement on line 74 to check whether
      skb is NULL:
          if (skb)
      
      When skb is NULL, it is used on line 77:
          prefetch(&skb->end);
      
      Thus, a possible null-pointer dereference may occur.
      
      To fix this bug, skb->end is used only when skb is not NULL.

      This bug was found by STCheck, a static analysis tool written by us.
      
      Fixes: 76e3cc12 ("codel: Controlled Delay AQM")
      Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mac80211: don't WARN on short WMM parameters from AP · 05aaa5c9
      Brian Norris authored
      
      
      In a very similar spirit to commit c470bdc1 ("mac80211: don't WARN
      on bad WMM parameters from buggy APs"), an AP may not transmit a
      fully-formed WMM IE. For example, it may miss or repeat an Access
      Category. The above loop won't catch that and will instead leave one of
      the four ACs zeroed out. This triggers the following warning in
      drv_conf_tx()
      
        wlan0: invalid CW_min/CW_max: 0/0
      
      and it may leave one of the hardware queues unconfigured. If we detect
      such a case, let's just print a warning and fall back to the defaults.
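
      The detection step can be sketched as below. This is an illustrative
      model only: wmm_params() is a hypothetical helper, the AC names are
      shorthand, and the DEFAULTS values are placeholders rather than the
      defaults mac80211 applies.

```python
ACS = ("BE", "BK", "VI", "VO")

# Placeholder defaults for illustration only; not mac80211's real values.
DEFAULTS = {ac: (15, 1023) for ac in ACS}


def wmm_params(records):
    """Build per-AC (cw_min, cw_max) from (ac, cw_min, cw_max) records.

    Returns (params, used_defaults). If the IE skipped or repeated an
    Access Category, one AC is still zeroed after parsing and the whole
    set falls back to the defaults.
    """
    params = {ac: (0, 0) for ac in ACS}
    for ac, cw_min, cw_max in records:
        params[ac] = (cw_min, cw_max)   # a repeated AC simply overwrites
    if any(p == (0, 0) for p in params.values()):
        return dict(DEFAULTS), True
    return params, False
```

      An IE that lists VI twice and never lists VO leaves VO at 0/0, which
      the check catches before the values would reach the driver.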
      
      Tested with a hacked version of hostapd, intentionally corrupting the
      IEs in hostapd_eid_wmm().
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Brian Norris <briannorris@chromium.org>
      Link: https://lore.kernel.org/r/20190726224758.210953-1-briannorris@chromium.org
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
  7. Jul 27, 2019
  8. Jul 26, 2019
  9. Jul 25, 2019
  10. Jul 24, 2019
  11. Jul 23, 2019
    • Eric Dumazet's avatar
      bpf: fix access to skb_shared_info->gso_segs · 06a22d89
      Eric Dumazet authored
      
      
      It is possible we reach bpf_convert_ctx_access() with
      si->dst_reg == si->src_reg
      
      Therefore, we need to load BPF_REG_AX before eventually
      mangling si->src_reg.
      
      syzbot generated this x86 code:
         3:   55                      push   %rbp
         4:   48 89 e5                mov    %rsp,%rbp
         7:   48 81 ec 00 00 00 00    sub    $0x0,%rsp // Might be avoided ?
         e:   53                      push   %rbx
         f:   41 55                   push   %r13
        11:   41 56                   push   %r14
        13:   41 57                   push   %r15
        15:   6a 00                   pushq  $0x0
        17:   31 c0                   xor    %eax,%eax
        19:   48 8b bf c0 00 00 00    mov    0xc0(%rdi),%rdi
        20:   44 8b 97 bc 00 00 00    mov    0xbc(%rdi),%r10d
        27:   4c 01 d7                add    %r10,%rdi
        2a:   48 0f b7 7f 06          movzwq 0x6(%rdi),%rdi // Crash
        2f:   5b                      pop    %rbx
        30:   41 5f                   pop    %r15
        32:   41 5e                   pop    %r14
        34:   41 5d                   pop    %r13
        36:   5b                      pop    %rbx
        37:   c9                      leaveq
        38:   c3                      retq
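
The aliasing problem can be illustrated with a toy register machine in C. Everything here is hypothetical (`struct vm`, `load()`, `REG_AX`, the offsets, and the two `emit_*` helpers); it only models the ordering issue, not the real BPF instruction rewriting. The sequence computes `mem[base + HEAD_OFF] + mem[base + END_OFF]`, where `base` is the value in the source register.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_REGS 4
#define REG_AX   3   /* scratch register, standing in for BPF_REG_AX */
#define HEAD_OFF 0
#define END_OFF  1

struct vm {
    uint64_t reg[NUM_REGS];
    const uint64_t *mem;
};

/* reg[dst] = mem[reg[src] + off] */
static void load(struct vm *vm, int dst, int src, uint64_t off)
{
    assert(dst >= 0 && dst < NUM_REGS && src >= 0 && src < NUM_REGS);
    vm->reg[dst] = vm->mem[vm->reg[src] + off];
}

/* Buggy order: when dst == src, the first load destroys the base. */
static void emit_buggy(struct vm *vm, int dst, int src)
{
    load(vm, dst, src, HEAD_OFF);
    load(vm, REG_AX, src, END_OFF);   /* reads through a clobbered base */
    vm->reg[dst] += vm->reg[REG_AX];
}

/* Fixed order: fill the scratch register while src is still intact. */
static void emit_fixed(struct vm *vm, int dst, int src)
{
    load(vm, REG_AX, src, END_OFF);
    load(vm, dst, src, HEAD_OFF);
    vm->reg[dst] += vm->reg[REG_AX];
}
```

With memory `{2, 5, 0, 100}` and the base register holding 0, `emit_fixed()` yields 7 whether or not `dst == src`, while `emit_buggy()` with `dst == src` reads `mem[3]` and yields 102. The sample memory is sized so that the stray read stays in bounds; in the crash above it did not.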
      
      Fixes: d9ff286a ("bpf: allow BPF programs access skb_shared_info->gso_segs field")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      06a22d89
  12. Jul 22, 2019
    • John Fastabend's avatar
      bpf: sockmap/tls, close can race with map free · 95fa1454
      John Fastabend authored
      
      
      When a map free is called and in parallel a socket is closed we
      have two paths that can potentially reset the socket prot ops, the
      bpf close() path and the map free path. This creates a problem
      with which prot ops should be used from the socket closed side.
      
      If the map_free side completes first then we want to call the
      original lowest level ops. However, if the tls path runs first
      we want to call the sockmap ops. Additionally there was no locking
      around prot updates in TLS code paths so the prot ops could
      be changed multiple times once from TLS path and again from sockmap
      side potentially leaving ops pointed at either TLS or sockmap
      when psock and/or tls context have already been destroyed.
      
      To fix this race, first only update ops inside the callback lock
      so that TLS, sockmap and lowest level all agree on prot state.
      Second, add a ULP callback update() so that lower layers can
      inform the upper layer when they are being removed, allowing the
      upper layer to reset prot ops.
      
      This gets us close to allowing sockmap and tls to be stacked
      in arbitrary order but will save that patch for *next trees.
      
      v4:
       - make sure we don't free things for device;
       - remove the checks which swap the callbacks back
         only if TLS is at the top.
      
      Reported-by: <syzbot+06537213db7ba2745c4a@syzkaller.appspotmail.com>
      Fixes: 02c558b2 ("bpf: sockmap, support for msg_peek in sk_msg with redirect ingress")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      95fa1454
    • John Fastabend's avatar
      bpf: sockmap, only create entry if ulp is not already enabled · 0e858739
      John Fastabend authored
      
      
      Sockmap does not currently support adding sockets after TLS has been
      enabled. There never was a real use case for this, so it was never
      added. But we lost the test for ULP at some point, so add it here
      and fail the socket insert if TLS is enabled. Future work could
      make sockmap support this use case, but fix up the bug here.
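
As a rough illustration of "fail the insert if a ULP is already attached", here is a userspace sketch. `struct fake_sock`, `struct fake_map` and `map_insert()` are hypothetical, and the specific error codes are assumptions of this sketch, not taken from the commit text.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAP_SLOTS 4

/* Hypothetical socket: ulp_data is non-NULL once a ULP such as TLS is set. */
struct fake_sock {
    void *ulp_data;
};

struct fake_map {
    struct fake_sock *slots[MAP_SLOTS];
};

/* Insert sk at idx, refusing sockets that already have a ULP attached. */
static int map_insert(struct fake_map *m, size_t idx, struct fake_sock *sk)
{
    assert(m != NULL && sk != NULL);

    if (idx >= MAP_SLOTS)
        return -E2BIG;
    if (sk->ulp_data)          /* the re-added ULP test */
        return -EINVAL;
    if (m->slots[idx])
        return -EEXIST;

    m->slots[idx] = sk;
    return 0;
}
```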
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      0e858739
    • John Fastabend's avatar
      bpf: sockmap, synchronize_rcu before free'ing map · 2bb90e5c
      John Fastabend authored
      
      
      We need to have a synchronize_rcu before free'ing the sockmap because
      any outstanding psock references will have a pointer to the map, and
      using that pointer after the map is freed could trigger a use-after-free.
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      2bb90e5c
    • John Fastabend's avatar
      bpf: sockmap, sock_map_delete needs to use xchg · 45a4521d
      John Fastabend authored
      
      
      __sock_map_delete() may be called from a tcp event such as unhash or
      close from the following trace,
      
        tcp_bpf_close()
          tcp_bpf_remove()
            sk_psock_unlink()
              sock_map_delete_from_link()
                __sock_map_delete()
      
      In this case the sock lock is held but this only protects against
      duplicate removals on the TCP side. If the map is free'd then we have
      this trace,
      
        sock_map_free
          xchg()                  <- replaces map entry
          sock_map_unref()
            sk_psock_put()
              sock_map_del_link()
      
      The __sock_map_delete() call however uses a read, test, null over the
      map entry which can result in both paths trying to free the map
      entry.
      
      To fix this, use xchg in the TCP paths as well, so we avoid having two
      references to the same map entry.
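
The difference between "read, test, null" and an atomic exchange can be sketched with C11 atomics. This is a userspace illustration only: `struct entry`, `entry_release()` and `slot_delete()` are hypothetical names, and `atomic_exchange()` from `<stdatomic.h>` stands in for the kernel's `xchg()`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical map entry; 'released' counts how often it was released. */
struct entry {
    int released;
};

static void entry_release(struct entry *e)
{
    e->released++;
}

/*
 * Atomically swap the slot with NULL and release whatever was there.
 * Only one caller can ever observe the non-NULL pointer, so the entry is
 * released at most once even if two paths race on the same slot.
 * Returns 0 if this caller released the entry, -1 if the slot was empty.
 */
static int slot_delete(_Atomic(struct entry *) *slot)
{
    assert(slot != NULL);

    struct entry *e = atomic_exchange(slot, (struct entry *)NULL);
    if (!e)
        return -1;

    entry_release(e);
    return 0;
}
```

A non-atomic version that reads the slot, tests it, and then writes NULL leaves a window in which a second path can read the same pointer, so both would call `entry_release()` on it.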
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      45a4521d
    • John Fastabend's avatar
      net/tls: fix transition through disconnect with close · 32857cf5
      John Fastabend authored
      
      
      It is possible (via shutdown()) for TCP socks to go through TCP_CLOSE
      state via tcp_disconnect() without actually calling tcp_close which
      would then call the tls close callback. Because of this a user could
      disconnect a socket then put it in a LISTEN state which would break
      our assumptions about sockets always being ESTABLISHED state.
      
      More directly, because close() can call unhash() and unhash() is
      implemented by sockmap, if a sockmap socket has TLS enabled we can
      incorrectly destroy the psock from unhash() and then call its close
      handler again. But because the psock (sockmap socket representation)
      is already destroyed we call close handler in sk->prot. However,
      in some cases (TLS BASE/BASE case) this will still point at the
      sockmap close handler resulting in a circular call and crash reported
      by syzbot.
      
      To fix both above issues implement the unhash() routine for TLS.
      
      v4:
       - add note about tls offload still needing the fix;
       - move sk_proto to the cold cache line;
       - split TX context free into "release" and "free",
         otherwise the GC work itself is in already freed
         memory;
       - move TX before RX for consistency;
       - reuse tls_ctx_free();
       - schedule the GC work after we're done with context
         to avoid UAF;
       - don't set the unhash in all modes, all modes "inherit"
         TLS_BASE's callbacks anyway;
       - disable the unhash hook for TLS_HW.
      
      Fixes: 3c4d7559 ("tls: kernel TLS support")
      Reported-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      32857cf5