  1. Feb 08, 2022
  2. Feb 03, 2022
    • ax25: fix reference count leaks of ax25_dev · 87563a04
      Duoming Zhou authored
      
      
      The previous commit d01ffb9e ("ax25: add refcount in ax25_dev
      to avoid UAF bugs") introduces refcount into ax25_dev, but there
      are reference leak paths in ax25_ctl_ioctl(), ax25_fwd_ioctl(),
      ax25_rt_add(), ax25_rt_del() and ax25_rt_opt().
      
      This patch uses ax25_dev_put() and adjusts the position of
      ax25_addr_ax25dev() to fix reference count leaks of ax25_dev.
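
      The kind of leak described above can be sketched in userspace C. All names
      here (struct dev, dev_hold(), dev_put(), lookup_dev(), do_ioctl_*) are
      hypothetical stand-ins and not the kernel's ax25 code; the sketch only
      assumes that a lookup hands back a reference that each exit path must drop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted device; not the kernel's struct ax25_dev. */
struct dev {
    int refcnt;
};

static void dev_hold(struct dev *d)
{
    d->refcnt++;
}

static void dev_put(struct dev *d)
{
    assert(d->refcnt > 0);
    d->refcnt--;
}

/* The lookup returns the device with one extra reference for the caller. */
static struct dev *lookup_dev(struct dev *d)
{
    dev_hold(d);
    return d;
}

/* Leaky variant: the early error return never drops the reference. */
int do_ioctl_leaky(struct dev *table, int arg)
{
    struct dev *d = lookup_dev(table);

    if (arg < 0)
        return -1;      /* reference from lookup_dev() is leaked here */
    dev_put(d);
    return 0;
}

/* Fixed variant: every exit path passes through dev_put(). */
int do_ioctl_fixed(struct dev *table, int arg)
{
    struct dev *d = lookup_dev(table);
    int ret = 0;

    if (arg < 0)
        ret = -1;
    dev_put(d);
    return ret;
}
```

      With a device that starts at refcnt 1, one failing call to the leaky
      variant leaves the count at 2, while the fixed variant leaves it unchanged.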
      
      Fixes: d01ffb9e ("ax25: add refcount in ax25_dev to avoid UAF bugs")
      Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
      Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
      Link: https://lore.kernel.org/r/20220203150811.42256-1-duoming@zju.edu.cn
      
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net, neigh: Do not trigger immediate probes on NUD_FAILED from neigh_managed_work · 4a81f6da
      Daniel Borkmann authored
      
      
      syzkaller was able to trigger a deadlock for NTF_MANAGED entries [0]:
      
        kworker/0:16/14617 is trying to acquire lock:
        ffffffff8d4dd370 (&tbl->lock){++-.}-{2:2}, at: ___neigh_create+0x9e1/0x2990 net/core/neighbour.c:652
        [...]
        but task is already holding lock:
        ffffffff8d4dd370 (&tbl->lock){++-.}-{2:2}, at: neigh_managed_work+0x35/0x250 net/core/neighbour.c:1572
      
      The neighbor entry transitioned to the NUD_FAILED state, where
      __neigh_event_send() triggered an immediate probe via neigh_probe(), as
      per commit cd28ca0a ("neigh: reduce arp latency"), while the table lock
      was held.
      
      One option to fix this situation is to defer the neigh_probe() back to
      the neigh_timer_handler() similarly as pre cd28ca0a. For the case
      of NTF_MANAGED, this deferral is acceptable given this only happens on
      actual failure state and regular / expected state is NUD_VALID with the
      entry already present.
      
      The fix adds a parameter to __neigh_event_send() in order to communicate
      whether immediate probe is allowed or disallowed. Existing call-sites
      of neigh_event_send() default as-is to immediate probe. However, the
      neigh_managed_work() disables it via use of neigh_event_send_probe().
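
      The idea of letting the caller state whether an immediate probe is safe
      can be modelled in a few lines of userspace C. This is a simplified
      sketch under assumptions: the flag-based "lock" and all names (struct
      table, struct entry, event_send(), managed_work(), timer_handler()) are
      hypothetical and do not reproduce the neighbour code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a non-recursive lock, represented by a flag. */
struct table {
    bool locked;
    int deadlocks;      /* attempts to re-acquire the lock while held */
};

struct entry {
    struct table *tbl;
    bool probe_pending; /* probe deferred to a later timer run */
    int probes_sent;
};

static bool table_lock(struct table *t)
{
    if (t->locked) {
        t->deadlocks++; /* a real non-recursive lock would hang here */
        return false;
    }
    t->locked = true;
    return true;
}

static void table_unlock(struct table *t)
{
    assert(t->locked);
    t->locked = false;
}

/* In this model, sending a probe needs the table lock. */
static void probe(struct entry *e)
{
    if (!table_lock(e->tbl))
        return;
    e->probes_sent++;
    table_unlock(e->tbl);
}

/* The caller says whether probing right away is safe in its context. */
void event_send(struct entry *e, bool immediate_ok)
{
    if (immediate_ok)
        probe(e);
    else
        e->probe_pending = true;    /* left for the timer handler */
}

/* Periodic work that handles an entry with the table lock held. */
void managed_work(struct entry *e, bool immediate_ok)
{
    if (!table_lock(e->tbl))
        return;
    event_send(e, immediate_ok);
    table_unlock(e->tbl);
}

/* The timer handler runs without the table lock and sends deferred probes. */
void timer_handler(struct entry *e)
{
    if (e->probe_pending) {
        e->probe_pending = false;
        probe(e);
    }
}
```

      Calling managed_work() with immediate_ok set to true records a lock
      re-acquisition attempt, while passing false defers the probe until
      timer_handler() runs outside the lock.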
      
      [0] <TASK>
        __dump_stack lib/dump_stack.c:88 [inline]
        dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
        print_deadlock_bug kernel/locking/lockdep.c:2956 [inline]
        check_deadlock kernel/locking/lockdep.c:2999 [inline]
        validate_chain kernel/locking/lockdep.c:3788 [inline]
        __lock_acquire.cold+0x149/0x3ab kernel/locking/lockdep.c:5027
        lock_acquire kernel/locking/lockdep.c:5639 [inline]
        lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5604
        __raw_write_lock_bh include/linux/rwlock_api_smp.h:202 [inline]
        _raw_write_lock_bh+0x2f/0x40 kernel/locking/spinlock.c:334
        ___neigh_create+0x9e1/0x2990 net/core/neighbour.c:652
        ip6_finish_output2+0x1070/0x14f0 net/ipv6/ip6_output.c:123
        __ip6_finish_output net/ipv6/ip6_output.c:191 [inline]
        __ip6_finish_output+0x61e/0xe90 net/ipv6/ip6_output.c:170
        ip6_finish_output+0x32/0x200 net/ipv6/ip6_output.c:201
        NF_HOOK_COND include/linux/netfilter.h:296 [inline]
        ip6_output+0x1e4/0x530 net/ipv6/ip6_output.c:224
        dst_output include/net/dst.h:451 [inline]
        NF_HOOK include/linux/netfilter.h:307 [inline]
        ndisc_send_skb+0xa99/0x17f0 net/ipv6/ndisc.c:508
        ndisc_send_ns+0x3a9/0x840 net/ipv6/ndisc.c:650
        ndisc_solicit+0x2cd/0x4f0 net/ipv6/ndisc.c:742
        neigh_probe+0xc2/0x110 net/core/neighbour.c:1040
        __neigh_event_send+0x37d/0x1570 net/core/neighbour.c:1201
        neigh_event_send include/net/neighbour.h:470 [inline]
        neigh_managed_work+0x162/0x250 net/core/neighbour.c:1574
        process_one_work+0x9ac/0x1650 kernel/workqueue.c:2307
        worker_thread+0x657/0x1110 kernel/workqueue.c:2454
        kthread+0x2e9/0x3a0 kernel/kthread.c:377
        ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
        </TASK>
      
      Fixes: 7482e384 ("net, neigh: Add NTF_MANAGED flag for managed neighbor entries")
      Reported-by: <syzbot+5239d0e1778a500d477a@syzkaller.appspotmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Roopa Prabhu <roopa@nvidia.com>
      Tested-by: <syzbot+5239d0e1778a500d477a@syzkaller.appspotmail.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20220201193942.5055-1-daniel@iogearbox.net
      
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: add missing tcp_skb_can_collapse() test in tcp_shift_skb_data() · b67985be
      Eric Dumazet authored
      
      
      tcp_shift_skb_data() might collapse three packets into a larger one.
      
      P_A, P_B, P_C  -> P_ABC
      
      Historically, it used a single tcp_skb_can_collapse_to(P_A) call,
      because it was enough.
      
      In commit 85712484 ("tcp: coalesce/collapse must respect MPTCP extensions"),
      this call was replaced by a call to tcp_skb_can_collapse(P_A, P_B)
      
      But the now needed test over P_C has been missed.
      
      This probably broke MPTCP.
      
      Then later, commit 9b65b17d ("net: avoid double accounting for pure zerocopy skbs")
      added an extra condition to tcp_skb_can_collapse(), but the missing call
      from tcp_shift_skb_data() is also breaking TCP zerocopy, because P_A and P_C
      might have different skb_zcopy_pure() status.
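
      A toy model of the missing test, assuming a pairwise compatibility check
      on two per-packet properties. The struct and function names are
      hypothetical, and the two flags only stand in for the MPTCP-extension and
      pure-zerocopy conditions mentioned above; this is not the kernel's
      tcp_skb_can_collapse().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical packet descriptor with two properties that must match. */
struct pkt {
    bool has_ext;
    bool zcopy_pure;
};

/* Pairwise compatibility: both properties have to agree. */
static bool can_collapse(const struct pkt *to, const struct pkt *from)
{
    assert(to && from);
    return to->has_ext == from->has_ext &&
           to->zcopy_pure == from->zcopy_pure;
}

/* Incomplete check: only P_A and P_B are compared, P_C is ignored. */
bool can_merge_three_incomplete(const struct pkt *a, const struct pkt *b,
                                const struct pkt *c)
{
    (void)c;
    return can_collapse(a, b);
}

/* Complete check: P_C is tested against P_A as well. */
bool can_merge_three(const struct pkt *a, const struct pkt *b,
                     const struct pkt *c)
{
    return can_collapse(a, b) && can_collapse(a, c);
}
```

      When P_C differs from P_A in its zcopy_pure flag, the incomplete check
      still allows the merge and the complete one rejects it.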
      
      Fixes: 85712484 ("tcp: coalesce/collapse must respect MPTCP extensions")
      Fixes: 9b65b17d ("net: avoid double accounting for pure zerocopy skbs")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Cc: Talal Ahmad <talalahmad@google.com>
      Cc: Arjun Roy <arjunroy@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Link: https://lore.kernel.org/r/20220201184640.756716-1-eric.dumazet@gmail.com
      
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  3. Feb 02, 2022
    • libceph: optionally use bounce buffer on recv path in crc mode · 038b8d1d
      Ilya Dryomov authored
      Both msgr1 and msgr2 in crc mode are zero copy in the sense that
      message data is read from the socket directly into the destination
      buffer.  We assume that the destination buffer is stable (i.e. remains
      unchanged while it is being read to) though.  Otherwise, CRC errors
      ensue:
      
        libceph: read_partial_message 0000000048edf8ad data crc 1063286393 != exp. 228122706
        libceph: osd1 (1)192.168.122.1:6843 bad crc/signature
      
        libceph: bad data crc, calculated 57958023, expected 1805382778
        libceph: osd2 (2)192.168.122.1:6876 integrity error, bad crc
      
      Introduce rxbounce option to enable use of a bounce buffer when
      receiving message data.  In particular this is needed if a mapped
      image is a Windows VM disk, passed to QEMU.  Windows has a system-wide
      "dummy" page that may be mapped into the destination buffer (potentially
      more than once into the same buffer) by the Windows Memory Manager in
      an effort to generate a single large I/O [1][2].  QEMU makes a point of
      preserving overlap relationships when cloning I/O vectors, so krbd gets
      exposed to this behaviour.
      
      [1] "What Is Really in That MDL?"
          https://docs.microsoft.com/en-us/previous-versions/windows/hardware/design/dn614012(v=vs.85)
      [2] https://blogs.msmvps.com/kernelmustard/2005/05/04/dummy-pages/
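
      Why a bounce buffer helps can be illustrated with a small userspace
      model. It is only a sketch under assumptions: the toy checksum stands in
      for the real CRC, scribble() simulates some other party modifying the
      destination while it is being filled, and none of the names come from
      libceph.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy checksum standing in for a real CRC; only used to compare buffers. */
uint32_t checksum(const unsigned char *p, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < len; i++)
        sum = sum * 31 + p[i];
    return sum;
}

typedef void (*interfere_fn)(void *ctx);

/* Simulated concurrent writer: flips the first byte of the buffer in ctx. */
void scribble(void *ctx)
{
    ((unsigned char *)ctx)[0] ^= 0xff;
}

/* Zero-copy style: data lands in dst, and the checksum is taken over dst. */
uint32_t recv_direct(const unsigned char *wire, size_t len,
                     unsigned char *dst, interfere_fn interfere, void *ctx)
{
    assert(wire && dst);
    memcpy(dst, wire, len);     /* "socket read" into the destination */
    interfere(ctx);             /* destination changes underneath us */
    return checksum(dst, len);
}

/* Bounce style: data lands in a private buffer that nobody else touches. */
uint32_t recv_bounce(const unsigned char *wire, size_t len,
                     unsigned char *dst, unsigned char *bounce,
                     interfere_fn interfere, void *ctx)
{
    uint32_t sum;

    assert(wire && dst && bounce);
    memcpy(bounce, wire, len);  /* "socket read" into the bounce buffer */
    interfere(ctx);             /* may still hit dst, but not bounce */
    sum = checksum(bounce, len);
    memcpy(dst, bounce, len);   /* extra copy is the price of stability */
    return sum;
}
```

      With scribble() aimed at dst, recv_direct() returns a checksum that no
      longer matches the wire data, while recv_bounce() returns the expected
      value at the cost of one additional copy.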
      
      URL: https://bugzilla.redhat.com/show_bug.cgi?id=1973317
      
      
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
    • libceph: make recv path in secure mode work the same as send path · 2ea88716
      Ilya Dryomov authored
      The recv path of secure mode is intertwined with that of crc mode.
      While it's slightly more efficient that way (the ciphertext is read
      into the destination buffer and decrypted in place, thus avoiding
      two potentially heavy memory allocations for the bounce buffer and
      the corresponding sg array), it isn't really amenable to changes.
      Sacrifice that edge and align with the send path which always uses
      a full-sized bounce buffer (currently there is no other way -- if
      the kernel crypto API ever grows support for streaming (piecewise)
      en/decryption for GCM [1], we would be able to easily take advantage
      of that on both sides).
      
      [1] https://lore.kernel.org/all/20141225202830.GA18794@gondor.apana.org.au/
      
      
      
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
    • Partially revert "net/smc: Add netlink net namespace support" · c86d8613
      Dmitry V. Levin authored
      
      
      The change of sizeof(struct smc_diag_linkinfo) by commit 79d39fc5
      ("net/smc: Add netlink net namespace support") introduced an ABI
      regression: since struct smc_diag_lgrinfo contains an object of
      type "struct smc_diag_linkinfo", the offsets of all subsequent
      members of struct smc_diag_lgrinfo changed as well.
      
      As a result, applications compiled with the old version
      of struct smc_diag_linkinfo will receive garbage in
      struct smc_diag_lgrinfo.role if the kernel implements
      this new version of struct smc_diag_linkinfo.
      
      Fix this regression by reverting the part of commit 79d39fc5 that
      changes struct smc_diag_linkinfo.  After all, there is SMC_GEN_NETLINK
      interface which is good enough, so there is probably no need to touch
      the smc_diag ABI in the first place.
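
      The effect of growing an embedded struct can be checked with offsetof().
      The layouts below are hypothetical and do not reproduce the real
      smc_diag structures; they only assume an outer struct that embeds the
      changed one ahead of a role member.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "old" embedded struct. */
struct linkinfo_v1 {
    uint8_t link_id;
    uint8_t name[16];
};

/* Hypothetical "new" version: one added member grows the struct. */
struct linkinfo_v2 {
    uint8_t link_id;
    uint8_t name[16];
    uint32_t extra;
};

/* The outer struct embeds the link info ahead of its own members. */
struct lgrinfo_v1 {
    struct linkinfo_v1 lnk[1];
    uint8_t role;
};

struct lgrinfo_v2 {
    struct linkinfo_v2 lnk[1];
    uint8_t role;
};

static_assert(sizeof(struct linkinfo_v2) > sizeof(struct linkinfo_v1),
              "the new version is larger than the old one");

/* Where an old binary expects role versus where a new kernel puts it. */
size_t role_offset_v1(void)
{
    return offsetof(struct lgrinfo_v1, role);
}

size_t role_offset_v2(void)
{
    return offsetof(struct lgrinfo_v2, role);
}
```

      A reader built against the v1 layout looks for role at
      role_offset_v1(), which in the v2 layout still falls inside the
      enlarged embedded struct.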
      
      Fixes: 79d39fc5 ("net/smc: Add netlink net namespace support")
      Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
      Reviewed-by: Karsten Graul <kgraul@linux.ibm.com>
      Link: https://lore.kernel.org/r/20220202030904.GA9742@altlinux.org
      
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: fix mem under-charging with zerocopy sendmsg() · 479f5547
      Eric Dumazet authored
      
      
      We got reports of the following warning in inet_sock_destruct():
      
      	WARN_ON(sk_forward_alloc_get(sk));
      
      Whenever we add a non zero-copy fragment to a pure zerocopy skb,
      we have to anticipate that whole skb->truesize will be uncharged
      when skb is finally freed.
      
      skb->data_len is the payload length. But the memory truesize
      estimated by __zerocopy_sg_from_iter() is page aligned.
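
      The gap between a payload length and a page-aligned estimate is plain
      arithmetic. The sketch below assumes a 4096-byte page and uses
      hypothetical helper names; it does not reproduce the kernel's accounting,
      only the rounding that makes the two quantities differ.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size for this sketch */

/* Round len up to the next multiple of the page size. */
size_t page_align(size_t len)
{
    return (len + SKETCH_PAGE_SIZE - 1) & ~((size_t)SKETCH_PAGE_SIZE - 1);
}

/* Bytes left unaccounted if only the payload length is charged while the
 * page-aligned estimate is what gets uncharged later. */
size_t undercharge(size_t payload_len)
{
    assert(page_align(payload_len) >= payload_len);
    return page_align(payload_len) - payload_len;
}
```

      For a 100-byte payload the aligned estimate is 4096 bytes, so charging
      by payload length would fall 3996 bytes short in this model.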
      
      Fixes: 9b65b17d ("net: avoid double accounting for pure zerocopy skbs")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Talal Ahmad <talalahmad@google.com>
      Cc: Arjun Roy <arjunroy@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Link: https://lore.kernel.org/r/20220201065254.680532-1-eric.dumazet@gmail.com
      
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • af_packet: fix data-race in packet_setsockopt / packet_setsockopt · e42e70ad
      Eric Dumazet authored
      
      
      When packet_setsockopt( PACKET_FANOUT_DATA ) reads po->fanout,
      no lock is held, meaning that another thread can change po->fanout.
      
      Given that po->fanout can only be set once during the socket lifetime
      (it is only cleared from fanout_release()), we can use
      READ_ONCE()/WRITE_ONCE() to document the race.
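
      A userspace approximation of the annotation pattern, for illustration
      only. The two macros are simplified stand-ins for the kernel's
      READ_ONCE()/WRITE_ONCE() (they rely on the GCC/Clang __typeof__
      extension and provide no ordering guarantees), and struct psock and its
      helpers are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel macros: a volatile access makes the
 * compiler perform the load or store exactly once, without caching or
 * re-reading the value. No memory ordering is implied. */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

struct fanout {
    int id;
};

struct psock {
    struct fanout *fanout;  /* set at most once, read without a lock */
};

/* Writer: publish the pointer with a single store. */
void psock_set_fanout(struct psock *po, struct fanout *f)
{
    assert(po);
    WRITE_ONCE(po->fanout, f);
}

/* Reader: load the pointer once and use only the local copy afterwards. */
int psock_fanout_id(struct psock *po)
{
    struct fanout *f;

    assert(po);
    f = READ_ONCE(po->fanout);
    return f ? f->id : -1;
}
```

      The reader tests and dereferences the same local copy, so a concurrent
      store cannot change the pointer between the NULL check and the use.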
      
      BUG: KCSAN: data-race in packet_setsockopt / packet_setsockopt
      
      write to 0xffff88813ae8e300 of 8 bytes by task 14653 on cpu 0:
       fanout_add net/packet/af_packet.c:1791 [inline]
       packet_setsockopt+0x22fe/0x24a0 net/packet/af_packet.c:3931
       __sys_setsockopt+0x209/0x2a0 net/socket.c:2180
       __do_sys_setsockopt net/socket.c:2191 [inline]
       __se_sys_setsockopt net/socket.c:2188 [inline]
       __x64_sys_setsockopt+0x62/0x70 net/socket.c:2188
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      read to 0xffff88813ae8e300 of 8 bytes by task 14654 on cpu 1:
       packet_setsockopt+0x691/0x24a0 net/packet/af_packet.c:3935
       __sys_setsockopt+0x209/0x2a0 net/socket.c:2180
       __do_sys_setsockopt net/socket.c:2191 [inline]
       __se_sys_setsockopt net/socket.c:2188 [inline]
       __x64_sys_setsockopt+0x62/0x70 net/socket.c:2188
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      value changed: 0x0000000000000000 -> 0xffff888106f8c000
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 14654 Comm: syz-executor.3 Not tainted 5.16.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: 47dceb8e ("packet: add classic BPF fanout mode")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Link: https://lore.kernel.org/r/20220201022358.330621-1-eric.dumazet@gmail.com
      
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • rtnetlink: make sure to refresh master_dev/m_ops in __rtnl_newlink() · c6f6f244
      Eric Dumazet authored
      
      
      While looking at one unrelated syzbot bug, I found the replay logic
      in __rtnl_newlink() to potentially trigger use-after-free.
      
      It is better to clear master_dev and m_ops inside the loop,
      in case we have to replay it.
      
      Fixes: ba7d49b1 ("rtnetlink: provide api for getting and setting slave info")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Jiri Pirko <jiri@nvidia.com>
      Link: https://lore.kernel.org/r/20220201012106.216495-1-eric.dumazet@gmail.com
      
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: sched: fix use-after-free in tc_new_tfilter() · 04c2a47f
      Eric Dumazet authored
      
      
      Whenever tc_new_tfilter() jumps back to the replay: label,
      we need to make sure the @q and @chain local variables are cleared again,
      or risk a use-after-free as in the KASAN report below.
      
      For consistency, apply the same fix in tc_ctl_chain().
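
      The replay hazard can be reduced to a few lines of userspace C. This is
      a minimal sketch with hypothetical names (struct request, lookup(),
      handle()); it only assumes that a local pointer set during the first
      attempt is not reassigned on every path of the retry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct obj {
    int id;
};

/* Hypothetical request: the first attempt resolves an object and then asks
 * for a retry; by the time of the retry the object is gone. */
struct request {
    struct obj *available;  /* what lookup() can return, or NULL */
    int attempts;
};

static struct obj *lookup(struct request *r)
{
    return r->available;
}

/* Returns the pointer the final attempt ends up using. With
 * clear_on_replay == false the local survives the jump back. */
struct obj *handle(struct request *r, bool clear_on_replay)
{
    struct obj *q = NULL;
    struct obj *found;

    assert(r);
replay:
    if (clear_on_replay)
        q = NULL;               /* reset locals at the top of each attempt */

    found = lookup(r);
    if (found)
        q = found;              /* only assigned on the success path */

    if (r->attempts++ == 0) {
        r->available = NULL;    /* object goes away before the retry */
        goto replay;
    }
    return q;                   /* stale if it was not cleared */
}
```

      Without the reset, the second attempt returns the pointer left over
      from the first one even though lookup() no longer finds the object.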
      
      BUG: KASAN: use-after-free in mini_qdisc_pair_swap+0x1b9/0x1f0 net/sched/sch_generic.c:1581
      Write of size 8 at addr ffff8880985c4b08 by task syz-executor.4/1945
      
      CPU: 0 PID: 1945 Comm: syz-executor.4 Not tainted 5.17.0-rc1-syzkaller-00495-gff58831fa02d #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
       print_address_description.constprop.0.cold+0x8d/0x336 mm/kasan/report.c:255
       __kasan_report mm/kasan/report.c:442 [inline]
       kasan_report.cold+0x83/0xdf mm/kasan/report.c:459
       mini_qdisc_pair_swap+0x1b9/0x1f0 net/sched/sch_generic.c:1581
       tcf_chain_head_change_item net/sched/cls_api.c:372 [inline]
       tcf_chain0_head_change.isra.0+0xb9/0x120 net/sched/cls_api.c:386
       tcf_chain_tp_insert net/sched/cls_api.c:1657 [inline]
       tcf_chain_tp_insert_unique net/sched/cls_api.c:1707 [inline]
       tc_new_tfilter+0x1e67/0x2350 net/sched/cls_api.c:2086
       rtnetlink_rcv_msg+0x80d/0xb80 net/core/rtnetlink.c:5583
       netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494
       netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
       netlink_unicast+0x539/0x7e0 net/netlink/af_netlink.c:1343
       netlink_sendmsg+0x904/0xe00 net/netlink/af_netlink.c:1919
       sock_sendmsg_nosec net/socket.c:705 [inline]
       sock_sendmsg+0xcf/0x120 net/socket.c:725
       ____sys_sendmsg+0x331/0x810 net/socket.c:2413
       ___sys_sendmsg+0xf3/0x170 net/socket.c:2467
       __sys_sendmmsg+0x195/0x470 net/socket.c:2553
       __do_sys_sendmmsg net/socket.c:2582 [inline]
       __se_sys_sendmmsg net/socket.c:2579 [inline]
       __x64_sys_sendmmsg+0x99/0x100 net/socket.c:2579
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x7f2647172059
      Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f2645aa5168 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
      RAX: ffffffffffffffda RBX: 00007f2647285100 RCX: 00007f2647172059
      RDX: 040000000000009f RSI: 00000000200002c0 RDI: 0000000000000006
      RBP: 00007f26471cc08d R08: 0000000000000000 R09: 0000000000000000
      R10: 9e00000000000000 R11: 0000000000000246 R12: 0000000000000000
      R13: 00007fffb3f7f02f R14: 00007f2645aa5300 R15: 0000000000022000
       </TASK>
      
      Allocated by task 1944:
       kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38
       kasan_set_track mm/kasan/common.c:45 [inline]
       set_alloc_info mm/kasan/common.c:436 [inline]
       ____kasan_kmalloc mm/kasan/common.c:515 [inline]
       ____kasan_kmalloc mm/kasan/common.c:474 [inline]
       __kasan_kmalloc+0xa9/0xd0 mm/kasan/common.c:524
       kmalloc_node include/linux/slab.h:604 [inline]
       kzalloc_node include/linux/slab.h:726 [inline]
       qdisc_alloc+0xac/0xa10 net/sched/sch_generic.c:941
       qdisc_create.constprop.0+0xce/0x10f0 net/sched/sch_api.c:1211
       tc_modify_qdisc+0x4c5/0x1980 net/sched/sch_api.c:1660
       rtnetlink_rcv_msg+0x413/0xb80 net/core/rtnetlink.c:5592
       netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494
       netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
       netlink_unicast+0x539/0x7e0 net/netlink/af_netlink.c:1343
       netlink_sendmsg+0x904/0xe00 net/netlink/af_netlink.c:1919
       sock_sendmsg_nosec net/socket.c:705 [inline]
       sock_sendmsg+0xcf/0x120 net/socket.c:725
       ____sys_sendmsg+0x331/0x810 net/socket.c:2413
       ___sys_sendmsg+0xf3/0x170 net/socket.c:2467
       __sys_sendmmsg+0x195/0x470 net/socket.c:2553
       __do_sys_sendmmsg net/socket.c:2582 [inline]
       __se_sys_sendmmsg net/socket.c:2579 [inline]
       __x64_sys_sendmmsg+0x99/0x100 net/socket.c:2579
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Freed by task 3609:
       kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38
       kasan_set_track+0x21/0x30 mm/kasan/common.c:45
       kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370
       ____kasan_slab_free mm/kasan/common.c:366 [inline]
       ____kasan_slab_free+0x130/0x160 mm/kasan/common.c:328
       kasan_slab_free include/linux/kasan.h:236 [inline]
       slab_free_hook mm/slub.c:1728 [inline]
       slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1754
       slab_free mm/slub.c:3509 [inline]
       kfree+0xcb/0x280 mm/slub.c:4562
       rcu_do_batch kernel/rcu/tree.c:2527 [inline]
       rcu_core+0x7b8/0x1540 kernel/rcu/tree.c:2778
       __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
      
      Last potentially related work creation:
       kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38
       __kasan_record_aux_stack+0xbe/0xd0 mm/kasan/generic.c:348
       __call_rcu kernel/rcu/tree.c:3026 [inline]
       call_rcu+0xb1/0x740 kernel/rcu/tree.c:3106
       qdisc_put_unlocked+0x6f/0x90 net/sched/sch_generic.c:1109
       tcf_block_release+0x86/0x90 net/sched/cls_api.c:1238
       tc_new_tfilter+0xc0d/0x2350 net/sched/cls_api.c:2148
       rtnetlink_rcv_msg+0x80d/0xb80 net/core/rtnetlink.c:5583
       netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494
       netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
       netlink_unicast+0x539/0x7e0 net/netlink/af_netlink.c:1343
       netlink_sendmsg+0x904/0xe00 net/netlink/af_netlink.c:1919
       sock_sendmsg_nosec net/socket.c:705 [inline]
       sock_sendmsg+0xcf/0x120 net/socket.c:725
       ____sys_sendmsg+0x331/0x810 net/socket.c:2413
       ___sys_sendmsg+0xf3/0x170 net/socket.c:2467
       __sys_sendmmsg+0x195/0x470 net/socket.c:2553
       __do_sys_sendmmsg net/socket.c:2582 [inline]
       __se_sys_sendmmsg net/socket.c:2579 [inline]
       __x64_sys_sendmmsg+0x99/0x100 net/socket.c:2579
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      The buggy address belongs to the object at ffff8880985c4800
       which belongs to the cache kmalloc-1k of size 1024
      The buggy address is located 776 bytes inside of
       1024-byte region [ffff8880985c4800, ffff8880985c4c00)
      The buggy address belongs to the page:
      page:ffffea0002617000 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x985c0
      head:ffffea0002617000 order:3 compound_mapcount:0 compound_pincount:0
      flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff)
      raw: 00fff00000010200 0000000000000000 dead000000000122 ffff888010c41dc0
      raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      page_owner tracks the page as allocated
      page last allocated via order 3, migratetype Unmovable, gfp_mask 0x1d20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL), pid 1941, ts 1038999441284, free_ts 1033444432829
       prep_new_page mm/page_alloc.c:2434 [inline]
       get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4165
       __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5389
       alloc_pages+0x1aa/0x310 mm/mempolicy.c:2271
       alloc_slab_page mm/slub.c:1799 [inline]
       allocate_slab mm/slub.c:1944 [inline]
       new_slab+0x28a/0x3b0 mm/slub.c:2004
       ___slab_alloc+0x87c/0xe90 mm/slub.c:3018
       __slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3105
       slab_alloc_node mm/slub.c:3196 [inline]
       slab_alloc mm/slub.c:3238 [inline]
       __kmalloc+0x2fb/0x340 mm/slub.c:4420
       kmalloc include/linux/slab.h:586 [inline]
       kzalloc include/linux/slab.h:715 [inline]
       __register_sysctl_table+0x112/0x1090 fs/proc/proc_sysctl.c:1335
       neigh_sysctl_register+0x2c8/0x5e0 net/core/neighbour.c:3787
       devinet_sysctl_register+0xb1/0x230 net/ipv4/devinet.c:2618
       inetdev_init+0x286/0x580 net/ipv4/devinet.c:278
       inetdev_event+0xa8a/0x15d0 net/ipv4/devinet.c:1532
       notifier_call_chain+0xb5/0x200 kernel/notifier.c:84
       call_netdevice_notifiers_info+0xb5/0x130 net/core/dev.c:1919
       call_netdevice_notifiers_extack net/core/dev.c:1931 [inline]
       call_netdevice_notifiers net/core/dev.c:1945 [inline]
       register_netdevice+0x1073/0x1500 net/core/dev.c:9698
       veth_newlink+0x59c/0xa90 drivers/net/veth.c:1722
      page last free stack trace:
       reset_page_owner include/linux/page_owner.h:24 [inline]
       free_pages_prepare mm/page_alloc.c:1352 [inline]
       free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1404
       free_unref_page_prepare mm/page_alloc.c:3325 [inline]
       free_unref_page+0x19/0x690 mm/page_alloc.c:3404
       release_pages+0x748/0x1220 mm/swap.c:956
       tlb_batch_pages_flush mm/mmu_gather.c:50 [inline]
       tlb_flush_mmu_free mm/mmu_gather.c:243 [inline]
       tlb_flush_mmu+0xe9/0x6b0 mm/mmu_gather.c:250
       zap_pte_range mm/memory.c:1441 [inline]
       zap_pmd_range mm/memory.c:1490 [inline]
       zap_pud_range mm/memory.c:1519 [inline]
       zap_p4d_range mm/memory.c:1540 [inline]
       unmap_page_range+0x1d1d/0x2a30 mm/memory.c:1561
       unmap_single_vma+0x198/0x310 mm/memory.c:1606
       unmap_vmas+0x16b/0x2f0 mm/memory.c:1638
       exit_mmap+0x201/0x670 mm/mmap.c:3178
       __mmput+0x122/0x4b0 kernel/fork.c:1114
       mmput+0x56/0x60 kernel/fork.c:1135
       exit_mm kernel/exit.c:507 [inline]
       do_exit+0xa3c/0x2a30 kernel/exit.c:793
       do_group_exit+0xd2/0x2f0 kernel/exit.c:935
       __do_sys_exit_group kernel/exit.c:946 [inline]
       __se_sys_exit_group kernel/exit.c:944 [inline]
       __x64_sys_exit_group+0x3a/0x50 kernel/exit.c:944
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Memory state around the buggy address:
       ffff8880985c4a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff8880985c4a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      >ffff8880985c4b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                            ^
       ffff8880985c4b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff8880985c4c00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      
      Fixes: 470502de ("net: sched: unlock rules update API")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Vlad Buslov <vladbu@mellanox.com>
      Cc: Jiri Pirko <jiri@mellanox.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Link: https://lore.kernel.org/r/20220131172018.3704490-1-eric.dumazet@gmail.com
      
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  4. Jan 31, 2022
    • net/smc: Forward wakeup to smc socket waitqueue after fallback · 341adeec
      Wen Gu authored
      
      
      When we replace TCP with SMC and a fallback occurs, there may be
      some socket waitqueue entries remaining in smc socket->wq, such
      as eppoll_entries inserted by userspace applications.
      
      After the fallback, data flows over TCP/IP and only clcsock->wq
      will be woken up. Applications can't be notified by the entries
      which were inserted in smc socket->wq before fallback. So we need
      a mechanism to wake up smc socket->wq at the same time if some
      entries remain in it.
      
      The current workaround is to transfer the entries from smc socket->wq
      to clcsock->wq during the fallback. But this may cause a crash
      like this:
      
       general protection fault, probably for non-canonical address 0xdead000000000100: 0000 [#1] PREEMPT SMP PTI
       CPU: 3 PID: 0 Comm: swapper/3 Kdump: loaded Tainted: G E     5.16.0+ #107
       RIP: 0010:__wake_up_common+0x65/0x170
       Call Trace:
        <IRQ>
        __wake_up_common_lock+0x7a/0xc0
        sock_def_readable+0x3c/0x70
        tcp_data_queue+0x4a7/0xc40
        tcp_rcv_established+0x32f/0x660
        ? sk_filter_trim_cap+0xcb/0x2e0
        tcp_v4_do_rcv+0x10b/0x260
        tcp_v4_rcv+0xd2a/0xde0
        ip_protocol_deliver_rcu+0x3b/0x1d0
        ip_local_deliver_finish+0x54/0x60
        ip_local_deliver+0x6a/0x110
        ? tcp_v4_early_demux+0xa2/0x140
        ? tcp_v4_early_demux+0x10d/0x140
        ip_sublist_rcv_finish+0x49/0x60
        ip_sublist_rcv+0x19d/0x230
        ip_list_rcv+0x13e/0x170
        __netif_receive_skb_list_core+0x1c2/0x240
        netif_receive_skb_list_internal+0x1e6/0x320
        napi_complete_done+0x11d/0x190
        mlx5e_napi_poll+0x163/0x6b0 [mlx5_core]
        __napi_poll+0x3c/0x1b0
        net_rx_action+0x27c/0x300
        __do_softirq+0x114/0x2d2
        irq_exit_rcu+0xb4/0xe0
        common_interrupt+0xba/0xe0
        </IRQ>
        <TASK>
      
      The crash is caused by privately transferring waitqueue entries from
      smc socket->wq to clcsock->wq. The owners of these entries, such as
      epoll, have no idea that the entries have been transferred to a
      different socket wait queue and still use the original waitqueue spinlock
      (smc socket->wq.wait.lock) to serialize operations on the entries,
      which no longer protects them. Operations on the entries, such as
      removal from the waitqueue (now clcsock->wq after fallback), may cause
      a crash if the clcsock waitqueue is being iterated over at that moment.
      
      This patch fixes this by no longer transferring wait queue entries
      privately, and instead introducing its own implementations of clcsock's
      callback functions for the fallback situation. The callback functions
      forward the wakeup to smc socket->wq if clcsock->wq is actually woken
      up and smc socket->wq has remaining entries.
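
      Forwarding a wakeup through a replaced callback, instead of moving
      waiters between queues, can be sketched as follows. This is a simplified
      model with hypothetical names; in particular it forwards whenever the
      upper queue has waiters and does not check whether the lower queue was
      actually woken.

```c
#include <assert.h>
#include <stddef.h>

struct waitqueue {
    int waiters;    /* entries registered on this queue */
    int wakeups;    /* times the waiters were notified */
};

static void wake_up_queue(struct waitqueue *wq)
{
    if (wq->waiters)
        wq->wakeups++;
}

struct sock;
typedef void (*ready_fn)(struct sock *sk);

struct sock {
    struct waitqueue wq;
    ready_fn data_ready;        /* invoked when data arrives */
    ready_fn saved_data_ready;  /* original callback, kept for chaining */
    struct sock *upper;         /* socket whose waiters must also be woken */
};

/* Default behaviour: wake only the socket's own queue. */
void default_data_ready(struct sock *sk)
{
    wake_up_queue(&sk->wq);
}

/* Fallback behaviour: run the original callback, then forward the wakeup
 * to the upper socket's queue. Waiters stay on the queue they joined. */
void forwarding_data_ready(struct sock *sk)
{
    assert(sk->saved_data_ready);
    sk->saved_data_ready(sk);
    if (sk->upper && sk->upper->wq.waiters)
        wake_up_queue(&sk->upper->wq);
}

/* Install the forwarding callback on the lower socket. */
void switch_to_fallback(struct sock *upper, struct sock *lower)
{
    lower->upper = upper;
    lower->saved_data_ready = lower->data_ready;
    lower->data_ready = forwarding_data_ready;
}
```

      Because no entry ever changes queues in this model, each waiter keeps
      using the lock of the queue it was originally added to.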
      
      Fixes: 2153bd1e ("net/smc: Transfer remaining wait queue entries during fallback")
      Suggested-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Acked-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. Jan 28, 2022
    • ax25: add refcount in ax25_dev to avoid UAF bugs · d01ffb9e
      Duoming Zhou authored
      
      
      If we dereference ax25_dev after kfree(ax25_dev) has been called in
      ax25_dev_device_down(), concurrent UAF bugs result.
      Eight syscall functions suffer from these UAF bugs, including
      ax25_bind(), ax25_release(), ax25_connect(), ax25_ioctl(),
      ax25_getname(), ax25_sendmsg(), ax25_getsockopt() and
      ax25_info_show().
      
      One of the concurrent UAFs is shown below:
      
        (USE)                       |    (FREE)
                                    |  ax25_device_event
                                    |    ax25_dev_device_down
      ax25_bind                     |    ...
        ...                         |      kfree(ax25_dev)
        ax25_fillin_cb()            |    ...
          ax25_fillin_cb_from_dev() |
        ...                         |
      
      The root cause of the UAF bugs is that kfree(ax25_dev) in
      ax25_dev_device_down() is not protected by any lock.
      When ax25_dev is released while pointers to it still exist,
      the concurrent UAF bug occurs.
      
      This patch introduces a refcount into ax25_dev in order to
      guarantee that no pointers to it remain when ax25_dev
      is released.
      
      Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ax25: improve the incomplete fix to avoid UAF and NPD bugs · 4e0f718d
      Duoming Zhou authored
      
      
      The previous commit 1ade48d0 ("ax25: NPD bug when detaching
      AX25 device") introduced lock_sock() into ax25_kill_by_device() to
      prevent an NPD bug. But a concurrent NPD or UAF bug can still occur
      when lock_sock() or release_sock() dereferences ax25_cb->sock.
      
      The NULL pointer dereference bug is shown below:
      
      ax25_kill_by_device()        | ax25_release()
                                   |   ax25_destroy_socket()
                                   |     ax25_cb_del()
        ...                        |     ...
                                   |     ax25->sk=NULL;
        lock_sock(s->sk); //(1)    |
        s->ax25_dev = NULL;        |     ...
        release_sock(s->sk); //(2) |
        ...                        |
      
      The root cause is that the sock is set to NULL before dereference
      site (1) or (2). Therefore, this patch fetches ax25_cb->sock in
      advance under ax25_list_lock, which synchronizes with ax25_cb_del()
      and ensures that the sock is not NULL at the dereference sites.
      
      The concurrency UAF bug is shown below:
      
      ax25_kill_by_device()        | ax25_release()
                                   |   ax25_destroy_socket()
        ...                        |   ...
                                   |   sock_put(sk); //FREE
        lock_sock(s->sk); //(1)    |
        s->ax25_dev = NULL;        |   ...
        release_sock(s->sk); //(2) |
        ...                        |
      
      The root cause is that the sock is released before dereference
      site (1) or (2). Therefore, this patch uses sock_hold() to increase
      the refcount of the sock and uses ax25_list_lock to protect it, which
      synchronizes with ax25_cb_del() in ax25_destroy_socket() and
      ensures the sock will not be released before the dereference sites.
      
      Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4e0f718d
  6. Jan 27, 2022
  7. Jan 26, 2022
  8. Jan 24, 2022
    • Amir Goldstein's avatar
      fsnotify: fix fsnotify hooks in pseudo filesystems · 29044dae
      Amir Goldstein authored
      Commit 49246466 ("fsnotify: move fsnotify_nameremove() hook out of
      d_delete()") moved the fsnotify delete hook before d_delete() so fsnotify
      will have access to a positive dentry.
      
      This allowed a race where opening the deleted file via cached dentry
      is now possible after receiving the IN_DELETE event.
      
      To fix the regression in pseudo filesystems, convert d_delete() calls
      to d_drop() (see commit 46c46f8d ("devpts_pty_kill(): don't bother
      with d_delete()")) and move the fsnotify hook after d_drop().
      
      Add a missing fsnotify_unlink() hook in nfsdfs that was found during
      the audit of fsnotify hooks in pseudo filesystems.
      
      Note that the fsnotify hooks in simple_recursive_removal() follow
      d_invalidate(), so they require no change.
      
      Link: https://lore.kernel.org/r/20220120215305.282577-2-amir73il@gmail.com
      
      
      Reported-by: Ivan Delalande <colona@arista.com>
      Link: https://lore.kernel.org/linux-fsdevel/YeNyzoDM5hP5LtGW@visor/
      
      
      Fixes: 49246466 ("fsnotify: move fsnotify_nameremove() hook out of d_delete()")
      Cc: stable@vger.kernel.org # v5.3+
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      29044dae
    • Xin Long's avatar
      ping: fix the sk_bound_dev_if match in ping_lookup · 2afc3b5a
      Xin Long authored
      
      
      When 'ping' is switched to use a PING socket instead of a RAW socket by:
      
         # sysctl -w net.ipv4.ping_group_range="0 100"
      
      the selftest 'router_broadcast.sh' fails, because a command such as
      
        # ip vrf exec vrf-h1 ping -I veth0 198.51.100.255 -b
      
      cannot receive the response skb on the PING socket. This is caused by a
      mismatch between sk_bound_dev_if and dif in ping_rcv() when looking up the
      PING socket, as dif is vrf-h1 if dif's master was set to vrf-h1.
      
      This patch fixes the regression by also checking sk_bound_dev_if
      against sdif, so that packets can still be received even if the socket
      is bound to the real iif rather than to the vrf device.
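
      The matching rule described above can be sketched as a small
      predicate. This is a simplified model, not the kernel function: the
      name bound_dev_matches() and the ifindex values are hypothetical, and
      the convention that 0 means "not bound to any device" is an
      assumption stated here for the sketch.

```python
def bound_dev_matches(sk_bound_dev_if, dif, sdif):
    """Return True if a socket's device binding accepts a packet.

    sk_bound_dev_if: ifindex the socket is bound to (0 = unbound, assumed)
    dif:  ifindex seen as the incoming device (the VRF master, if enslaved)
    sdif: ifindex of the real ingress interface below the VRF
    """
    if sk_bound_dev_if == 0:
        return True
    # Accept a binding to either the VRF device or the real interface.
    return sk_bound_dev_if in (dif, sdif)


VRF_H1 = 10   # hypothetical ifindex of vrf-h1
VETH0 = 5     # hypothetical ifindex of veth0, enslaved to vrf-h1

# Socket bound to veth0 ("ping -I veth0"), reply arrives with dif = vrf-h1.
old_rule = (VETH0 == VRF_H1)                       # comparing against dif only
new_rule = bound_dev_matches(VETH0, VRF_H1, VETH0)
```

      Comparing against dif alone rejects the socket bound to veth0, while
      the additional comparison against sdif accepts it.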
      
      Fixes: c319b4d7 ("net: ipv4: add IPPROTO_ICMP socket kind")
      Reported-by: Hangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2afc3b5a
    • Wen Gu's avatar
      net/smc: Transitional solution for clcsock race issue · c0bf3d8a
      Wen Gu authored
      We encountered a crash in smc_setsockopt() and it is caused by
      accessing smc->clcsock after clcsock was released.
      
       BUG: kernel NULL pointer dereference, address: 0000000000000020
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 0 P4D 0
       Oops: 0000 [#1] PREEMPT SMP PTI
       CPU: 1 PID: 50309 Comm: nginx Kdump: loaded Tainted: G E     5.16.0-rc4+ #53
       RIP: 0010:smc_setsockopt+0x59/0x280 [smc]
       Call Trace:
        <TASK>
        __sys_setsockopt+0xfc/0x190
        __x64_sys_setsockopt+0x20/0x30
        do_syscall_64+0x34/0x90
        entry_SYSCALL_64_after_hwframe+0x44/0xae
       RIP: 0033:0x7f16ba83918e
        </TASK>
      
      This patch tries to fix it by holding clcsock_release_lock and
      checking whether clcsock has already been released before access.
      
      In case a crash with the same cause happens in smc_getsockopt()
      or smc_switch_to_fallback(), this patch also checks smc->clcsock
      in them. The caller of smc_switch_to_fallback() determines
      whether the fallback succeeded from the return value.
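
      The "take the lock, then check before use" pattern can be modeled in
      userspace as follows. This is a sketch under assumptions, not the SMC
      code: the class SmcSock, the dict standing in for clcsock, and the
      return value -1 (a stand-in for an error code) are all hypothetical.

```python
import threading


class SmcSock:
    """Toy model of a socket that owns an inner 'clcsock' object."""

    def __init__(self):
        self.clcsock_release_lock = threading.Lock()
        self.clcsock = {"opts": {}}   # stand-in for the inner socket

    def release(self):
        # Releasing the inner socket happens under the same lock.
        with self.clcsock_release_lock:
            self.clcsock = None

    def setsockopt(self, name, value):
        # Hold the release lock and re-check before touching clcsock.
        with self.clcsock_release_lock:
            if self.clcsock is None:
                return -1             # stand-in error: already released
            self.clcsock["opts"][name] = value
            return 0


s = SmcSock()
before_release = s.setsockopt("nodelay", 1)
s.release()
after_release = s.setsockopt("nodelay", 0)
```

      Because release and access serialize on one lock, the accessor either
      sees a valid inner socket or observes that it is gone and bails out.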
      
      Fixes: fd57770d ("net/smc: wait for pending work before clcsock release_sock")
      Link: https://lore.kernel.org/lkml/5dd7ffd1-28e2-24cc-9442-1defec27375e@linux.ibm.com/T/
      
      
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Acked-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c0bf3d8a
    • Jakub Kicinski's avatar
      ipv4: fix ip option filtering for locally generated fragments · 27a8caa5
      Jakub Kicinski authored
      During IP fragmentation we sanitize IP options. This means overwriting
      options which should not be copied with NOPs. Only the first fragment
      has the original, full options.
      
      ip_fraglist_prepare() copies the IP header and options from previous
      fragment to the next one. Commit 19c3401a ("net: ipv4: place control
      buffer handling away from fragmentation iterators") moved sanitizing
      options before ip_fraglist_prepare() which means options are sanitized
      and then overwritten again with the old values.
      
      Fixing this is not enough, however, nor did the sanitization work
      prior to aforementioned commit.
      
      ip_options_fragment() (which does the sanitization) uses ipcb->opt.optlen
      for the length of the options. ipcb->opt of fragments is not populated
      (it's 0), only the head skb has the state properly built. So even when
      called at the right time ip_options_fragment() does nothing. This seems
      to date back all the way to v2.5.44 when the fast path for pre-fragmented
      skbs had been introduced. Prior to that ip_options_build() would have been
      called for every fragment (in fact ever since v2.5.44 the fragmentation
      handling in ip_options_build() has been dead code, I'll clean it up in
      -next).
      
      In the original patch (see Link) caixf mentions fixing the handling
      for fragments other than the second one, but I'm not sure how _any_
      fragment could have had their options sanitized with the code
      as it stood.
      
      Tested with python (MTU on lo lowered to 1000 to force fragmentation):
      
        import socket
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.setsockopt(socket.IPPROTO_IP, socket.IP_OPTIONS,
                     bytearray([7,4,5,192, 20|0x80,4,1,0]))
        s.sendto(b'1'*2000, ('127.0.0.1', 1234))
      
      Before:
      
      IP (tos 0x0, ttl 64, id 1053, offset 0, flags [+], proto UDP (17), length 996, options (RR [bad length 4] [bad ptr 5] 192.148.4.1,,RA value 256))
          localhost.36500 > localhost.search-agent: UDP, length 2000
      IP (tos 0x0, ttl 64, id 1053, offset 968, flags [+], proto UDP (17), length 996, options (RR [bad length 4] [bad ptr 5] 192.148.4.1,,RA value 256))
          localhost > localhost: udp
      IP (tos 0x0, ttl 64, id 1053, offset 1936, flags [none], proto UDP (17), length 100, options (RR [bad length 4] [bad ptr 5] 192.148.4.1,,RA value 256))
          localhost > localhost: udp
      
      After:
      
      IP (tos 0x0, ttl 96, id 42549, offset 0, flags [+], proto UDP (17), length 996, options (RR [bad length 4] [bad ptr 5] 192.148.4.1,,RA value 256))
          localhost.51607 > localhost.search-agent: UDP, bad length 2000 > 960
      IP (tos 0x0, ttl 96, id 42549, offset 968, flags [+], proto UDP (17), length 996, options (NOP,NOP,NOP,NOP,RA value 256))
          localhost > localhost: udp
      IP (tos 0x0, ttl 96, id 42549, offset 1936, flags [none], proto UDP (17), length 100, options (NOP,NOP,NOP,NOP,RA value 256))
          localhost > localhost: udp
      
      RA (20 | 0x80) is now copied as expected, RR (7) is "NOPed out".
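
      The sanitization rule can be reproduced in userspace on the same
      option bytes as the test above. This is a simplified sketch, not the
      kernel's ip_options_fragment(): the function name
      sanitize_for_fragment() is hypothetical, and malformed options simply
      stop the walk.

```python
IPOPT_END = 0       # end-of-options marker
IPOPT_NOOP = 1      # single-byte no-operation option
IPOPT_COPY = 0x80   # "copied" flag: option must appear in every fragment


def sanitize_for_fragment(options):
    """Overwrite options lacking the copied flag with NOPs."""
    out = bytearray(options)
    i = 0
    while i < len(out):
        opt = out[i]
        if opt == IPOPT_END:
            break
        if opt == IPOPT_NOOP:
            i += 1
            continue
        if i + 1 >= len(out):
            break                       # truncated option, stop
        optlen = out[i + 1]
        if optlen < 2 or i + optlen > len(out):
            break                       # malformed length, stop
        if not (opt & IPOPT_COPY):
            out[i:i + optlen] = bytes([IPOPT_NOOP]) * optlen
        i += optlen
    return bytes(out)


opts = bytes([7, 4, 5, 192, 20 | 0x80, 4, 1, 0])   # RR, then RA
sanitized = sanitize_for_fragment(opts)
```

      For these bytes the four RR bytes become NOPs and the RA option is
      left intact, matching the "After" output for non-first fragments.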
      
      Link: https://lore.kernel.org/netdev/20220107080559.122713-1-ooppublic@163.com/
      
      
      Fixes: 19c3401a ("net: ipv4: place control buffer handling away from fragmentation iterators")
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: caixf <ooppublic@163.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      27a8caa5
    • Jianguo Wu's avatar
      net-procfs: show net devices bound packet types · 1d10f8a1
      Jianguo Wu authored
      After commit 7866a621043f ("dev: add per net_device packet type chains"),
      packet types that are bound to a specific net device no longer appear in
      /proc/net/ptype. This patch fixes the regression.
      
      Run "tcpdump -i ens192 udp -nns0" and compare before and after applying this patch:
      
      Before:
        [root@localhost ~]# cat /proc/net/ptype
        Type Device      Function
        0800          ip_rcv
        0806          arp_rcv
        86dd          ipv6_rcv
      
      After:
        [root@localhost ~]# cat /proc/net/ptype
        Type Device      Function
        ALL  ens192   tpacket_rcv
        0800          ip_rcv
        0806          arp_rcv
        86dd          ipv6_rcv
      
      v1 -> v2:
        - fix the regression rather than adding new /proc API as
          suggested by Stephen Hemminger.
      
      Fixes: 7866a621043f ("dev: add per net_device packet type chains")
      Signed-off-by: Jianguo Wu <wujianguo@chinatelecom.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1d10f8a1
  9. Jan 22, 2022
  10. Jan 21, 2022
    • Geliang Tang's avatar
      mptcp: fix removing ids bitmap setting · a4c0214f
      Geliang Tang authored
      
      
      In mptcp_pm_nl_rm_addr_or_subflow(), the bit of rm_list->ids[i] in the
      id_avail_bitmap should be set, not rm_list->ids[1]. This patch fixes it.
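
      The effect of the index typo can be seen in a small model that uses
      an integer as the bitmap. This is an illustrative sketch only: the
      function names and the example ids are hypothetical, not taken from
      the MPTCP code.

```python
def release_ids(id_avail_bitmap, ids):
    """Mark every id in ids as available again (correct indexing)."""
    for i in range(len(ids)):
        id_avail_bitmap |= 1 << ids[i]
    return id_avail_bitmap


def release_ids_buggy(id_avail_bitmap, ids):
    """Same loop with the typo: always indexes element 1."""
    for i in range(len(ids)):
        id_avail_bitmap |= 1 << ids[1]
    return id_avail_bitmap


ids = [2, 5, 7]
fixed = release_ids(0, ids)
buggy = release_ids_buggy(0, ids)
```

      With the typo only the bit for ids[1] is ever set, so the other
      removed ids are never marked available again.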
      
      Fixes: 86e39e04 ("mptcp: keep track of local endpoint still available for each msk")
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Geliang Tang <geliang.tang@suse.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      a4c0214f
    • Paolo Abeni's avatar
      mptcp: fix msk traversal in mptcp_nl_cmd_set_flags() · 8e9eacad
      Paolo Abeni authored
      
      
      The MPTCP endpoint list is under RCU protection, guarded by the
      pernet spinlock. mptcp_nl_cmd_set_flags() traverses the list
      without acquiring the spinlock or entering an RCU critical section.
      
      This change addresses the issue by performing the lookup and the
      endpoint update under the pernet spinlock.
      
      Fixes: 0f9f696a ("mptcp: add set_flags command in PM netlink")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      8e9eacad
    • Eric Dumazet's avatar
      ipv6: annotate accesses to fn->fn_sernum · aafc2e32
      Eric Dumazet authored
      
      
      struct fib6_node's fn_sernum field can be
      read while other threads change it.
      
      Add READ_ONCE()/WRITE_ONCE() annotations.
      
      Do not change existing smp barriers in fib6_get_cookie_safe()
      and __fib6_update_sernum_upto_root().
      
      syzbot reported:
      
      BUG: KCSAN: data-race in fib6_clean_node / inet6_csk_route_socket
      
      write to 0xffff88813df62e2c of 4 bytes by task 1920 on cpu 1:
       fib6_clean_node+0xc2/0x260 net/ipv6/ip6_fib.c:2178
       fib6_walk_continue+0x38e/0x430 net/ipv6/ip6_fib.c:2112
       fib6_walk net/ipv6/ip6_fib.c:2160 [inline]
       fib6_clean_tree net/ipv6/ip6_fib.c:2240 [inline]
       __fib6_clean_all+0x1a9/0x2e0 net/ipv6/ip6_fib.c:2256
       fib6_flush_trees+0x6c/0x80 net/ipv6/ip6_fib.c:2281
       rt_genid_bump_ipv6 include/net/net_namespace.h:488 [inline]
       addrconf_dad_completed+0x57f/0x870 net/ipv6/addrconf.c:4230
       addrconf_dad_work+0x908/0x1170
       process_one_work+0x3f6/0x960 kernel/workqueue.c:2307
       worker_thread+0x616/0xa70 kernel/workqueue.c:2454
       kthread+0x1bf/0x1e0 kernel/kthread.c:359
       ret_from_fork+0x1f/0x30
      
      read to 0xffff88813df62e2c of 4 bytes by task 15701 on cpu 0:
       fib6_get_cookie_safe include/net/ip6_fib.h:285 [inline]
       rt6_get_cookie include/net/ip6_fib.h:306 [inline]
       ip6_dst_store include/net/ip6_route.h:234 [inline]
       inet6_csk_route_socket+0x352/0x3c0 net/ipv6/inet6_connection_sock.c:109
       inet6_csk_xmit+0x91/0x1e0 net/ipv6/inet6_connection_sock.c:121
       __tcp_transmit_skb+0x1323/0x1840 net/ipv4/tcp_output.c:1402
       tcp_transmit_skb net/ipv4/tcp_output.c:1420 [inline]
       tcp_write_xmit+0x1450/0x4460 net/ipv4/tcp_output.c:2680
       __tcp_push_pending_frames+0x68/0x1c0 net/ipv4/tcp_output.c:2864
       tcp_push+0x2d9/0x2f0 net/ipv4/tcp.c:725
       mptcp_push_release net/mptcp/protocol.c:1491 [inline]
       __mptcp_push_pending+0x46c/0x490 net/mptcp/protocol.c:1578
       mptcp_sendmsg+0x9ec/0xa50 net/mptcp/protocol.c:1764
       inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:643
       sock_sendmsg_nosec net/socket.c:705 [inline]
       sock_sendmsg net/socket.c:725 [inline]
       kernel_sendmsg+0x97/0xd0 net/socket.c:745
       sock_no_sendpage+0x84/0xb0 net/core/sock.c:3086
       inet_sendpage+0x9d/0xc0 net/ipv4/af_inet.c:834
       kernel_sendpage+0x187/0x200 net/socket.c:3492
       sock_sendpage+0x5a/0x70 net/socket.c:1007
       pipe_to_sendpage+0x128/0x160 fs/splice.c:364
       splice_from_pipe_feed fs/splice.c:418 [inline]
       __splice_from_pipe+0x207/0x500 fs/splice.c:562
       splice_from_pipe fs/splice.c:597 [inline]
       generic_splice_sendpage+0x94/0xd0 fs/splice.c:746
       do_splice_from fs/splice.c:767 [inline]
       direct_splice_actor+0x80/0xa0 fs/splice.c:936
       splice_direct_to_actor+0x345/0x650 fs/splice.c:891
       do_splice_direct+0x106/0x190 fs/splice.c:979
       do_sendfile+0x675/0xc40 fs/read_write.c:1245
       __do_sys_sendfile64 fs/read_write.c:1310 [inline]
       __se_sys_sendfile64 fs/read_write.c:1296 [inline]
       __x64_sys_sendfile64+0x102/0x140 fs/read_write.c:1296
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      value changed: 0x0000026f -> 0x00000271
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 15701 Comm: syz-executor.2 Not tainted 5.16.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      The Fixes tag I chose is probably arbitrary; I do not think
      we need to backport this patch to older kernels.
      
      Fixes: c5cff856 ("ipv6: add rcu grace period before freeing fib6_node")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Link: https://lore.kernel.org/r/20220120174112.1126644-1-eric.dumazet@gmail.com
      
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      aafc2e32
    • Eric Dumazet's avatar
      tcp: add a missing sk_defer_free_flush() in tcp_splice_read() · ebdc1a03
      Eric Dumazet authored
      
      
      Without it, splice users can hit the warning
      added in commit 79074a72 ("net: Flush deferred skb free on socket destroy").
      
      Fixes: f35f8219 ("tcp: defer skb freeing after socket lock is released")
      Fixes: 79074a72 ("net: Flush deferred skb free on socket destroy")
      Suggested-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Gal Pressman <gal@nvidia.com>
      Link: https://lore.kernel.org/r/20220120124530.925607-1-eric.dumazet@gmail.com
      
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      ebdc1a03