  1. Jan 12, 2021
    • net: dcb: Accept RTM_GETDCB messages carrying set-like DCB commands · df85bc14
      Petr Machata authored
      
      
      In commit 826f328e ("net: dcb: Validate netlink message in DCB
      handler"), Linux started rejecting RTM_GETDCB netlink messages if they
      contained a set-like DCB_CMD_ command.
      
      The reason was that privileges were only verified for RTM_SETDCB messages,
      but the value that determined the action to be taken is the command, not
      the message type. And validation of message type against the DCB command
      was the obvious missing piece.
      
      Unfortunately it turns out that mlnx_qos, a somewhat widely deployed tool
      for configuration of DCB, accesses the DCB set-like APIs through
      RTM_GETDCB.
      
      Therefore do not bounce the discrepancy between message type and command.
      Instead, in addition to validating privileges based on the actual message
      type, validate them also based on the expected message type. This closes
      the loophole of allowing DCB configuration on non-admin accounts, while
      maintaining backward compatibility.
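The double privilege check described above can be sketched as follows. This is only an illustration: the enums, `expected_type_for()` and `dcb_request_allowed()` are hypothetical names, not the kernel's actual code.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the two netlink message types. */
enum msg_type { MSG_GETDCB, MSG_SETDCB };

/* Hypothetical stand-ins for DCB commands; only the get/set split matters. */
enum dcb_cmd { CMD_GET_STATE, CMD_SET_STATE };

/* Message type a well-behaved client would use for this command. */
enum msg_type expected_type_for(enum dcb_cmd cmd)
{
    return cmd == CMD_SET_STATE ? MSG_SETDCB : MSG_GETDCB;
}

/* Require admin rights if either the actual or the expected message type is
 * a "set" message. A type/command mismatch is tolerated, not rejected. */
bool dcb_request_allowed(enum msg_type actual, enum dcb_cmd cmd, bool is_admin)
{
    bool needs_admin = actual == MSG_SETDCB ||
                       expected_type_for(cmd) == MSG_SETDCB;
    return !needs_admin || is_admin;
}
```

With this shape, a set-like command sent through the "get" message type still works for an admin caller but is refused for an unprivileged one.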
      
      Fixes: 2f90b865 ("ixgbe: this patch adds support for DCB to the kernel and ixgbe driver")
      Fixes: 826f328e ("net: dcb: Validate netlink message in DCB handler")
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Link: https://lore.kernel.org/r/a3edcfda0825f2aa2591801c5232f2bbf2d8a554.1610384801.git.me@pmachata.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • esp: avoid unneeded kmap_atomic call · 9bd6b629
      Willem de Bruijn authored
      
      
      esp(6)_output_head uses skb_page_frag_refill to allocate a buffer for
      the esp trailer.
      
      It accesses the page with kmap_atomic to handle highmem. But
      skb_page_frag_refill can return compound pages, of which
      kmap_atomic only maps the first underlying page.
      
      skb_page_frag_refill does not return highmem, because flag
      __GFP_HIGHMEM is not set. ESP uses it in the same manner as TCP.
      That also does not call kmap_atomic, but directly uses page_address,
      in skb_copy_to_page_nocache. Do the same for ESP.
      
      This issue has become easier to trigger with recent kmap local
      debugging feature CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP.
      
      Fixes: cac2661c ("esp4: Avoid skb_cow_data whenever possible")
      Fixes: 03e2a30f ("esp6: Avoid skb_cow_data whenever possible")
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: compound page support in skb_seq_read · 97550f6f
      Willem de Bruijn authored
      
      
      skb_seq_read iterates over an skb, returning pointer and length of
      the next data range with each call.
      
      It relies on kmap_atomic to access highmem pages when needed.
      
      An skb frag may be backed by a compound page, but kmap_atomic maps
      only a single page. There are not enough kmap slots to always map all
      pages concurrently.
      
      Instead, if kmap_atomic is needed, iterate over each page.
      
      As this increases the number of calls, avoid this unless needed.
      The necessary condition is captured in skb_frag_must_loop.
      
      I tried to make the change as obvious as possible. It should be easy
      to verify that nothing changes if skb_frag_must_loop returns false.
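A userspace sketch of the per-page iteration, assuming a 4096-byte page. `PAGE_SZ`, `chunk_len()` and `count_chunks()` are hypothetical names used only to show how a range inside a compound page splits into single-page pieces.

```c
#include <stddef.h>

#define PAGE_SZ 4096u  /* assumed page size for the example */

/* Length of the next chunk starting at byte `offset` of a compound
 * (multi-page) buffer, never crossing a page boundary. */
size_t chunk_len(size_t offset, size_t remaining)
{
    size_t room = PAGE_SZ - (offset % PAGE_SZ);
    return remaining < room ? remaining : room;
}

/* Count how many single-page mappings are needed to walk the range. */
size_t count_chunks(size_t offset, size_t len)
{
    size_t n = 0;

    while (len > 0) {
        size_t c = chunk_len(offset, len);
        offset += c;
        len -= c;
        n++;
    }
    return n;
}
```

A range that stays inside one page yields a single chunk, which mirrors the case where looping is not needed.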
      
      Tested:
        On an x86 platform with
          CONFIG_HIGHMEM=y
          CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP=y
          CONFIG_NETFILTER_XT_MATCH_STRING=y
      
        Run
          ip link set dev lo mtu 1500
          iptables -A OUTPUT -m string --string 'badstring' --algo bm -j ACCEPT
          dd if=/dev/urandom of=in bs=1M count=20
          nc -l -p 8000 > /dev/null &
          nc -w 1 -q 0 localhost 8000 < in
      
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. Jan 09, 2021
    • tipc: fix NULL deref in tipc_link_xmit() · b7741344
      Hoang Le authored
      
      
      The buffer list can contain zero skbs on the following path:
      tipc_named_node_up()->tipc_node_xmit()->tipc_link_xmit(), so
      we need to check the list before casting its head to an &sk_buff.
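The guard amounts to checking for an empty list before touching its first element. A minimal sketch with a hypothetical singly linked list (`struct buf`, `struct buf_list` and `first_importance()` are illustrative, not TIPC code):

```c
#include <stddef.h>

struct buf {
    struct buf *next;
    int importance;
};

struct buf_list {
    struct buf *head;
};

/* Return the importance of the first buffer, or -1 if the list is empty.
 * Testing the head first avoids dereferencing a NULL pointer. */
int first_importance(const struct buf_list *list)
{
    if (list->head == NULL)
        return -1;
    return list->head->importance;
}
```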
      
      Fault report:
       [] tipc: Bulk publication failure
       [] general protection fault, probably for non-canonical [#1] PREEMPT [...]
       [] KASAN: null-ptr-deref in range [0x00000000000000c8-0x00000000000000cf]
       [] CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted 5.10.0-rc4+ #2
       [] Hardware name: Bochs ..., BIOS Bochs 01/01/2011
       [] RIP: 0010:tipc_link_xmit+0xc1/0x2180
       [] Code: 24 b8 00 00 00 00 4d 39 ec 4c 0f 44 e8 e8 d7 0a 10 f9 48 [...]
       [] RSP: 0018:ffffc90000006ea0 EFLAGS: 00010202
       [] RAX: dffffc0000000000 RBX: ffff8880224da000 RCX: 1ffff11003d3cc0d
       [] RDX: 0000000000000019 RSI: ffffffff886007b9 RDI: 00000000000000c8
       [] RBP: ffffc90000007018 R08: 0000000000000001 R09: fffff52000000ded
       [] R10: 0000000000000003 R11: fffff52000000dec R12: ffffc90000007148
       [] R13: 0000000000000000 R14: 0000000000000000 R15: ffffc90000007018
       [] FS:  0000000000000000(0000) GS:ffff888037400000(0000) knlGS:000[...]
       [] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [] CR2: 00007fffd2db5000 CR3: 000000002b08f000 CR4: 00000000000006f0
      
      Fixes: af9b028e ("tipc: make media xmit call outside node spinlock context")
      Acked-by: Jon Maloy <jmaloy@redhat.com>
      Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
      Link: https://lore.kernel.org/r/20210108071337.3598-1-hoang.h.le@dektech.com.au
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ipv6: Validate GSO SKB before finish IPv6 processing · b210de4f
      Aya Levin authored
      
      
      There are cases where GSO segment's length exceeds the egress MTU:
       - Forwarding of a TCP GRO skb, when DF flag is not set.
       - Forwarding of an skb that arrived on a virtualisation interface
         (virtio-net/vhost/tap) with TSO/GSO size set by other network
         stack.
       - Local GSO skb transmitted on an NETIF_F_TSO tunnel stacked over an
         interface with a smaller MTU.
       - Arriving GRO skb (or GSO skb in a virtualised environment) that is
         bridged to a NETIF_F_TSO tunnel stacked over an interface with an
         insufficient MTU.
      
      If so:
       - Consume the SKB and its segments.
       - Issue an ICMP packet with 'Packet Too Big' message containing the
         MTU, allowing the source host to reduce its Path MTU appropriately.
      
      Note: These cases are handled in the same manner in IPv4 output finish.
      This patch aligns the behavior of IPv6 and the one of IPv4.
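The decision above can be pictured with a small sketch. `enum verdict` and `check_gso_segment()` are hypothetical; the real check works on skb metadata, which is not modeled here.

```c
#include <stdbool.h>

enum verdict { VERDICT_SEND, VERDICT_DROP_PKT_TOO_BIG };

/* seg_len: length of one resulting segment, including network and
 * transport headers. Non-GSO packets take the regular path, which this
 * sketch does not model. */
enum verdict check_gso_segment(unsigned int seg_len, unsigned int mtu,
                               bool is_gso)
{
    if (is_gso && seg_len > mtu)
        return VERDICT_DROP_PKT_TOO_BIG;  /* consume skb, report MTU via ICMP */
    return VERDICT_SEND;
}
```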
      
      Fixes: 9e508490 ("netfilter: ipv6: move POSTROUTING invocation before fragmentation")
      Signed-off-by: Aya Levin <ayal@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/1610027418-30438-1-git-send-email-ayal@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: make sure devices go through netdev_wait_all_refs · 766b0515
      Jakub Kicinski authored
      
      
      If register_netdevice() fails at the very last stage - the
      notifier call - some subsystems may have already seen it and
      grabbed a reference. struct net_device can't be freed right
      away without calling netdev_wait_all_refs().
      
      Now that we have a clean interface in form of dev->needs_free_netdev
      and lenient free_netdev() we can undo what commit 93ee31f1 ("[NET]:
      Fix free_netdev on register_netdev failure.") has done and complete
      the unregistration path by bringing the net_set_todo() call back.
      
      After registration fails user is still expected to explicitly
      free the net_device, so make sure ->needs_free_netdev is cleared,
      otherwise rolling back the registration will cause the old double
      free for callers who release rtnl_lock before the free.
      
      This also solves the problem of priv_destructor not being called
      on notifier error.
      
      net_set_todo() will be moved back into unregister_netdevice_queue()
      in a follow up.
      
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Reported-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: make free_netdev() more lenient with unregistering devices · c269a24c
      Jakub Kicinski authored
      
      
      There are two flavors of handling netdev registration:
       - ones called without holding rtnl_lock: register_netdev() and
         unregister_netdev(); and
       - those called with rtnl_lock held: register_netdevice() and
         unregister_netdevice().
      
      While the semantics of the former are pretty clear, the same can't
      be said about the latter. The netdev_todo mechanism is utilized to
      perform some of the device unregistering tasks and it hooks into
      rtnl_unlock() so the locked variants can't actually finish the work.
      In general free_netdev() does not mix well with locked calls. Most
      drivers operating under rtnl_lock set dev->needs_free_netdev to true
      and expect core to make the free_netdev() call some time later.
      
      The part where this becomes most problematic is error paths. There is
      no way to unwind the state cleanly after a call to register_netdevice(),
      since unreg can't be performed fully without dropping locks.
      
      Make free_netdev() more lenient, and defer the freeing if device
      is being unregistered. This allows error paths to simply call
      free_netdev() both after register_netdevice() failed, and after
      a call to unregister_netdevice() but before dropping rtnl_lock.
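A userspace sketch of the deferral idea, under the assumption that a device tracks a registration state and a "free me later" flag. `struct dev`, `enum reg_state` and `dev_free()` are hypothetical stand-ins, not the kernel structures.

```c
#include <stdbool.h>
#include <stdlib.h>

enum reg_state { REG_UNINITIALIZED, REG_REGISTERED, REG_UNREGISTERING };

struct dev {
    enum reg_state state;
    bool needs_free;  /* core frees the device once unregistration completes */
};

/* Returns true if the device was freed immediately, false if the free was
 * deferred because unregistration is still in flight. */
bool dev_free(struct dev *d)
{
    if (d->state == REG_UNREGISTERING) {
        d->needs_free = true;
        return false;
    }
    free(d);
    return true;
}
```

An error path can then call the free routine unconditionally and let the state decide whether the memory goes away now or later.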
      
      Simplify the error paths which are currently doing gymnastics
      around free_netdev() handling.
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • docs: net: explain struct net_device lifetime · 2b446e65
      Jakub Kicinski authored
      
      
      Explain the two basic flows of struct net_device's operation.
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • udp: Prevent reuseport_select_sock from reading uninitialized socks · fd2ddef0
      Baptiste Lepers authored
      
      
      reuse->socks[] is modified concurrently by reuseport_add_sock. To
      prevent reading values that have not been fully initialized, only read
      the array up until the last known safe index instead of incorrectly
      re-reading the last index of the array.
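One way to picture the "last known safe index" rule is a reader that snapshots the published count once and never indexes past it. The sketch below uses C11 atomics and assumes a single writer; `struct group`, `group_add()` and `group_select()` are hypothetical and simplify the real data structure.

```c
#include <stdatomic.h>
#include <stddef.h>

#define MAX_SOCKS 8

struct group {
    int socks[MAX_SOCKS];     /* stand-in for socket pointers */
    atomic_size_t num_socks;  /* published count */
};

/* Writer (assumed to be serialized elsewhere): fill the slot first, then
 * publish the new count so readers never see an unfilled slot. */
void group_add(struct group *g, int sock)
{
    size_t n = atomic_load_explicit(&g->num_socks, memory_order_relaxed);

    if (n == MAX_SOCKS)
        return;
    g->socks[n] = sock;
    atomic_store_explicit(&g->num_socks, n + 1, memory_order_release);
}

/* Reader: snapshot the count once and only index below that snapshot. */
int group_select(struct group *g, size_t hash)
{
    size_t n = atomic_load_explicit(&g->num_socks, memory_order_acquire);

    if (n == 0)
        return -1;
    return g->socks[hash % n];
}
```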
      
      Fixes: acdcecc6 ("udp: correct reuseport selection with connected sockets")
      Signed-off-by: Baptiste Lepers <baptiste.lepers@gmail.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Link: https://lore.kernel.org/r/20210107051110.12247-1-baptiste.lepers@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: fix use-after-free when UDP GRO with shared fraglist · 53475c5d
      Dongseok Yi authored
      
      
      skbs in fraglist could be shared by a BPF filter loaded at TC. If TC
      writes, it will call skb_ensure_writable -> pskb_expand_head to create
      a private linear section for the head_skb. And then call
      skb_clone_fraglist -> skb_get on each skb in the fraglist.
      
      skb_segment_list overwrites part of the skb linear section of each
      fragment itself. Even after skb_clone, the frag_skbs share their
      linear section with their clone in PF_PACKET.
      
      Both sk_receive_queue of PF_PACKET and PF_INET (or PF_INET6) can have
      a link for the same frag_skbs chain. If a new skb (not frags) is
      queued to one of the sk_receive_queue, multiple ptypes can see and
      release this. It causes use-after-free.
      
      [ 4443.426215] ------------[ cut here ]------------
      [ 4443.426222] refcount_t: underflow; use-after-free.
      [ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190 refcount_dec_and_test_checked+0xa4/0xc8
      [ 4443.426726] pstate: 60400005 (nZCv daif +PAN -UAO)
      [ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
      [ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
      [ 4443.426808] Call trace:
      [ 4443.426813]  refcount_dec_and_test_checked+0xa4/0xc8
      [ 4443.426823]  skb_release_data+0x144/0x264
      [ 4443.426828]  kfree_skb+0x58/0xc4
      [ 4443.426832]  skb_queue_purge+0x64/0x9c
      [ 4443.426844]  packet_set_ring+0x5f0/0x820
      [ 4443.426849]  packet_setsockopt+0x5a4/0xcd0
      [ 4443.426853]  __sys_setsockopt+0x188/0x278
      [ 4443.426858]  __arm64_sys_setsockopt+0x28/0x38
      [ 4443.426869]  el0_svc_common+0xf0/0x1d0
      [ 4443.426873]  el0_svc_handler+0x74/0x98
      [ 4443.426880]  el0_svc+0x8/0xc
      
      Fixes: 3a1296a3 ("net: Support GRO/GSO fraglist chaining.")
      Signed-off-by: Dongseok Yi <dseok.yi@samsung.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/r/1610072918-174177-1-git-send-email-dseok.yi@samsung.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  3. Jan 08, 2021
  4. Jan 07, 2021
    • net: ip: always refragment ip defragmented packets · bb4cc1a1
      Florian Westphal authored
      
      
      Conntrack reassembly records the largest fragment size seen in IPCB.
      However, when this gets forwarded/transmitted, fragmentation will only
      be forced if one of the fragmented packets had the DF bit set.
      
      In that case, a flag in IPCB will force fragmentation even if the
      MTU is large enough.
      
      This should work fine, but it breaks with ip tunnels.
      Consider a client that sends a UDP datagram of size X to another host.
      
      The client fragments the datagram, so two packets, of size y and z, are
      sent. DF bit is not set on any of these packets.
      
      Middlebox netfilter reassembles those packets back to single size-X
      packet, before routing decision.
      
      The packet-size-vs-mtu checks in ip_forward are irrelevant, because the
      DF bit isn't set. At output time, ip refragmentation is skipped as well
      because X is still smaller than the mtu of the output device.

      If the transmit device is an ip tunnel, the packet size increases to
      X+overhead.
      
      Also, tunnel might be configured to force DF bit on outer header.
      
      In this case, packet will be dropped (exceeds MTU) and an ICMP error is
      generated back to sender.
      
      But the sender already respects the announced MTU: all the packets
      that it sent did fit the announced mtu.
      
      Force refragmentation as per original sizes unconditionally so ip tunnel
      will encapsulate the fragments instead.
      
      The only other solution I see is to place ip refragmentation in
      the ip_tunnel code to handle this case.
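The size argument above comes down to two comparisons. A sketch with hypothetical helper names and caller-supplied numbers, none of which come from the commit:

```c
#include <stdbool.h>

/* The reassembled packet passes the size check made before encapsulation... */
bool fits_before_encap(unsigned int pkt_len, unsigned int mtu)
{
    return pkt_len <= mtu;
}

/* ...but can still exceed the MTU once the tunnel overhead is added. */
bool fits_after_encap(unsigned int pkt_len, unsigned int overhead,
                      unsigned int mtu)
{
    return pkt_len + overhead <= mtu;
}
```

For instance, a 1490-byte reassembled packet on a 1500-byte MTU passes the first test, but fails the second with 20 bytes of assumed tunnel overhead. Refragmenting to the original fragment sizes before encapsulation avoids that.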
      
      Fixes: d6b915e2 ("ip_fragment: don't forward defragmented DF packet")
      Reported-by: Christian Perle <christian.perle@secunet.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: fix pmtu check in nopmtudisc mode · 50c66167
      Florian Westphal authored
      
      
      For some reason ip_tunnel insists on setting the DF bit anyway when the
      inner header has the DF bit set, EVEN if the tunnel was configured with
      'nopmtudisc'.
      
      This means that the script added in the previous commit
      cannot be made to work by adding the 'nopmtudisc' flag to the
      ip tunnel configuration. Doing so breaks connectivity even for the
      without-conntrack/netfilter scenario.
      
      When nopmtudisc is set, the tunnel will skip the mtu check, so no
      icmp error is sent to client. Then, because inner header has DF set,
      the outer header gets added with DF bit set as well.
      
      IP stack then sends an error to itself because the packet exceeds
      the device MTU.
      
      Fixes: 23a3647b ("ip_tunnels: Use skb-len to PMTU check.")
      Cc: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ipv6: fib: flush exceptions when purging route · d8f5c296
      Sean Tranchetti authored
      
      
      Route removal is handled by two code paths. The main removal path is via
      fib6_del_route() which will handle purging any PMTU exceptions from the
      cache, removing all per-cpu copies of the DST entry used by the route, and
      releasing the fib6_info struct.
      
      The second removal location is during fib6_add_rt2node() during a route
      replacement operation. This path also calls fib6_purge_rt() to handle
      cleaning up the per-cpu copies of the DST entries and releasing the
      fib6_info associated with the older route, but it does not flush any PMTU
      exceptions that the older route had. Since the older route is removed from
      the tree during the replacement, we lose any way of accessing it again.
      
      As these lingering DSTs and the fib6_info struct are holding references to
      the underlying netdevice struct as well, unregistering that device from the
      kernel can never complete.
      
      Fixes: 2b760fcf ("ipv6: hook up exception table to store dst cache")
      Signed-off-by: Sean Tranchetti <stranche@codeaurora.org>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/1609892546-11389-1-git-send-email-stranche@quicinc.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  5. Jan 06, 2021
  6. Jan 05, 2021
  7. Jan 04, 2021
  8. Dec 28, 2020
    • erspan: fix version 1 check in gre_parse_header() · 085c7c4e
      Cong Wang authored
      
      
      Both version 0 and version 1 use ETH_P_ERSPAN, but version 0 does not
      have an erspan header. So the check in gre_parse_header() is wrong,
      we have to distinguish version 1 from version 0.
      
      We can just check the gre header length like is_erspan_type1().
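A sketch of a length-based check, under the assumption that a version 0 (type I) packet carries only the 4-byte base GRE header while version 1 carries a longer one. The helper names are hypothetical and the threshold is that assumption, not a quote from the kernel source.

```c
#include <stdbool.h>

/* Assumption: version 0 (type I) uses only the 4-byte base GRE header. */
bool erspan_is_type1(int gre_hdr_len)
{
    return gre_hdr_len == 4;
}

/* Only look for an ERSPAN header when the packet is not type I. */
bool has_erspan_header(int gre_hdr_len)
{
    return !erspan_is_type1(gre_hdr_len);
}
```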
      
      Fixes: cb73ee40 ("net: ip_gre: use erspan key field for tunnel lookup")
      Reported-by: <syzbot+f583ce3d4ddf9836b27a@syzkaller.appspotmail.com>
      Cc: William Tu <u9012063@gmail.com>
      Cc: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
      Signed-off-by: Cong Wang <cong.wang@bytedance.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: prevent invalid Scell_log shift count · bd1248f1
      Randy Dunlap authored
      
      
      Check Scell_log shift size in red_check_params() and modify all callers
      of red_check_params() to pass Scell_log.
      
      This prevents a shift out-of-bounds as detected by UBSAN:
        UBSAN: shift-out-of-bounds in ./include/net/red.h:252:22
        shift exponent 72 is too large for 32-bit type 'int'
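The validation can be sketched as a bound on the shift count before it is ever used. `scell_log_valid()` and `idle_slot()` are hypothetical helpers; the bound of 32 follows from the 32-bit operand named in the UBSAN report.

```c
#include <stdbool.h>
#include <stdint.h>

/* Reject shift counts that are not valid for a 32-bit operand. */
bool scell_log_valid(uint8_t scell_log)
{
    return scell_log < 32;
}

/* Shift only after validation; returns 0 for an invalid count. */
uint32_t idle_slot(uint32_t us_idle, uint8_t scell_log)
{
    return scell_log_valid(scell_log) ? us_idle >> scell_log : 0;
}
```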
      
      Fixes: 8afa10cb ("net_sched: red: Avoid illegal values")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Reported-by: <syzbot+97c5bd9cc81eca63d36e@syzkaller.appspotmail.com>
      Cc: Nogah Frankel <nogahf@mellanox.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jiri Pirko <jiri@resnulli.us>
      Cc: netdev@vger.kernel.org
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: neighbor: fix a crash caused by mod zero · a533b70a
      weichenchen authored
      
      
      pneigh_enqueue() tries to obtain a random delay by taking a random
      value modulo NEIGH_VAR(p, PROXY_DELAY). However, NEIGH_VAR(p, PROXY_DELAY)
      might be zero at that point because someone could write zero
      to /proc/sys/net/ipv4/neigh/[device]/proxy_delay after the
      callers check it.
      
      This patch uses prandom_u32_max() to get a random delay instead
      which avoids potential division by zero.
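A userspace sketch of the multiply-and-shift technique for bounding a random value, which involves no division. `bounded()` is a hypothetical name, and the random input is passed in by the caller so the example stays deterministic.

```c
#include <stdint.h>

/* Map a 32-bit random value into [0, limit) with a multiply and a shift.
 * Unlike `rnd % limit`, there is no division, so limit == 0 simply
 * yields 0 instead of faulting. */
uint32_t bounded(uint32_t rnd, uint32_t limit)
{
    return (uint32_t)(((uint64_t)rnd * limit) >> 32);
}
```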
      
      Signed-off-by: weichenchen <weichen.chen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst() · 21fdca22
      Guillaume Nault authored
      
      
      RT_TOS() only clears one of the ECN bits. Therefore, when
      fib_compute_spec_dst() resorts to a fib lookup, it can return
      different results depending on the value of the second ECN bit.
      
      For example, ECT(0) and ECT(1) packets could be treated differently.
      
        $ ip netns add ns0
        $ ip netns add ns1
        $ ip link add name veth01 netns ns0 type veth peer name veth10 netns ns1
        $ ip -netns ns0 link set dev lo up
        $ ip -netns ns1 link set dev lo up
        $ ip -netns ns0 link set dev veth01 up
        $ ip -netns ns1 link set dev veth10 up
      
        $ ip -netns ns0 address add 192.0.2.10/24 dev veth01
        $ ip -netns ns1 address add 192.0.2.11/24 dev veth10
      
        $ ip -netns ns1 address add 192.0.2.21/32 dev lo
        $ ip -netns ns1 route add 192.0.2.10/32 tos 4 dev veth10 src 192.0.2.21
        $ ip netns exec ns1 sysctl -wq net.ipv4.icmp_echo_ignore_broadcasts=0
      
      With TOS 4 and ECT(1), ns1 replies using source address 192.0.2.21
      (ping uses -Q to set all TOS and ECN bits):
      
        $ ip netns exec ns0 ping -c 1 -b -Q 5 192.0.2.255
        [...]
        64 bytes from 192.0.2.21: icmp_seq=1 ttl=64 time=0.544 ms
      
      But with TOS 4 and ECT(0), ns1 replies using source address 192.0.2.11
      because the "tos 4" route isn't matched:
      
        $ ip netns exec ns0 ping -c 1 -b -Q 6 192.0.2.255
        [...]
        64 bytes from 192.0.2.11: icmp_seq=1 ttl=64 time=0.597 ms
      
      After this patch the ECN bits don't affect the result anymore:
      
        $ ip netns exec ns0 ping -c 1 -b -Q 6 192.0.2.255
        [...]
        64 bytes from 192.0.2.21: icmp_seq=1 ttl=64 time=0.591 ms
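The ping results above follow from the bit masks involved. In this sketch 0x1E is the value of IPTOS_TOS_MASK, which keeps one of the two ECN bits, and 0x1C is the same mask with both ECN bits (0x03) cleared. The macro and function names are hypothetical.

```c
#include <stdint.h>

#define TOS_MASK_ONE_ECN_BIT 0x1E  /* IPTOS_TOS_MASK: keeps one ECN bit */
#define TOS_MASK_NO_ECN_BITS 0x1C  /* same mask with both ECN bits cleared */

/* TOS value used for the lookup before the fix. */
uint8_t lookup_tos_old(uint8_t tos)
{
    return tos & TOS_MASK_ONE_ECN_BIT;
}

/* TOS value used for the lookup once the ECN bits are ignored. */
uint8_t lookup_tos_new(uint8_t tos)
{
    return tos & TOS_MASK_NO_ECN_BITS;
}
```

With `-Q 5` the old mask gives 4, which matches the "tos 4" route, while `-Q 6` gives 6, which does not. The stricter mask gives 4 in both cases.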
      
      Fixes: 35ebf65e ("ipv4: Create and use fib_compute_spec_dst() helper.")
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mptcp: cap forward allocation to 1M · e7579d5d
      Davide Caratti authored
      
      
      The following syzkaller reproducer:
      
       r0 = socket$inet_mptcp(0x2, 0x1, 0x106)
       bind$inet(r0, &(0x7f0000000080)={0x2, 0x4e24, @multicast2}, 0x10)
       connect$inet(r0, &(0x7f0000000480)={0x2, 0x4e24, @local}, 0x10)
       sendto$inet(r0, &(0x7f0000000100)="f6", 0xffffffe7, 0xc000, 0x0, 0x0)
      
      systematically triggers the following warning:
      
       WARNING: CPU: 2 PID: 8618 at net/core/stream.c:208 sk_stream_kill_queues+0x3fa/0x580
       Modules linked in:
       CPU: 2 PID: 8618 Comm: syz-executor Not tainted 5.10.0+ #334
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/04
       RIP: 0010:sk_stream_kill_queues+0x3fa/0x580
       Code: df 48 c1 ea 03 0f b6 04 02 84 c0 74 04 3c 03 7e 40 8b ab 20 02 00 00 e9 64 ff ff ff e8 df f0 81 2
       RSP: 0018:ffffc9000290fcb0 EFLAGS: 00010293
       RAX: ffff888011cb8000 RBX: 0000000000000000 RCX: ffffffff86eecf0e
       RDX: 0000000000000000 RSI: ffffffff86eecf6a RDI: 0000000000000005
       RBP: 0000000000000e28 R08: ffff888011cb8000 R09: fffffbfff1f48139
       R10: ffffffff8fa409c7 R11: fffffbfff1f48138 R12: ffff8880215e6220
       R13: ffffffff8fa409c0 R14: ffffc9000290fd30 R15: 1ffff92000521fa2
       FS:  00007f41c78f4800(0000) GS:ffff88802d000000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f95c803d088 CR3: 0000000025ed2000 CR4: 00000000000006f0
       Call Trace:
        __mptcp_destroy_sock+0x4f5/0x8e0
         mptcp_close+0x5e2/0x7f0
        inet_release+0x12b/0x270
        __sock_release+0xc8/0x270
        sock_close+0x18/0x20
        __fput+0x272/0x8e0
        task_work_run+0xe0/0x1a0
        exit_to_user_mode_prepare+0x1df/0x200
        syscall_exit_to_user_mode+0x19/0x50
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Userspace programs can provide arbitrarily high values of 'len' in sendmsg():
      this is causing integer overflow of 'amount'. Cap forward allocation to 1
      megabyte: higher values are not really useful.
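A sketch of the clamp, assuming the length arrives as a `size_t` and later feeds signed `int` arithmetic. `FWD_ALLOC_CAP` and `capped_amount()` are hypothetical names.

```c
#include <stddef.h>

#define FWD_ALLOC_CAP (1024 * 1024)  /* 1 megabyte, per the commit text */

/* Clamp a user-supplied length before it is used in int arithmetic, so a
 * huge value cannot overflow the signed result. */
int capped_amount(size_t len)
{
    return len > FWD_ALLOC_CAP ? FWD_ALLOC_CAP : (int)len;
}
```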
      
      Suggested-by: Paolo Abeni <pabeni@redhat.com>
      Fixes: e93da928 ("mptcp: implement wmem reservation")
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Link: https://lore.kernel.org/r/3334d00d8b2faecafdfab9aa593efcbf61442756.1608584474.git.dcaratti@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net-sysfs: take the rtnl lock when accessing xps_rxqs_map and num_tc · 4ae2bb81
      Antoine Tenart authored
      
      
      Accesses to dev->xps_rxqs_map (when using dev->num_tc) should be
      protected by the rtnl lock, like we do for netif_set_xps_queue. I didn't
      see an actual bug being triggered, but let's be safe here and take the
      rtnl lock while accessing the map in sysfs.
      
      Fixes: 8af2c06f ("net-sysfs: Add interface for Rx queue(s) map per Tx queue")
      Signed-off-by: Antoine Tenart <atenart@kernel.org>
      Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net-sysfs: take the rtnl lock when storing xps_rxqs · 2d57b4f1
      Antoine Tenart authored
      
      
      Two race conditions can be triggered when storing xps rxqs, resulting in
      various oops and invalid memory accesses:
      
      1. Calling netdev_set_num_tc while netif_set_xps_queue is running:

         - netif_set_xps_queue uses dev->num_tc as one of the parameters to
           compute the size of new_dev_maps when allocating it. dev->num_tc is
           also used to access the map, and the compiler may generate code to
           retrieve this field multiple times in the function.

         - netdev_set_num_tc sets dev->num_tc.

         If new_dev_maps is allocated using dev->num_tc and then dev->num_tc
         is set to a higher value through netdev_set_num_tc, later accesses to
         new_dev_maps in netif_set_xps_queue could lead to accessing memory
         outside of new_dev_maps, triggering an oops.

      2. Calling netif_set_xps_queue while netdev_set_num_tc is running:

         2.1. netdev_set_num_tc starts by resetting the xps queues,
              dev->num_tc isn't updated yet.

         2.2. netif_set_xps_queue is called, setting up the map with the
              *old* dev->num_tc.

         2.3. netdev_set_num_tc updates dev->num_tc.
      
         2.4. Later accesses to the map lead to out of bound accesses and
              oops.
      
         A similar issue can be found with netdev_reset_tc.
      
      One way of triggering this is to set an iface up (for which the driver
      uses netdev_set_num_tc in the open path, such as bnx2x) and writing to
      xps_rxqs in a concurrent thread. With the right timing an oops is
      triggered.
      
      Both issues have the same fix: netif_set_xps_queue, netdev_set_num_tc
      and netdev_reset_tc should be mutually exclusive. We do that by taking
      the rtnl lock in xps_rxqs_store.
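The mutual exclusion can be sketched in userspace with a single mutex standing in for the rtnl lock. `struct dev`, `dev_set_num_tc()` and `dev_set_map()` are hypothetical and much simpler than the real XPS code; the point is only that the map size and the class count are read and written under the same lock.

```c
#include <pthread.h>
#include <stdlib.h>

struct dev {
    pthread_mutex_t lock;  /* plays the role of the rtnl lock */
    int num_tc;
    int *map;              /* sized from num_tc */
    int map_len;
};

/* Change the class count; the old map is dropped because its size is stale. */
void dev_set_num_tc(struct dev *d, int num_tc)
{
    pthread_mutex_lock(&d->lock);
    free(d->map);
    d->map = NULL;
    d->map_len = 0;
    d->num_tc = num_tc;
    pthread_mutex_unlock(&d->lock);
}

/* Allocate the map; num_tc cannot change while the lock is held, so the
 * size used for allocation and for later indexing always agree. */
int dev_set_map(struct dev *d, int per_tc)
{
    int ret = -1;

    pthread_mutex_lock(&d->lock);
    int len = d->num_tc * per_tc;
    if (len > 0) {
        int *m = calloc((size_t)len, sizeof *m);
        if (m != NULL) {
            free(d->map);
            d->map = m;
            d->map_len = len;
            ret = 0;
        }
    }
    pthread_mutex_unlock(&d->lock);
    return ret;
}
```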
      
      Fixes: 8af2c06f ("net-sysfs: Add interface for Rx queue(s) map per Tx queue")
      Signed-off-by: Antoine Tenart <atenart@kernel.org>
      Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net-sysfs: take the rtnl lock when accessing xps_cpus_map and num_tc · fb250385
      Antoine Tenart authored
      
      
      Accesses to dev->xps_cpus_map (when using dev->num_tc) should be
      protected by the rtnl lock, like we do for netif_set_xps_queue. I didn't
      see an actual bug being triggered, but let's be safe here and take the
      rtnl lock while accessing the map in sysfs.
      
      Fixes: 184c449f ("net: Add support for XPS with QoS via traffic classes")
      Signed-off-by: Antoine Tenart <atenart@kernel.org>
      Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net-sysfs: take the rtnl lock when storing xps_cpus · 1ad58225
      Antoine Tenart authored
      
      
      Two race conditions can be triggered when storing xps cpus, resulting in
      various oops and invalid memory accesses:
      
      1. Calling netdev_set_num_tc while netif_set_xps_queue is running:

         - netif_set_xps_queue uses dev->num_tc as one of the parameters to
           compute the size of new_dev_maps when allocating it. dev->num_tc is
           also used to access the map, and the compiler may generate code to
           retrieve this field multiple times in the function.

         - netdev_set_num_tc sets dev->num_tc.

         If new_dev_maps is allocated using dev->num_tc and then dev->num_tc
         is set to a higher value through netdev_set_num_tc, later accesses to
         new_dev_maps in netif_set_xps_queue could lead to accessing memory
         outside of new_dev_maps, triggering an oops.

      2. Calling netif_set_xps_queue while netdev_set_num_tc is running:

         2.1. netdev_set_num_tc starts by resetting the xps queues,
              dev->num_tc isn't updated yet.

         2.2. netif_set_xps_queue is called, setting up the map with the
              *old* dev->num_tc.

         2.3. netdev_set_num_tc updates dev->num_tc.
      
         2.4. Later accesses to the map lead to out of bound accesses and
              oops.
      
         A similar issue can be found with netdev_reset_tc.
      
      One way of triggering this is to set an interface up (one whose driver
      uses netdev_set_num_tc in the open path, such as bnx2x) and write to
      xps_cpus in a concurrent thread. With the right timing, an oops is
      triggered.
      
      Both issues have the same fix: netif_set_xps_queue, netdev_set_num_tc
      and netdev_reset_tc should be mutually exclusive. We do that by taking
      the rtnl lock in xps_cpus_store.
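
The mutual exclusion described above can be sketched in userspace C. This is only an illustration under stated assumptions, not the kernel code: `cfg_lock` stands in for the rtnl lock, and `demo_dev`, `demo_set_num_tc` and `demo_store_map` are hypothetical names. The point is that the map is sized, filled and published while the same lock that guards `num_tc` is held, so `num_tc` cannot change between the allocation and the accesses.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for the rtnl lock. */
static pthread_mutex_t cfg_lock = PTHREAD_MUTEX_INITIALIZER;

struct demo_dev {
    int num_tc;   /* number of traffic classes */
    int *map;     /* one slot per (tc, cpu) pair */
    int map_len;  /* number of slots allocated for map */
};

/* Analogue of netdev_set_num_tc: reset the map, then change num_tc. */
static void demo_set_num_tc(struct demo_dev *dev, int num_tc)
{
    pthread_mutex_lock(&cfg_lock);
    free(dev->map);
    dev->map = NULL;
    dev->map_len = 0;
    dev->num_tc = num_tc;
    pthread_mutex_unlock(&cfg_lock);
}

/* Analogue of the xps_cpus store path: size and fill the map under the lock. */
static int demo_store_map(struct demo_dev *dev, int num_cpus)
{
    int ret = -1;

    assert(num_cpus > 0);
    pthread_mutex_lock(&cfg_lock);
    int len = dev->num_tc * num_cpus;
    int *map = calloc((size_t)len, sizeof(*map));
    if (map) {
        /* num_tc is stable here because writers take the same lock. */
        for (int tc = 0; tc < dev->num_tc; tc++)
            for (int cpu = 0; cpu < num_cpus; cpu++)
                map[tc * num_cpus + cpu] = tc;
        free(dev->map);
        dev->map = map;
        dev->map_len = len;
        ret = 0;
    }
    pthread_mutex_unlock(&cfg_lock);
    return ret;
}
```

Without the lock in `demo_store_map`, a concurrent `demo_set_num_tc` could raise `num_tc` after `len` was computed, which is the out-of-bounds pattern from case 1.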
      
      Fixes: 184c449f ("net: Add support for XPS with QoS via traffic classes")
      Signed-off-by: Antoine Tenart <atenart@kernel.org>
      Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      1ad58225
    • libceph: align session_key and con_secret to 16 bytes · f5f2c9a0
      Ilya Dryomov authored
      
      
      crypto_shash_setkey() and crypto_aead_setkey() will do a (small)
      GFP_ATOMIC allocation to align the key if it isn't suitably aligned.
      It's not a big deal, but at the same time easy to avoid.
      
      The actual alignment requirement is dynamic, queryable with
      crypto_shash_alignmask() and crypto_aead_alignmask(), but shouldn't
      be stricter than 16 bytes for our algorithms.
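
As a rough userspace illustration of the idea (not the libceph change itself), key buffers can be declared with a fixed 16-byte alignment and checked against an alignment mask. The struct, its field sizes and `demo_is_aligned` are assumptions made for this sketch; the mask follows the "alignment minus one" convention used by alignmask queries.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_KEY_LEN 16 /* hypothetical key size, for illustration only */

struct demo_conn {
    uint8_t flags; /* odd-sized leading field that would otherwise misalign the keys */
    alignas(16) uint8_t session_key[DEMO_KEY_LEN];
    alignas(16) uint8_t con_secret[DEMO_KEY_LEN];
};

/* Return true if p satisfies the alignment described by alignmask (alignment - 1). */
static bool demo_is_aligned(const void *p, unsigned int alignmask)
{
    assert(((alignmask + 1) & alignmask) == 0); /* mask must be 2^n - 1 */
    return ((uintptr_t)p & alignmask) == 0;
}
```

With the buffers aligned at declaration time, a setkey-style consumer that requires at most 16-byte alignment never needs a temporary aligned copy.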
      
      Fixes: cd1a677c ("libceph, ceph: implement msgr2.1 protocol (crc and secure modes)")
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      f5f2c9a0
    • libceph: fix auth_signature buffer allocation in secure mode · ad32fe88
      Ilya Dryomov authored
      
      
      The auth_signature frame is 68 bytes in plain mode and 96 bytes in
      secure mode, but we are requesting 68 bytes in both modes. By luck,
      this doesn't actually result in any invalid memory accesses because
      the allocation is satisfied out of the kmalloc-96 slab and so exactly
      96 bytes are allocated, but KASAN rightfully complains.
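
Why a 68-byte request happens to yield 96 usable bytes can be shown with a small sketch of size-class rounding. The list of classes below is an assumed subset for illustration (real kernels vary by configuration), and the function names are hypothetical; only the 68 and 96 byte frame sizes come from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed subset of kmalloc size classes; the real set depends on the kernel config. */
static const size_t demo_classes[] = { 8, 16, 32, 64, 96, 128, 192, 256 };

/* Smallest listed size class that fits size, or 0 if none does. */
static size_t demo_bucket(size_t size)
{
    assert(size > 0);
    for (size_t i = 0; i < sizeof(demo_classes) / sizeof(demo_classes[0]); i++)
        if (size <= demo_classes[i])
            return demo_classes[i];
    return 0;
}

/* Frame sizes quoted in the commit message. */
static size_t demo_auth_signature_len(int secure)
{
    return secure ? 96 : 68;
}
```

Under these assumptions `demo_bucket(68)` lands in the 96-byte class, which is why the undersized request was harmless in practice. Requesting `demo_auth_signature_len(secure)` bytes removes the reliance on that coincidence.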
      
      Fixes: cd1a677c ("libceph, ceph: implement msgr2.1 protocol (crc and secure modes)")
      Reported-by: Luis Henriques <lhenriques@suse.de>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      ad32fe88
    • netfilter: nftables: add set expression flags · b4e70d8d
      Pablo Neira Ayuso authored
      
      
      The set flag NFT_SET_EXPR provides a hint to the kernel that userspace
      supports multiple expressions per set element. Similarly,
      NFT_DYNSET_F_EXPR specifies that the dynset expression defines
      multiple expressions per set element.
      
      This allows new userspace software with old kernels to bail out with
      EOPNOTSUPP. This update is similar to ef516e86 ("netfilter:
      nf_tables: reintroduce the NFT_SET_CONCAT flag"). The NFT_SET_EXPR flag
      needs to be set when the NFTA_SET_EXPRESSIONS attribute is specified.
      The NFT_SET_EXPR flag is not set with NFTA_SET_EXPR, to retain
      backward compatibility with old userspace binaries.
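
The bail-out mechanism relies on a kernel rejecting flag bits it does not know. A minimal sketch of that check, with illustrative bit values and hypothetical names (these are not the definitions from the uapi header):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Illustrative flag bits only; not the values from the uapi header. */
#define DEMO_SET_CONCAT (1u << 0)
#define DEMO_SET_EXPR   (1u << 1)

/*
 * Validate requested set flags against the flags a given (hypothetical)
 * kernel build understands. Any unknown bit yields -EOPNOTSUPP.
 */
static int demo_check_set_flags(uint32_t flags, uint32_t supported)
{
    assert((supported & ~(DEMO_SET_CONCAT | DEMO_SET_EXPR)) == 0);
    if (flags & ~supported)
        return -EOPNOTSUPP;
    return 0;
}
```

An "old kernel" in this model is one whose `supported` mask lacks `DEMO_SET_EXPR`. New userspace that sets the flag gets a clear error instead of a set that silently ignores the extra expressions.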
      
      Fixes: 48b0ae04 ("netfilter: nftables: netlink support for several set element expressions")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      b4e70d8d
    • netfilter: nft_dynset: report EOPNOTSUPP on missing set feature · 95cd4bca
      Pablo Neira Ayuso authored
      
      
      If userspace requests a feature which is not available in the original
      set definition, then bail out with EOPNOTSUPP. If userspace sends
      unsupported dynset flags (new feature not supported by this kernel),
      then report EOPNOTSUPP to userspace. EINVAL should only be used to
      report malformed netlink messages from userspace.
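
The error-code policy can be summarised in a short sketch. Everything here is a simplified model with hypothetical types and flag values, not the nft_dynset code: a malformed request maps to EINVAL, while an unknown flag or a feature missing from the set definition maps to EOPNOTSUPP.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag bits only. */
#define DEMO_DYNSET_F_INV  (1u << 0)
#define DEMO_DYNSET_F_EXPR (1u << 1)
#define DEMO_DYNSET_KNOWN  (DEMO_DYNSET_F_INV | DEMO_DYNSET_F_EXPR)

struct demo_request {
    bool has_set_name;  /* is the mandatory attribute present? */
    uint32_t flags;     /* requested dynset flags */
    bool wants_timeout; /* does the request rely on timeouts? */
};

struct demo_set {
    bool supports_timeout; /* was the set defined with timeout support? */
};

static int demo_dynset_init(const struct demo_request *req,
                            const struct demo_set *set)
{
    assert(req && set);
    if (!req->has_set_name)
        return -EINVAL;     /* malformed message */
    if (req->flags & ~DEMO_DYNSET_KNOWN)
        return -EOPNOTSUPP; /* flag unknown to this kernel */
    if (req->wants_timeout && !set->supports_timeout)
        return -EOPNOTSUPP; /* feature absent from the set definition */
    return 0;
}
```

Keeping the two codes distinct lets userspace tell "my message was broken" apart from "this kernel or set cannot do that".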
      
      Fixes: 22fe54d5 ("netfilter: nf_tables: add support for dynamic set updates")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      95cd4bca
  9. Dec 27, 2020
  10. Dec 23, 2020
  11. Dec 19, 2020
  12. Dec 18, 2020