- Mar 27, 2020
-
-
Chuck Lever authored
rpcrdma_cm_event_handler() is always passed an @id pointer that is valid. However, in a subsequent patch, we won't be able to extract an r_xprt in every case. So instead of using the r_xprt's presentation address strings, extract them from struct rdma_cm_id. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
I eventually want to allocate rpcrdma_ep separately from struct rpcrdma_xprt so that on occasion there can be more than one ep per xprt. The new struct rpcrdma_ep will contain all the fields currently in rpcrdma_ia and in rpcrdma_ep. This is all the device and CM settings for the connection, in addition to per-connection settings negotiated with the remote. Take this opportunity to rename the existing ep fields from rep_* to re_* to disambiguate these from struct rpcrdma_rep. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Completion errors after a disconnect often occur much sooner than a CM_DISCONNECT event. Use this to try to detect connection loss more quickly. Note that other kernel ULPs do take care to disconnect explicitly when a WR is flushed. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Clean up: The upper layer serializes calls to xprt_rdma_close, so there is no need for an atomic bit operation, saving 8 bytes in rpcrdma_ia. This enables merging rpcrdma_ia_remove directly into the disconnect logic. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Move rdma_cm_id creation into rpcrdma_ep_create() so that it is now responsible for allocating all per-connection hardware resources. With this clean-up, all three arms of the switch statement in rpcrdma_ep_connect are exactly the same now, thus the switch can be removed. Because device removal behaves a little differently than disconnection, there is a little more work to be done before rpcrdma_ep_destroy() can release the connection's rdma_cm_id. So it is not quite symmetrical with rpcrdma_ep_create() yet. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Make a Protection Domain (PD) a per-connection resource rather than a per-transport resource. In other words, when the connection terminates, the PD is destroyed. Thus there is one less HW resource that remains allocated to a transport after a connection is closed. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Clean up: Simplify the synopses of functions in the connect and disconnect paths in preparation for combining the rpcrdma_ia and struct rpcrdma_ep structures. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Clean up: Simplify the synopses of functions in the post_send path by combining the struct rpcrdma_ia and struct rpcrdma_ep arguments. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Clean up: prepare for combining the rpcrdma_ia and rpcrdma_ep structures. Take the opportunity to rename the function to be consistent with the "subsystem _ object _ verb" naming scheme. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Refactor rpcrdma_ep_create(), rpcrdma_ep_disconnect(), and rpcrdma_ep_destroy(). rpcrdma_ep_create will be invoked at connect time instead of at transport set-up time. It will be responsible for allocating per- connection resources. In this patch it allocates the CQs and creates a QP. More to come. rpcrdma_ep_destroy() is the inverse functionality that is invoked at disconnect time. It will be responsible for releasing the CQs and QP. These changes should be safe to do because both connect and disconnect is guaranteed to be serialized by the transport send lock. This takes us another step closer to resolving the address and route only at connect time so that connection failover to another device will work correctly. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Two changes: - Show the number of SG entries that were mapped. This helps debug DMA-related problems. - Record the MR's resource ID instead of its memory address. This groups each MR with its associated rdma-tool output, and reduces needless exposure of memory addresses. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
-
- Mar 12, 2020
-
-
Vinicius Costa Gomes authored
There was a bug that was causing packets to be sent to the driver without first calling dequeue() on the "child" qdisc. And the KASAN report below shows that sending a packet without calling dequeue() leads to bad results. The problem is that when checking the last qdisc "child" we do not set the returned skb to NULL, which can cause it to be sent to the driver, and so after the skb is sent, it may be freed, and in some situations a reference to it may still be in the child qdisc, because it was never dequeued. The crash log looks like this: [ 19.937538] ================================================================== [ 19.938300] BUG: KASAN: use-after-free in taprio_dequeue_soft+0x620/0x780 [ 19.938968] Read of size 4 at addr ffff8881128628cc by task swapper/1/0 [ 19.939612] [ 19.939772] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc3+ #97 [ 19.940397] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qe4 [ 19.941523] Call Trace: [ 19.941774] <IRQ> [ 19.941985] dump_stack+0x97/0xe0 [ 19.942323] print_address_description.constprop.0+0x3b/0x60 [ 19.942884] ? taprio_dequeue_soft+0x620/0x780 [ 19.943325] ? taprio_dequeue_soft+0x620/0x780 [ 19.943767] __kasan_report.cold+0x1a/0x32 [ 19.944173] ? taprio_dequeue_soft+0x620/0x780 [ 19.944612] kasan_report+0xe/0x20 [ 19.944954] taprio_dequeue_soft+0x620/0x780 [ 19.945380] __qdisc_run+0x164/0x18d0 [ 19.945749] net_tx_action+0x2c4/0x730 [ 19.946124] __do_softirq+0x268/0x7bc [ 19.946491] irq_exit+0x17d/0x1b0 [ 19.946824] smp_apic_timer_interrupt+0xeb/0x380 [ 19.947280] apic_timer_interrupt+0xf/0x20 [ 19.947687] </IRQ> [ 19.947912] RIP: 0010:default_idle+0x2d/0x2d0 [ 19.948345] Code: 00 00 41 56 41 55 65 44 8b 2d 3f 8d 7c 7c 41 54 55 53 0f 1f 44 00 00 e8 b1 b2 c5 fd e9 07 00 3 [ 19.950166] RSP: 0018:ffff88811a3efda0 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13 [ 19.950909] RAX: 0000000080000000 RBX: ffff88811a3a9600 RCX: ffffffff8385327e [ 19.951608] RDX: 1ffff110234752c0 RSI: 0000000000000000 RDI: ffffffff8385262f [ 19.952309] RBP: ffffed10234752c0 R08: 0000000000000001 R09: ffffed10234752c1 [ 19.953009] R10: ffffed10234752c0 R11: ffff88811a3a9607 R12: 0000000000000001 [ 19.953709] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000 [ 19.954408] ? default_idle_call+0x2e/0x70 [ 19.954816] ? default_idle+0x1f/0x2d0 [ 19.955192] default_idle_call+0x5e/0x70 [ 19.955584] do_idle+0x3d4/0x500 [ 19.955909] ? arch_cpu_idle_exit+0x40/0x40 [ 19.956325] ? _raw_spin_unlock_irqrestore+0x23/0x30 [ 19.956829] ? trace_hardirqs_on+0x30/0x160 [ 19.957242] cpu_startup_entry+0x19/0x20 [ 19.957633] start_secondary+0x2a6/0x380 [ 19.958026] ? set_cpu_sibling_map+0x18b0/0x18b0 [ 19.958486] secondary_startup_64+0xa4/0xb0 [ 19.958921] [ 19.959078] Allocated by task 33: [ 19.959412] save_stack+0x1b/0x80 [ 19.959747] __kasan_kmalloc.constprop.0+0xc2/0xd0 [ 19.960222] kmem_cache_alloc+0xe4/0x230 [ 19.960617] __alloc_skb+0x91/0x510 [ 19.960967] ndisc_alloc_skb+0x133/0x330 [ 19.961358] ndisc_send_ns+0x134/0x810 [ 19.961735] addrconf_dad_work+0xad5/0xf80 [ 19.962144] process_one_work+0x78e/0x13a0 [ 19.962551] worker_thread+0x8f/0xfa0 [ 19.962919] kthread+0x2ba/0x3b0 [ 19.963242] ret_from_fork+0x3a/0x50 [ 19.963596] [ 19.963753] Freed by task 33: [ 19.964055] save_stack+0x1b/0x80 [ 19.964386] __kasan_slab_free+0x12f/0x180 [ 19.964830] kmem_cache_free+0x80/0x290 [ 19.965231] ip6_mc_input+0x38a/0x4d0 [ 19.965617] ipv6_rcv+0x1a4/0x1d0 [ 19.965948] __netif_receive_skb_one_core+0xf2/0x180 [ 19.966437] netif_receive_skb+0x8c/0x3c0 [ 19.966846] br_handle_frame_finish+0x779/0x1310 [ 19.967302] br_handle_frame+0x42a/0x830 [ 19.967694] __netif_receive_skb_core+0xf0e/0x2a90 [ 19.968167] __netif_receive_skb_one_core+0x96/0x180 [ 19.968658] process_backlog+0x198/0x650 [ 19.969047] net_rx_action+0x2fa/0xaa0 [ 19.969420] __do_softirq+0x268/0x7bc [ 19.969785] [ 19.969940] The buggy address belongs to the object at ffff888112862840 [ 19.969940] which belongs to the cache skbuff_head_cache of size 224 [ 19.971202] The buggy address is located 140 bytes inside of [ 19.971202] 224-byte region [ffff888112862840, ffff888112862920) [ 19.972344] The buggy address belongs to the page: [ 19.972820] page:ffffea00044a1800 refcount:1 mapcount:0 mapping:ffff88811a2bd1c0 index:0xffff8881128625c0 compo0 [ 19.973930] flags: 0x8000000000010200(slab|head) [ 19.974388] raw: 8000000000010200 ffff88811a2ed650 ffff88811a2ed650 ffff88811a2bd1c0 [ 19.975151] raw: ffff8881128625c0 0000000000190013 00000001ffffffff 0000000000000000 [ 19.975915] page dumped because: kasan: bad access detected [ 19.976461] page_owner tracks the page as allocated [ 19.976946] page last allocated via order 2, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NO) [ 19.978332] prep_new_page+0x24b/0x330 [ 19.978707] get_page_from_freelist+0x2057/0x2c90 [ 19.979170] __alloc_pages_nodemask+0x218/0x590 [ 19.979619] new_slab+0x9d/0x300 [ 19.979948] ___slab_alloc.constprop.0+0x2f9/0x6f0 [ 19.980421] __slab_alloc.constprop.0+0x30/0x60 [ 19.980870] kmem_cache_alloc+0x201/0x230 [ 19.981269] __alloc_skb+0x91/0x510 [ 19.981620] alloc_skb_with_frags+0x78/0x4a0 [ 19.982043] sock_alloc_send_pskb+0x5eb/0x750 [ 19.982476] unix_stream_sendmsg+0x399/0x7f0 [ 19.982904] sock_sendmsg+0xe2/0x110 [ 19.983262] ____sys_sendmsg+0x4de/0x6d0 [ 19.983660] ___sys_sendmsg+0xe4/0x160 [ 19.984032] __sys_sendmsg+0xab/0x130 [ 19.984396] do_syscall_64+0xe7/0xae0 [ 19.984761] page last free stack trace: [ 19.985142] __free_pages_ok+0x432/0xbc0 [ 19.985533] qlist_free_all+0x56/0xc0 [ 19.985907] quarantine_reduce+0x149/0x170 [ 19.986315] __kasan_kmalloc.constprop.0+0x9e/0xd0 [ 19.986791] kmem_cache_alloc+0xe4/0x230 [ 19.987182] prepare_creds+0x24/0x440 [ 19.987548] do_faccessat+0x80/0x590 [ 19.987906] do_syscall_64+0xe7/0xae0 [ 19.988276] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 19.988775] [ 19.988930] Memory state around the buggy address: [ 19.989402] ffff888112862780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 19.990111] ffff888112862800: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb [ 19.990822] >ffff888112862880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 19.991529] ^ [ 19.992081] ffff888112862900: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc [ 19.992796] ffff888112862980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc Fixes: 5a781ccb ("tc: Add support for configuring the taprio scheduler") Reported-by:
Michael Schmidt <michael.schmidt@eti.uni-siegen.de> Signed-off-by:
Vinicius Costa Gomes <vinicius.gomes@intel.com> Acked-by:
Andre Guedes <andre.guedes@intel.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
Locking newsk while still holding the listener lock triggered a lockdep splat [1] We can simply move the memcg code after we release the listener lock, as this can also help if multiple threads are sharing a common listener. Also fix a typo while reading socket sk_rmem_alloc. [1] WARNING: possible recursive locking detected 5.6.0-rc3-syzkaller #0 Not tainted -------------------------------------------- syz-executor598/9524 is trying to acquire lock: ffff88808b5b8b90 (sk_lock-AF_INET6){+.+.}, at: lock_sock include/net/sock.h:1541 [inline] ffff88808b5b8b90 (sk_lock-AF_INET6){+.+.}, at: inet_csk_accept+0x69f/0xd30 net/ipv4/inet_connection_sock.c:492 but task is already holding lock: ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: lock_sock include/net/sock.h:1541 [inline] ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: inet_csk_accept+0x8d/0xd30 net/ipv4/inet_connection_sock.c:445 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(sk_lock-AF_INET6); lock(sk_lock-AF_INET6); *** DEADLOCK *** May be due to missing lock nesting notation 1 lock held by syz-executor598/9524: #0: ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: lock_sock include/net/sock.h:1541 [inline] #0: ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: inet_csk_accept+0x8d/0xd30 net/ipv4/inet_connection_sock.c:445 stack backtrace: CPU: 0 PID: 9524 Comm: syz-executor598 Not tainted 5.6.0-rc3-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x188/0x20d lib/dump_stack.c:118 print_deadlock_bug kernel/locking/lockdep.c:2370 [inline] check_deadlock kernel/locking/lockdep.c:2411 [inline] validate_chain kernel/locking/lockdep.c:2954 [inline] __lock_acquire.cold+0x114/0x288 kernel/locking/lockdep.c:3954 lock_acquire+0x197/0x420 kernel/locking/lockdep.c:4484 lock_sock_nested+0xc5/0x110 net/core/sock.c:2947 lock_sock include/net/sock.h:1541 [inline] inet_csk_accept+0x69f/0xd30 net/ipv4/inet_connection_sock.c:492 inet_accept+0xe9/0x7c0 net/ipv4/af_inet.c:734 __sys_accept4_file+0x3ac/0x5b0 net/socket.c:1758 __sys_accept4+0x53/0x90 net/socket.c:1809 __do_sys_accept4 net/socket.c:1821 [inline] __se_sys_accept4 net/socket.c:1818 [inline] __x64_sys_accept4+0x93/0xf0 net/socket.c:1818 do_syscall_64+0xf6/0x790 arch/x86/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x4445c9 Code: e8 0c 0d 03 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007ffc35b37608 EFLAGS: 00000246 ORIG_RAX: 0000000000000120 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004445c9 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003 RBP: 0000000000000000 R08: 0000000000306777 R09: 0000000000306777 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00000000004053d0 R14: 0000000000000000 R15: 0000000000000000 Fixes: d752a498 ("net: memcg: late association of sock to memcg") Signed-off-by:
Eric Dumazet <edumazet@google.com> Cc: Shakeel Butt <shakeelb@google.com> Reported-by:
syzbot <syzkaller@googlegroups.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Paolo Lungaroni authored
The Internet Assigned Numbers Authority (IANA) has recently assigned a protocol number value of 143 for Ethernet [1]. Before this assignment, encapsulation mechanisms such as Segment Routing used the IPv6-NoNxt protocol number (59) to indicate that the encapsulated payload is an Ethernet frame. In this patch, we add the definition of the Ethernet protocol number to the kernel headers and update the SRv6 L2 tunnels to use it. [1] https://www.iana.org/assignments/protocol-numbers/protocol-numbers.xhtml Signed-off-by:
Paolo Lungaroni <paolo.lungaroni@cnit.it> Reviewed-by:
Andrea Mayer <andrea.mayer@uniroma2.it> Acked-by:
Ahmed Abdelsalam <ahmed.abdelsalam@gssi.it> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Andrew Lunn authored
By default, DSA drivers should configure CPU and DSA ports to their maximum speed. In many configurations this is sufficient to make the link work. In some cases it is necessary to configure the link to run slower, e.g. because of limitations of the SoC it is connected to. Or back to back PHYs are used and the PHY needs to be driven in order to establish link. In this case, phylink is used. Only instantiate phylink if it is required. If there is no PHY, or no fixed link properties, phylink can upset a link which works in the default configuration. Fixes: 0e279218 ("net: dsa: Use PHYLINK for the CPU/DSA ports") Signed-off-by:
Andrew Lunn <andrew@lunn.ch> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Willem de Bruijn authored
In one error case, tpacket_rcv drops packets after incrementing the ring producer index. If this happens, it does not update tp_status to TP_STATUS_USER and thus the reader is stalled for an iteration of the ring, causing out of order arrival. The only such error path is when virtio_net_hdr_from_skb fails due to encountering an unknown GSO type. Signed-off-by:
Willem de Bruijn <willemb@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Amol Grover authored
caifdevs->list is traversed using list_for_each_entry_rcu() outside an RCU read-side critical section but under the protection of rtnl_mutex. Hence, add the corresponding lockdep expression to silence the following false-positive warning: [ 10.868467] ============================= [ 10.869082] WARNING: suspicious RCU usage [ 10.869817] 5.6.0-rc1-00177-g06ec0a154aae4 #1 Not tainted [ 10.870804] ----------------------------- [ 10.871557] net/caif/caif_dev.c:115 RCU-list traversed in non-reader section!! Reported-by:
kernel test robot <lkp@intel.com> Signed-off-by:
Amol Grover <frextrite@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Mar 11, 2020
-
-
Nicolas Cavallari authored
When trying to transmit to an unknown destination, the mesh code would unconditionally transmit a HWMP PREQ even if HWMP is not the current path selection algorithm. Signed-off-by:
Nicolas Cavallari <nicolas.cavallari@green-communications.fr> Link: https://lore.kernel.org/r/20200305140409.12204-1-cavallar@lri.fr Signed-off-by:
Johannes Berg <johannes.berg@intel.com>
-
Jakub Kicinski authored
Add missing attribute validation for NL80211_ATTR_OPER_CLASS to the netlink policy. Fixes: 1057d35e ("cfg80211: introduce TDLS channel switch commands") Signed-off-by:
Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20200303051058.4089398-4-kuba@kernel.org Signed-off-by:
Johannes Berg <johannes.berg@intel.com>
-
Jakub Kicinski authored
Add missing attribute validation for beacon report scanning to the netlink policy. Fixes: 1d76250b ("nl80211: support beacon report scanning") Signed-off-by:
Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20200303051058.4089398-3-kuba@kernel.org Signed-off-by:
Johannes Berg <johannes.berg@intel.com>
-
Jakub Kicinski authored
Add missing attribute validation for critical protocol fields to the netlink policy. Fixes: 5de17984 ("cfg80211: introduce critical protocol indication from user-space") Signed-off-by:
Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20200303051058.4089398-2-kuba@kernel.org Signed-off-by:
Johannes Berg <johannes.berg@intel.com>
-
- Mar 10, 2020
-
-
Karsten Graul authored
During IB device removal, cancel the event worker before the device structure is freed. Fixes: a4cf0443 ("smc: introduce SMC as an IB-client") Reported-by:
<syzbot+b297c6825752e7a07272@syzkaller.appspotmail.com> Signed-off-by:
Karsten Graul <kgraul@linux.ibm.com> Reviewed-by:
Ursula Braun <ubraun@linux.ibm.com> Reviewed-by:
Leon Romanovsky <leonro@mellanox.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Hangbin Liu authored
Rafał found an issue that for non-Ethernet interface, if we down and up frequently, the memory will be consumed slowly. The reason is we add allnodes/allrouters addressed in multicast list in ipv6_add_dev(). When link down, we call ipv6_mc_down(), store all multicast addresses via mld_add_delrec(). But when link up, we don't call ipv6_mc_up() for non-Ethernet interface to remove the addresses. This makes idev->mc_tomb getting bigger and bigger. The call stack looks like: addrconf_notify(NETDEV_REGISTER) ipv6_add_dev ipv6_dev_mc_inc(ff01::1) ipv6_dev_mc_inc(ff02::1) ipv6_dev_mc_inc(ff02::2) addrconf_notify(NETDEV_UP) addrconf_dev_config /* Alas, we support only Ethernet autoconfiguration. */ return; addrconf_notify(NETDEV_DOWN) addrconf_ifdown ipv6_mc_down igmp6_group_dropped(ff02::2) mld_add_delrec(ff02::2) igmp6_group_dropped(ff02::1) igmp6_group_dropped(ff01::1) After investigating, I can't found a rule to disable multicast on non-Ethernet interface. In RFC2460, the link could be Ethernet, PPP, ATM, tunnels, etc. In IPv4, it doesn't check the dev type when calls ip_mc_up() in inetdev_event(). Even for IPv6, we don't check the dev type and call ipv6_add_dev(), ipv6_dev_mc_inc() after register device. So I think it's OK to fix this memory consumer by calling ipv6_mc_up() for non-Ethernet interface. v2: Also check IFF_MULTICAST flag to make sure the interface supports multicast Reported-by:
Rafał Miłecki <zajec5@gmail.com> Tested-by:
Rafał Miłecki <zajec5@gmail.com> Fixes: 74235a25 ("[IPV6] addrconf: Fix IPv6 on tuntap tunnels") Fixes: 1666d49e ("mld: do not remove mld souce list info when set link down") Signed-off-by:
Hangbin Liu <liuhangbin@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Shakeel Butt authored
If a TCP socket is allocated in IRQ context or cloned from unassociated (i.e. not associated to a memcg) in IRQ context then it will remain unassociated for its whole life. Almost half of the TCPs created on the system are created in IRQ context, so, memory used by such sockets will not be accounted by the memcg. This issue is more widespread in cgroup v1 where network memory accounting is opt-in but it can happen in cgroup v2 if the source socket for the cloning was created in root memcg. To fix the issue, just do the association of the sockets at the accept() time in the process context and then force charge the memory buffer already used and reserved by the socket. Signed-off-by:
Shakeel Butt <shakeelb@google.com> Reviewed-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Dmitry Yakunin authored
In our production environment we have faced with problem that updating classid in cgroup with heavy tasks cause long freeze of the file tables in this tasks. By heavy tasks we understand tasks with many threads and opened sockets (e.g. balancers). This freeze leads to an increase number of client timeouts. This patch implements following logic to fix this issue: аfter iterating 1000 file descriptors file table lock will be released thus providing a time gap for socket creation/deletion. Now update is non atomic and socket may be skipped using calls: dup2(oldfd, newfd); close(oldfd); But this case is not typical. Moreover before this patch skip is possible too by hiding socket fd in unix socket buffer. New sockets will be allocated with updated classid because cgroup state is updated before start of the file descriptors iteration. So in common cases this patch has no side effects. Signed-off-by:
Dmitry Yakunin <zeil@yandex-team.ru> Reviewed-by:
Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Mar 09, 2020
-
-
Dmitry Yakunin authored
In commit 1ec17dbd ("inet_diag: fix reporting cgroup classid and fallback to priority") croup classid reporting was fixed. But this works only for TCP sockets because for other socket types icsk parameter can be NULL and classid code path is skipped. This change moves classid handling to inet_diag_msg_attrs_fill() function. Also inet_diag_msg_attrs_size() helper was added and addends in nlmsg_new() were reordered to save order from inet_sk_diag_fill(). Fixes: 1ec17dbd ("inet_diag: fix reporting cgroup classid and fallback to priority") Signed-off-by:
Dmitry Yakunin <zeil@yandex-team.ru> Reviewed-by:
Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
syzbot found an interesting case of the kernel reading an uninit-value [1] Problem is in the handling of ETH_P_WCCP in gre_parse_header() We look at the byte following GRE options to eventually decide if the options are four bytes longer. Use skb_header_pointer() to not pull bytes if we found that no more bytes were needed. All callers of gre_parse_header() are properly using pskb_may_pull() anyway before proceeding to next header. [1] BUG: KMSAN: uninit-value in pskb_may_pull include/linux/skbuff.h:2303 [inline] BUG: KMSAN: uninit-value in __iptunnel_pull_header+0x30c/0xbd0 net/ipv4/ip_tunnel_core.c:94 CPU: 1 PID: 11784 Comm: syz-executor940 Not tainted 5.6.0-rc2-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1c9/0x220 lib/dump_stack.c:118 kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:118 __msan_warning+0x58/0xa0 mm/kmsan/kmsan_instr.c:215 pskb_may_pull include/linux/skbuff.h:2303 [inline] __iptunnel_pull_header+0x30c/0xbd0 net/ipv4/ip_tunnel_core.c:94 iptunnel_pull_header include/net/ip_tunnels.h:411 [inline] gre_rcv+0x15e/0x19c0 net/ipv6/ip6_gre.c:606 ip6_protocol_deliver_rcu+0x181b/0x22c0 net/ipv6/ip6_input.c:432 ip6_input_finish net/ipv6/ip6_input.c:473 [inline] NF_HOOK include/linux/netfilter.h:307 [inline] ip6_input net/ipv6/ip6_input.c:482 [inline] ip6_mc_input+0xdf2/0x1460 net/ipv6/ip6_input.c:576 dst_input include/net/dst.h:442 [inline] ip6_rcv_finish net/ipv6/ip6_input.c:76 [inline] NF_HOOK include/linux/netfilter.h:307 [inline] ipv6_rcv+0x683/0x710 net/ipv6/ip6_input.c:306 __netif_receive_skb_one_core net/core/dev.c:5198 [inline] __netif_receive_skb net/core/dev.c:5312 [inline] netif_receive_skb_internal net/core/dev.c:5402 [inline] netif_receive_skb+0x66b/0xf20 net/core/dev.c:5461 tun_rx_batched include/linux/skbuff.h:4321 [inline] tun_get_user+0x6aef/0x6f60 drivers/net/tun.c:1997 tun_chr_write_iter+0x1f2/0x360 drivers/net/tun.c:2026 call_write_iter include/linux/fs.h:1901 [inline] new_sync_write fs/read_write.c:483 [inline] __vfs_write+0xa5a/0xca0 fs/read_write.c:496 vfs_write+0x44a/0x8f0 fs/read_write.c:558 ksys_write+0x267/0x450 fs/read_write.c:611 __do_sys_write fs/read_write.c:623 [inline] __se_sys_write fs/read_write.c:620 [inline] __ia32_sys_write+0xdb/0x120 fs/read_write.c:620 do_syscall_32_irqs_on arch/x86/entry/common.c:339 [inline] do_fast_syscall_32+0x3c7/0x6e0 arch/x86/entry/common.c:410 entry_SYSENTER_compat+0x68/0x77 arch/x86/entry/entry_64_compat.S:139 RIP: 0023:0xf7f62d99 Code: 90 e8 0b 00 00 00 f3 90 0f ae e8 eb f9 8d 74 26 00 89 3c 24 c3 90 90 90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90 RSP: 002b:00000000fffedb2c EFLAGS: 00000217 ORIG_RAX: 0000000000000004 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000020002580 RDX: 0000000000000fca RSI: 0000000000000036 RDI: 0000000000000004 RBP: 0000000000008914 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 Uninit was created at: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:144 [inline] kmsan_internal_poison_shadow+0x66/0xd0 mm/kmsan/kmsan.c:127 kmsan_slab_alloc+0x8a/0xe0 mm/kmsan/kmsan_hooks.c:82 slab_alloc_node mm/slub.c:2793 [inline] __kmalloc_node_track_caller+0xb40/0x1200 mm/slub.c:4401 __kmalloc_reserve net/core/skbuff.c:142 [inline] __alloc_skb+0x2fd/0xac0 net/core/skbuff.c:210 alloc_skb include/linux/skbuff.h:1051 [inline] alloc_skb_with_frags+0x18c/0xa70 net/core/skbuff.c:5766 sock_alloc_send_pskb+0xada/0xc60 net/core/sock.c:2242 tun_alloc_skb drivers/net/tun.c:1529 [inline] tun_get_user+0x10ae/0x6f60 drivers/net/tun.c:1843 tun_chr_write_iter+0x1f2/0x360 drivers/net/tun.c:2026 call_write_iter include/linux/fs.h:1901 [inline] new_sync_write fs/read_write.c:483 [inline] __vfs_write+0xa5a/0xca0 fs/read_write.c:496 vfs_write+0x44a/0x8f0 fs/read_write.c:558 ksys_write+0x267/0x450 fs/read_write.c:611 __do_sys_write fs/read_write.c:623 [inline] __se_sys_write fs/read_write.c:620 [inline] __ia32_sys_write+0xdb/0x120 fs/read_write.c:620 do_syscall_32_irqs_on arch/x86/entry/common.c:339 [inline] do_fast_syscall_32+0x3c7/0x6e0 arch/x86/entry/common.c:410 entry_SYSENTER_compat+0x68/0x77 arch/x86/entry/entry_64_compat.S:139 Fixes: 95f5c64c ("gre: Move utility functions to common headers") Fixes: c5441932 ("GRE: Refactor GRE tunneling code.") Signed-off-by:
Eric Dumazet <edumazet@google.com> Reported-by:
syzbot <syzkaller@googlegroups.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Mar 06, 2020
-
-
Pablo Neira Ayuso authored
Set owner to THIS_MODULE, otherwise the nft_chain_nat module might be removed while there are still inet/nat chains in place. [ 117.942096] BUG: unable to handle page fault for address: ffffffffa0d5e040 [ 117.942101] #PF: supervisor read access in kernel mode [ 117.942103] #PF: error_code(0x0000) - not-present page [ 117.942106] PGD 200c067 P4D 200c067 PUD 200d063 PMD 3dc909067 PTE 0 [ 117.942113] Oops: 0000 [#1] PREEMPT SMP PTI [ 117.942118] CPU: 3 PID: 27 Comm: kworker/3:0 Not tainted 5.6.0-rc3+ #348 [ 117.942133] Workqueue: events nf_tables_trans_destroy_work [nf_tables] [ 117.942145] RIP: 0010:nf_tables_chain_destroy.isra.0+0x94/0x15a [nf_tables] [ 117.942149] Code: f6 45 54 01 0f 84 d1 00 00 00 80 3b 05 74 44 48 8b 75 e8 48 c7 c7 72 be de a0 e8 56 e6 2d e0 48 8b 45 e8 48 c7 c7 7f be de a0 <48> 8b 30 e8 43 e6 2d e0 48 8b 45 e8 48 8b 40 10 48 85 c0 74 5b 8b [ 117.942152] RSP: 0018:ffffc9000015be10 EFLAGS: 00010292 [ 117.942155] RAX: ffffffffa0d5e040 RBX: ffff88840be87fc2 RCX: 0000000000000007 [ 117.942158] RDX: 0000000000000007 RSI: 0000000000000086 RDI: ffffffffa0debe7f [ 117.942160] RBP: ffff888403b54b50 R08: 0000000000001482 R09: 0000000000000004 [ 117.942162] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8883eda7e540 [ 117.942164] R13: dead000000000122 R14: dead000000000100 R15: ffff888403b3db80 [ 117.942167] FS: 0000000000000000(0000) GS:ffff88840e4c0000(0000) knlGS:0000000000000000 [ 117.942169] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 117.942172] CR2: ffffffffa0d5e040 CR3: 00000003e4c52002 CR4: 00000000001606e0 [ 117.942174] Call Trace: [ 117.942188] nf_tables_trans_destroy_work.cold+0xd/0x12 [nf_tables] [ 117.942196] process_one_work+0x1d6/0x3b0 [ 117.942200] worker_thread+0x45/0x3c0 [ 117.942203] ? process_one_work+0x3b0/0x3b0 [ 117.942210] kthread+0x112/0x130 [ 117.942214] ? kthread_create_worker_on_cpu+0x40/0x40 [ 117.942221] ret_from_fork+0x35/0x40 nf_tables_chain_destroy() crashes on module_put() because the module is gone. Fixes: d164385e ("netfilter: nat: add inet family nat support") Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org>
-
Paolo Abeni authored
Currently passive MPTCP socket can skip including the DACK option - if the peer sends data before accept() completes. The above happens because the msk 'can_ack' flag is set only after the accept() call. Such missing DACK option may cause - as per RFC spec - unwanted fallback to TCP. This change addresses the issue using the key material available in the current subflow, if any, to create a suitable dack option when msk ack seq is not yet available. v1 -> v2: - adavance the generated ack after the initial MPC packet Fixes: d22f4988 ("mptcp: process MP_CAPABLE data option") Signed-off-by:
Paolo Abeni <pabeni@redhat.com> Reviewed-by:
Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Dan Carpenter authored
This is similar to commit 674d9de0 ("NFC: Fix possible memory corruption when handling SHDLC I-Frame commands") and commit d7ee81ad ("NFC: nci: Add some bounds checking in nci_hci_cmd_received()") which added range checks on "pipe". The "pipe" variable comes skb->data[0] in nfc_hci_msg_rx_work(). It's in the 0-255 range. We're using it as the array index into the hdev->pipes[] array which has NFC_HCI_MAX_PIPES (128) members. Fixes: 118278f2 ("NFC: hci: Add pipes table to reference them with a tuple {gate, host}") Signed-off-by:
Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Mar 05, 2020
-
-
Florian Westphal authored
nft will loop forever if the kernel doesn't support an expression: 1. nft_expr_type_get() appends the family specific name to the module list. 2. -EAGAIN is returned to nfnetlink, nfnetlink calls abort path. 3. abort path sets ->done to true and calls request_module for the expression. 4. nfnetlink replays the batch, we end up in nft_expr_type_get() again. 5. nft_expr_type_get attempts to append family-specific name. This one already exists on the list, so we continue 6. nft_expr_type_get adds the generic expression name to the module list. -EAGAIN is returned, nfnetlink calls abort path. 7. abort path encounters the family-specific expression which has 'done' set, so it gets removed. 8. abort path requests the generic expression name, sets done to true. 9. batch is replayed. If the expression could not be loaded, then we will end up back at 1), because the family-specific name got removed and the cycle starts again. Note that userspace can SIGKILL the nft process to stop the cycle, but the desired behaviour is to return an error after the generic expr name fails to load the expression. Fixes: eb014de4 ("netfilter: nf_tables: autoload modules from the abort path") Signed-off-by:
Florian Westphal <fw@strlen.de> Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org>
-
Pablo Neira Ayuso authored
Missing NFTA_CHAIN_FLAGS netlink attribute when dumping basechain definitions. Fixes: c9626a2c ("netfilter: nf_tables: add hardware offload support") Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org>
-
- Mar 04, 2020
-
-
Jakub Kicinski authored
Add missing attribute validation for tunnel source and destination ports to the netlink policy. Fixes: af308b94 ("netfilter: nf_tables: add tunnel support") Signed-off-by:
Jakub Kicinski <kuba@kernel.org> Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org>
-
Jakub Kicinski authored
Add missing attribute validation for NFTA_PAYLOAD_CSUM_FLAGS to the netlink policy. Fixes: 18140969 ("netfilter: nft_payload: layer 4 checksum adjustment for pseudoheader fields") Signed-off-by:
Jakub Kicinski <kuba@kernel.org> Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org>
-
Jakub Kicinski authored
Add missing attribute validation for cthelper to the netlink policy. Fixes: 12f7a505 ("netfilter: add user-space connection tracking helper infrastructure") Signed-off-by:
Jakub Kicinski <kuba@kernel.org> Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org>
-
Florian Westphal authored
If hook registration fails, the hooks allocated via nft_netdev_hook_alloc need to be freed. We can't change the goto label to 'goto 5' -- while it does fix the memleak it does cause a warning splat from the netfilter core (the hooks were not registered). Fixes: 3f0465a9 ("netfilter: nf_tables: dynamically allocate hooks per net_device in flowtables") Reported-by:
<syzbot+a2ff6fa45162a5ed4dd3@syzkaller.appspotmail.com> Signed-off-by:
Florian Westphal <fw@strlen.de> Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org>
-
Vasily Averin authored
If .next function does not change position index, following .show function will repeat output related to current position index. Without patch: # dd if=/proc/net/ip_tables_matches # original file output conntrack conntrack conntrack recent recent icmp udplite udp tcp 0+1 records in 0+1 records out 65 bytes copied, 5.4074e-05 s, 1.2 MB/s # dd if=/proc/net/ip_tables_matches bs=62 skip=1 dd: /proc/net/ip_tables_matches: cannot skip to specified offset cp <<< end of last line tcp <<< and then unexpected whole last line once again 0+1 records in 0+1 records out 7 bytes copied, 0.000102447 s, 68.3 kB/s Cc: stable@vger.kernel.org Fixes: 1f4aace6 ("fs/seq_file.c: simplify seq_file iteration code ...") Link: https://bugzilla.kernel.org/show_bug.cgi?id=206283 Signed-off-by:
Vasily Averin <vvs@virtuozzo.com> Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org>
-
Vasily Averin authored
If .next function does not change position index, following .show function will repeat output related to current position index. Without the patch: # dd if=/proc/net/xt_recent/SSH # original file outpt src=127.0.0.4 ttl: 0 last_seen: 6275444819 oldest_pkt: 1 6275444819 src=127.0.0.2 ttl: 0 last_seen: 6275438906 oldest_pkt: 1 6275438906 src=127.0.0.3 ttl: 0 last_seen: 6275441953 oldest_pkt: 1 6275441953 0+1 records in 0+1 records out 204 bytes copied, 6.1332e-05 s, 3.3 MB/s Read after lseek into middle of last line (offset 140 in example below) generates expected end of last line and then unexpected whole last line once again # dd if=/proc/net/xt_recent/SSH bs=140 skip=1 dd: /proc/net/xt_recent/SSH: cannot skip to specified offset 127.0.0.3 ttl: 0 last_seen: 6275441953 oldest_pkt: 1 6275441953 src=127.0.0.3 ttl: 0 last_seen: 6275441953 oldest_pkt: 1 6275441953 0+1 records in 0+1 records out 132 bytes copied, 6.2487e-05 s, 2.1 MB/s Cc: stable@vger.kernel.org Fixes: 1f4aace6 ("fs/seq_file.c: simplify seq_file iteration code ...") Link: https://bugzilla.kernel.org/show_bug.cgi?id=206283 Signed-off-by:
Vasily Averin <vvs@virtuozzo.com> Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org>
-
Vasily Averin authored
If .next function does not change position index, following .show function will repeat output related to current position index. Cc: stable@vger.kernel.org Fixes: 1f4aace6 ("fs/seq_file.c: simplify seq_file iteration code ...") Link: https://bugzilla.kernel.org/show_bug.cgi?id=206283 Signed-off-by:
Vasily Averin <vvs@virtuozzo.com> Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org>
-
Vasily Averin authored
If .next function does not change position index, following .show function will repeat output related to current position index. Cc: stable@vger.kernel.org Fixes: 1f4aace6 ("fs/seq_file.c: simplify seq_file iteration code ...") Link: https://bugzilla.kernel.org/show_bug.cgi?id=206283 Signed-off-by:
Vasily Averin <vvs@virtuozzo.com> Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org>
-