- Dec 06, 2022
-
-
Sudheer Mogilappagari authored
Add netlink based support for "ethtool -x <dev> [context x]" command by implementing ETHTOOL_MSG_RSS_GET netlink message. This is equivalent to functionality provided via ETHTOOL_GRSSH in ioctl path. It sends RSS table, hash key and hash function of an interface to user space. This patch implements existing functionality available in ioctl path and enables addition of new RSS context based parameters in future. Signed-off-by:
Sudheer Mogilappagari <sudheer.mogilappagari@intel.com> Link: https://lore.kernel.org/r/20221202002555.241580-1-sudheer.mogilappagari@intel.com Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
- Dec 02, 2022
-
-
Jonathan Toppins authored
Correct xmit hash steps for layer3+4 as introduced by commit 49aefd13 ("bonding: do not discard lowest hash bit for non layer3+4 hashing"). Signed-off-by:
Jonathan Toppins <jtoppins@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Jonathan Toppins authored
With commit c1f897ce ("bonding: set default miimon value for non-arp modes if not set") the miimon default was changed from zero to 100 if arp_interval is also zero. Document this fact in bonding.rst. Fixes: c1f897ce ("bonding: set default miimon value for non-arp modes if not set") Signed-off-by:
Jonathan Toppins <jtoppins@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
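The defaulting rule this commit documents can be sketched as follows. This is a minimal illustration, not the bonding driver's code: the function name `effective_miimon` and the use of `None` to mean "miimon not set" are assumptions made for the example.

```python
from typing import Optional

DEFAULT_MIIMON_MS = 100  # default value named in the commit message


def effective_miimon(miimon: Optional[int], arp_interval: int) -> int:
    """Return the miimon value (in ms) under the rule from the commit message.

    `None` stands for "miimon was not set" (an assumption of this sketch).
    """
    if miimon is not None:
        return miimon              # an explicit setting always wins
    if arp_interval == 0:
        return DEFAULT_MIIMON_MS   # no ARP monitoring: default to MII monitoring
    return 0                       # ARP monitoring configured: miimon stays at zero
```

With nothing configured, `effective_miimon(None, 0)` yields 100; once `arp_interval` is non-zero the unset miimon stays at zero.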
-
- Dec 01, 2022
-
-
Vladimir Oltean authored
dpaa2_mac_is_type_fixed() is declared in a header but has no implementation and no callers, although it is still referenced from the documentation. It can be deleted. On the other hand, it would be useful to reuse the code between dpaa2_eth_is_type_phy() and dpaa2_switch_port_is_type_phy(). That common code should be called dpaa2_mac_is_type_phy(), so let's create that. The removal and the addition are merged into the same patch because, in fact, is_type_phy() is the logical opposite of is_type_fixed(). Signed-off-by:
Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by:
Andrew Lunn <andrew@lunn.ch> Reviewed-by:
Ioana Ciornei <ioana.ciornei@nxp.com> Tested-by:
Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by:
Paolo Abeni <pabeni@redhat.com>
-
Jacob Keller authored
Implement the .read handler for the NVM and Shadow RAM regions. This enables user space to read a small chunk of the flash without needing the overhead of creating a full snapshot. Update the documentation for ice to detail which regions have direct read support. Signed-off-by:
Jacob Keller <jacob.e.keller@intel.com> Acked-by:
Jakub Kicinski <kuba@kernel.org> Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
Jacob Keller authored
78ad87da ("ice: devlink: add shadow-ram region to snapshot Shadow RAM") added support for the 'shadow-ram' devlink region, but did not document it in the ice devlink documentation. Fix this. Signed-off-by:
Jacob Keller <jacob.e.keller@intel.com> Acked-by:
Jakub Kicinski <kuba@kernel.org> Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
Jacob Keller authored
To read from a region, user space must currently request a new snapshot of the region and then read from that snapshot. This can sometimes be overkill if user space only reads a tiny portion. They first create the snapshot, then request a read, then destroy the snapshot. For regions which have a single underlying "contents", it makes sense to allow supporting direct reading of the region data. Extend the DEVLINK_CMD_REGION_READ to allow direct reading from a region if requested via the new DEVLINK_ATTR_REGION_DIRECT. If this attribute is set, then perform a direct read instead of using a snapshot. Direct read is mutually exclusive with DEVLINK_ATTR_REGION_SNAPSHOT_ID, and care is taken to ensure that we reject commands which provide incorrect attributes. Regions must enable support for direct read by implementing the .read() callback function. If a region does not support such direct reads, a suitable extended error message is reported. Signed-off-by:
Jacob Keller <jacob.e.keller@intel.com> Reviewed-by:
Jiri Pirko <jiri@nvidia.com> Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
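The attribute checks described in this commit can be sketched as below. This is only an illustration of the stated rules (direct read and snapshot ID are mutually exclusive, and a direct read needs a `.read()` callback); the function name, parameters, and error strings are hypothetical and do not come from the devlink code.

```python
from typing import Optional


def choose_region_read_mode(direct: bool,
                            snapshot_id: Optional[int],
                            region_has_read_callback: bool) -> str:
    """Pick "direct" or "snapshot" for a region read, or reject the request.

    Error messages here are placeholders, not the kernel's extended error text.
    """
    if direct and snapshot_id is not None:
        # The commit states the two attributes are mutually exclusive.
        raise ValueError("direct read and snapshot id are mutually exclusive")
    if direct:
        if not region_has_read_callback:
            # Regions opt in to direct reads by implementing .read().
            raise ValueError("region does not support direct read")
        return "direct"
    if snapshot_id is None:
        raise ValueError("neither direct read nor snapshot id requested")
    return "snapshot"
```

A request that sets only the direct flag on a region with a read callback returns `"direct"`; supplying both the flag and a snapshot ID is rejected.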
-
- Nov 30, 2022
-
-
Rahul Rameshbabu authored
Improve general readability of the device driver documentation. Signed-off-by:
Rahul Rameshbabu <rrameshbabu@nvidia.com> Reviewed-by:
Tariq Toukan <tariqt@nvidia.com> Reviewed-by:
Gal Pressman <gal@nvidia.com> Signed-off-by:
Saeed Mahameed <saeedm@nvidia.com>
-
- Nov 24, 2022
-
-
Nir Levy authored
The documentation refers to an invalid web page under www.linuxfoundation.org. The patch points it to a working URL under wiki.linuxfoundation.org instead. Signed-off-by:
Nir Levy <bhr166@gmail.com> Link: https://lore.kernel.org/all/20221120220630.7443-1-bhr166@gmail.com/ Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
- Nov 23, 2022
-
-
Bagas Sanjaya authored
kernel test robot reported indentation warnings:

  Documentation/networking/devlink/devlink-port.rst:220: WARNING: Unexpected indentation.
  Documentation/networking/devlink/devlink-port.rst:222: WARNING: Block quote ends without a blank line; unexpected unindent.

These warnings cause the lists (the arbitration flow that the warnings point to, and the 3-step subfunction setup) to be rendered inline instead. Also, for the former list, automatic list numbering is messed up. Fix these warnings by adding the missing blank line padding. Link: https://lore.kernel.org/linux-doc/202211200926.kfOPiVti-lkp@intel.com/ Fixes: 242dd643 ("Documentation: Add documentation for new devlink-rate attributes") Reported-by:
kernel test robot <lkp@intel.com> Signed-off-by:
Bagas Sanjaya <bagasdotme@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Nov 18, 2022
-
-
Xin Long authored
This patch is to add sysctl net.sctp.l3mdev_accept to allow users to change the pernet global l3mdev_accept. Signed-off-by:
Xin Long <lucien.xin@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Michal Wilczynski authored
Provide documentation for newly introduced netlink attributes for devlink-rate: tx_priority and tx_weight. Mention the possibility to export tree from the driver. Signed-off-by:
Michal Wilczynski <michal.wilczynski@intel.com> Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
Michal Wilczynski authored
Add documentation to a newly added devlink-rate feature. Provide some examples on how to use the commands, which netlink attributes are supported and descriptions of the attributes. Signed-off-by:
Michal Wilczynski <michal.wilczynski@intel.com> Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
- Nov 16, 2022
-
-
Kuniyuki Iwashima authored
The maximum hash table size is 64K due to the nature of the protocol. [0] It's smaller than TCP, and fewer sockets can cause a performance drop. On an EC2 c5.24xlarge instance (192 GiB memory), after running iperf3 in different netns, creating 32Mi sockets without data transfer in the root netns causes regression for the iperf3's connection.

  uhash_entries  sockets  length  Gbps
            64K        1       1  5.69
                     1Mi      16  5.27
                     2Mi      32  4.90
                     4Mi      64  4.09
                     8Mi     128  2.96
                    16Mi     256  2.06
                    32Mi     512  1.12

The per-netns hash table breaks the lengthy lists into shorter ones. It is useful on a multi-tenant system with thousands of netns. With smaller hash tables, we can look up sockets faster, isolate noisy neighbours, and reduce lock contention. The max size of the per-netns table is 64K as well. This is because the possible hash range by udp_hashfn() always fits in 64K within the same netns and we cannot make full use of the whole buckets larger than 64K.

  /* 0 < num < 64K  ->  X < hash < X + 64K */
  (num + net_hash_mix(net)) & mask;

Also, the min size is 128. We use a bitmap to search for an available port in udp_lib_get_port(). To keep the bitmap on the stack and not fire the CONFIG_FRAME_WARN error at build time, we round up the table size to 128.
The sysctl usage is the same with TCP:

  $ dmesg | cut -d ' ' -f 6- | grep "UDP hash"
  UDP hash table entries: 65536 (order: 9, 2097152 bytes, vmalloc)

  # sysctl net.ipv4.udp_hash_entries
  net.ipv4.udp_hash_entries = 65536  # can be changed by uhash_entries

  # sysctl net.ipv4.udp_child_hash_entries
  net.ipv4.udp_child_hash_entries = 0  # disabled by default

  # ip netns add test1
  # ip netns exec test1 sysctl net.ipv4.udp_hash_entries
  net.ipv4.udp_hash_entries = -65536  # share the global table

  # sysctl -w net.ipv4.udp_child_hash_entries=100
  net.ipv4.udp_child_hash_entries = 100

  # ip netns add test2
  # ip netns exec test2 sysctl net.ipv4.udp_hash_entries
  net.ipv4.udp_hash_entries = 128  # own a per-netns table with 2^n buckets

We could optimise the hash table lookup/iteration further by removing the netns comparison for the per-netns one in the future. Also, we could optimise the sparse udp_hslot layout by putting it in udp_table.

[0]: https://lore.kernel.org/netdev/4ACC2815.7010101@gmail.com/

Signed-off-by:
Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
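The sizing and bucket arithmetic in this commit message can be sketched as follows. This is a rough model rather than the kernel implementation: the helper names are hypothetical, `netns_mix` stands in for `net_hash_mix(net)`, and rounding up to the next power of two is an assumption that is merely consistent with the `100 -> 128` example and the "2^n buckets" comment. A request of 0 (feature disabled) is assumed to be handled elsewhere.

```python
UDP_HTABLE_SIZE_MIN = 128      # minimum table size stated in the commit message
UDP_HTABLE_SIZE_MAX = 65536    # 64K maximum stated in the commit message


def round_table_size(requested: int) -> int:
    """Round a requested bucket count up to a power of two within [128, 64K]."""
    if requested <= 0:
        raise ValueError("requested size must be positive")
    size = 1 << (requested - 1).bit_length()   # next power of two >= requested
    return max(UDP_HTABLE_SIZE_MIN, min(size, UDP_HTABLE_SIZE_MAX))


def bucket_index(port: int, netns_mix: int, table_size: int) -> int:
    """Model of `(num + net_hash_mix(net)) & mask` with mask = table_size - 1."""
    return (port + netns_mix) & (table_size - 1)


def average_chain_length(sockets: int, table_size: int) -> float:
    """Average number of sockets per bucket, the "length" column in the table."""
    return sockets / table_size
```

For instance, `average_chain_length(32 << 20, 65536)` reproduces the length of 512 reported for 32Mi sockets in a 64K-bucket table.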
-
Walter Heymans authored
The NFP documentation is updated to include information about Corigine, and the new NFP3800 chips. The 'Acquiring Firmware' section is updated with new information about where to find firmware. Two new sections are added to expand the coverage of the documentation. The new sections include:

  - Devlink Info
  - Configure Device

Signed-off-by:
Walter Heymans <walter.heymans@corigine.com> Reviewed-by:
Niklas Söderlund <niklas.soderlund@corigine.com> Reviewed-by:
Louis Peens <louis.peens@corigine.com> Signed-off-by:
Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20221115090834.738645-1-simon.horman@corigine.com Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
- Nov 10, 2022
-
-
Ido Schimmel authored
Add packet traps for 802.1X operation. The "eapol" control trap is used to trap EAPOL packets and is required for the correct operation of the control plane. The "locked_port" drop trap can be enabled to gain visibility into packets that were dropped by the device due to the locked bridge port check. Signed-off-by:
Ido Schimmel <idosch@nvidia.com> Reviewed-by:
Petr Machata <petrm@nvidia.com> Signed-off-by:
Petr Machata <petrm@nvidia.com> Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
- Nov 08, 2022
-
-
Jakub Kicinski authored
The previous attempt to augment carrier_down (see Link) was not met with much enthusiasm so let's do the simple thing of exposing what some devices already maintain. Add a common ethtool statistic for link going down. Currently users have to maintain per-driver mapping to extract the right stat from the vendor-specific ethtool -S stats. carrier_down does not fit the bill because it counts a lot of software related false positives. Add the statistic to the extended link state API to steer vendors towards implementing all of it. Implement for bnxt and all Linux-controlled PHYs. mlx5 and (possibly) enic also have a counter for this but I leave the implementation to their maintainers. Link: https://lore.kernel.org/r/20220520004500.2250674-1-kuba@kernel.org Reviewed-by:
Florian Fainelli <f.fainelli@gmail.com> Reviewed-by:
Michael Chan <michael.chan@broadcom.com> Reviewed-by:
Andrew Lunn <andrew@lunn.ch> Signed-off-by:
Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20221104190125.684910-1-kuba@kernel.org Signed-off-by:
Paolo Abeni <pabeni@redhat.com>
-
- Nov 05, 2022
-
-
Veerasenareddy Burru authored
Add support for Octeon device CNF95N. CNF95N is an Octeon Fusion family product with the same PCI NIC characteristics as CN93, which is currently supported by the driver. Update the supported device list in Documentation. Signed-off-by:
Veerasenareddy Burru <vburru@marvell.com> Link: https://lore.kernel.org/r/20221103060600.1858-1-vburru@marvell.com Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
- Oct 28, 2022
-
-
Mubashir Adnan Qureshi authored
PLB (Protective Load Balancing) is a host-based mechanism for load balancing across switch links. It leverages congestion signals (e.g. ECN) from the transport layer to randomly change the path of the connection experiencing congestion. PLB changes the path of the connection by changing the outgoing IPv6 flow label for IPv6 connections (implemented in Linux by calling sk_rethink_txhash()). Because of this implementation mechanism, PLB can currently only work for IPv6 traffic. For more information, see the SIGCOMM 2022 paper: https://doi.org/10.1145/3544216.3544226 This commit adds new sysctl knobs and sets their default values for TCP PLB. Signed-off-by:
Mubashir Adnan Qureshi <mubashirq@google.com> Signed-off-by:
Yuchung Cheng <ycheng@google.com> Signed-off-by:
Neal Cardwell <ncardwell@google.com> Reviewed-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Oct 25, 2022
-
-
Amritha Nambiar authored
Add tc-queue-filters.rst with notes on TC filters for selecting a set of queues and/or a queue. Signed-off-by:
Amritha Nambiar <amritha.nambiar@intel.com> Signed-off-by:
Paolo Abeni <pabeni@redhat.com>
-
- Oct 19, 2022
-
-
Daniel S. Trevitz authored
Add documentation for how to use and setup the switchable termination resistor support for CAN controllers. Signed-off-by:
Daniel Trevitz <dan@sstrev.com> Link: https://lore.kernel.org/all/3441354.44csPzL39Z@daniel6430 Signed-off-by:
Marc Kleine-Budde <mkl@pengutronix.de>
-
- Oct 11, 2022
-
-
Jason A. Donenfeld authored
The prandom_u32() function has been a deprecated inline wrapper around get_random_u32() for several releases now, and compiles down to the exact same code. Replace the deprecated wrapper with a direct call to the real function. The same also applies to get_random_int(), which is just a wrapper around get_random_u32(). This was done as a basic find and replace. Reviewed-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by:
Kees Cook <keescook@chromium.org> Reviewed-by:
Yury Norov <yury.norov@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd Acked-by:
Jakub Kicinski <kuba@kernel.org> Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Acked-by: Helge Deller <deller@gmx.de> # for parisc Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Signed-off-by:
Jason A. Donenfeld <Jason@zx2c4.com>
-
- Oct 06, 2022
-
-
Casper Andersson authored
Missing space between "pins'" and "strength" Signed-off-by:
Casper Andersson <casper.casan@gmail.com> Reviewed-by:
Bagas Sanjaya <bagasdotme@gmail.com> Link: https://lore.kernel.org/r/20221004073242.304425-1-casper.casan@gmail.com Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
- Oct 04, 2022
-
-
Oleksij Rempel authored
Add interface to support Power Sourcing Equipment. At the current step it provides a generic way to address all variants of PSE devices as defined in IEEE 802.3-2018, but supports only the objects specified for IEEE 802.3-2018 104.4 PoDL Power Sourcing Equipment (PSE). Currently supported and mandatory objects are:

  IEEE 802.3-2018 30.15.1.1.3 aPoDLPSEPowerDetectionStatus
  IEEE 802.3-2018 30.15.1.1.2 aPoDLPSEAdminState
  IEEE 802.3-2018 30.15.1.2.1 acPoDLPSEAdminControl

This is the minimal interface needed to control PSE on each separate Ethernet port, but it does not provide all mandatory objects specified in IEEE 802.3-2018. Since "PoDL PSE" and "PSE" have similar names but some different values, I decided not to merge them and to keep a separate naming scheme. This should allow us to stay as close to the IEEE 802.3 spec as possible and avoid name conflicts in the future. This implementation is connected to PHYs instead of MACs because PSE auto classification can potentially interfere with PHY auto negotiation, so some extra PHY-related initialization may be needed. With a WIP version of ethtool, interaction with a PSE-capable link looks as follows:

  $ ip l
  ...
  5: t1l1@eth0: <BROADCAST,MULTICAST> ..
  ...

  $ ethtool --show-pse t1l1
  PSE attributs for t1l1:
  PoDL PSE Admin State: disabled
  PoDL PSE Power Detection Status: disabled

  $ ethtool --set-pse t1l1 podl-pse-admin-control enable
  $ ethtool --show-pse t1l1
  PSE attributs for t1l1:
  PoDL PSE Admin State: enabled
  PoDL PSE Power Detection Status: delivering power

Signed-off-by:
kernel test robot <lkp@intel.com> Signed-off-by:
Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by:
Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by:
Andrew Lunn <andrew@lunn.ch> Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
- Sep 23, 2022
-
-
Sean Anderson authored
This adds support for rate matching (also known as rate adaptation) to the phy subsystem. The general idea is that the phy interface runs at one speed, and the MAC throttles the rate at which it sends packets to the link speed. There's a good overview of several techniques for achieving this at [1]. This patch adds support for three: pause-frame based (such as in Aquantia phys), CRS-based (such as in 10PASS-TS and 2BASE-TL), and open-loop-based (such as in 10GBASE-W). This patch makes a few assumptions and a few non-assumptions about the types of rate matching available. First, it assumes that different phys may use different forms of rate matching. Second, it assumes that phys can use rate matching for any of their supported link speeds (e.g. if a phy supports 10BASE-T and XGMII, then it can adapt XGMII to 10BASE-T). Third, it does not assume that all interface modes will use the same form of rate matching. Fourth, it does not assume that all phy devices will support rate matching (even if some do). Relaxing or strengthening these (non-)assumptions could result in a different API. For example, if all interface modes were assumed to use the same form of rate matching, then a bitmask of interface modes supporting rate matching would suffice. For some better visibility into the process, the current rate matching mode is exposed as part of the ethtool ksettings. For the moment, only read access is supported. I'm not sure what userspace might want to configure yet (disable it altogether, disable just one mode, specify the mode to use, etc.). For the moment, since only pause-based rate adaptation support is added in the next few commits, rate matching can be disabled altogether by adjusting the advertisement. 802.3 calls this feature "rate adaptation" in clause 49 (10GBASE-R) and "rate matching" in clause 61 (10PASS-TS and 2BASE-TL). Aquantia also calls this feature "rate adaptation".
I chose "rate matching" because it is shorter, and because Russell doesn't think "adaptation" is correct in this context. Signed-off-by:
Sean Anderson <sean.anderson@seco.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
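The general idea stated in this commit (the interface runs at one speed while the MAC throttles to the link speed) implies a simple ratio, sketched below. This is only back-of-the-envelope arithmetic under the assumption of an ideal throttle with no framing or pause-frame overhead; the function name is hypothetical and nothing here reflects the phylib API.

```python
def mac_transmit_fraction(interface_mbps: int, link_mbps: int) -> float:
    """Fraction of interface time the MAC may spend sending, in an ideal model.

    Assumes the MAC must average out to the link speed while the phy
    interface itself always signals at `interface_mbps`.
    """
    if interface_mbps <= 0 or link_mbps <= 0:
        raise ValueError("speeds must be positive")
    if link_mbps > interface_mbps:
        raise ValueError("link cannot be faster than the phy interface")
    return link_mbps / interface_mbps
```

Taking the commit's own example of adapting XGMII (10 Gb/s) to 10BASE-T (10 Mb/s), `mac_transmit_fraction(10_000, 10)` gives a fraction of one thousandth in this idealised model.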
-
- Sep 22, 2022
-
-
Tony Lu authored
Currently, SMC uses smc->sk.sk_{rcv|snd}buf to create buffers for send buffer and RMB. And the values of buffer size are from tcp_{w|r}mem in clcsock. The buffer size from the TCP socket doesn't fit SMC well. Generally, buffers for SMC-R/-D are larger than for TCP to get higher performance, because the underlying devices and paths are different. So this patch unbinds the buffer size from TCP, and introduces two sysctl knobs to tune them independently. Also, these knobs are per net namespace and work for containers. Signed-off-by:
Tony Lu <tonylu@linux.alibaba.com> Signed-off-by:
Paolo Abeni <pabeni@redhat.com>
-
Wen Gu authored
SMC-R tests the viability of a link by sending TEST_LINK LLC messages over the RoCE fabric when connections on the link have been idle for longer than the keepalive interval (testlink time). But using tcp_keepalive_time as the testlink time may not be suitable, because its default is no less than two hours [1], which is too long for a single link to detect a dead peer. The active host will keep sending messages over the peer-dead link (QP), and will not find out until it gets IB_WC_RETRY_EXC_ERR error CQEs, which normally takes more time than the TEST_LINK timeout (SMC_LLC_WAIT_TIME). So this patch introduces an independent sysctl for SMC-R to set the link keepalive time, in order to detect link down in time. The default value is 30 seconds. [1] https://www.rfc-editor.org/rfc/rfc1122#page-101 Signed-off-by:
Wen Gu <guwen@linux.alibaba.com> Signed-off-by:
Paolo Abeni <pabeni@redhat.com>
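The effect of the shorter testlink time can be illustrated with a small sketch. It only models the idle-time comparison described in the commit message; the function and constant names are hypothetical, and the two-hour figure is the TCP keepalive default the commit refers to.

```python
SMCR_TESTLINK_TIME_DEFAULT_S = 30          # new default from the commit message
TCP_KEEPALIVE_TIME_DEFAULT_S = 2 * 60 * 60  # two hours, the previous source of the value


def should_send_test_link(idle_seconds: float,
                          testlink_time_s: float = SMCR_TESTLINK_TIME_DEFAULT_S) -> bool:
    """True once the link has been idle for longer than the testlink time."""
    return idle_seconds > testlink_time_s
```

A link idle for ten minutes would be probed under the 30-second default, whereas `should_send_test_link(600, TCP_KEEPALIVE_TIME_DEFAULT_S)` is still false.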
-
- Sep 21, 2022
-
-
Edward Cree authored
There's no clear explanation of what VF Representors are for, their semantics, etc., outside of vendor docs and random conference slides. Add a document explaining Representors and defining what drivers that implement them are expected to do. Signed-off-by:
Edward Cree <ecree.xilinx@gmail.com> Reviewed-by:
Bagas Sanjaya <bagasdotme@gmail.com> Link: https://lore.kernel.org/r/20220905135557.39233-1-ecree@xilinx.com Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
- Sep 20, 2022
-
-
Pablo Neira Ayuso authored
This toggle has already been removed by b1185090 ("netfilter: remove nf_conntrack_helper sysctl and modparam toggles"). Remove the documentation entry for this toggle too. Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by:
Florian Westphal <fw@strlen.de>
-
Kuniyuki Iwashima authored
The more sockets we have in the hash table, the longer we spend looking up the socket. While running a number of small workloads on the same host, they penalise each other and cause performance degradation. The root cause might be a single workload that consumes much more resources than the others. It often happens on a cloud service where different workloads share the same computing resource. On EC2 c5.24xlarge instance (196 GiB memory and 524288 (1Mi / 2) ehash entries), after running iperf3 in different netns, creating 24Mi sockets without data transfer in the root netns causes about 10% performance regression for the iperf3's connection.

  thash_entries  sockets  length  Gbps
         524288        1       1  50.7
                    24Mi      48  45.1

It is basically related to the length of the list of each hash bucket. For testing purposes to see how performance drops along the length, I set 131072 (1Mi / 8) to thash_entries, and here's the result.

  thash_entries  sockets  length  Gbps
         131072        1       1  50.7
                     1Mi       8  49.9
                     2Mi      16  48.9
                     4Mi      32  47.3
                     8Mi      64  44.6
                    16Mi     128  40.6
                    24Mi     192  36.3
                    32Mi     256  32.5
                    40Mi     320  27.0
                    48Mi     384  25.0

To resolve the socket lookup degradation, we introduce an optional per-netns hash table for TCP, but it's just ehash, and we still share the global bhash, bhash2 and lhash2. With a smaller ehash, we can look up non-listener sockets faster and isolate such noisy neighbours. In addition, we can reduce lock contention. We can control the ehash size by a new sysctl knob. However, depending on workloads, it will require very sensitive tuning, so we disable the feature by default (net.ipv4.tcp_child_ehash_entries == 0). Moreover, we can fall back to using the global ehash in case we fail to allocate enough memory for a new ehash. The maximum size is 16Mi, which is large enough that even if we have 48Mi sockets, the average list length is 3, and regression would be less than 1%. We can check the current ehash size by another read-only sysctl knob, net.ipv4.tcp_ehash_entries.
A negative value means the netns shares the global ehash (per-netns ehash is disabled or failed to allocate memory).

  # dmesg | cut -d ' ' -f 5- | grep "established hash"
  TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage)

  # sysctl net.ipv4.tcp_ehash_entries
  net.ipv4.tcp_ehash_entries = 524288  # can be changed by thash_entries

  # sysctl net.ipv4.tcp_child_ehash_entries
  net.ipv4.tcp_child_ehash_entries = 0  # disabled by default

  # ip netns add test1
  # ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries
  net.ipv4.tcp_ehash_entries = -524288  # share the global ehash

  # sysctl -w net.ipv4.tcp_child_ehash_entries=100
  net.ipv4.tcp_child_ehash_entries = 100

  # ip netns add test2
  # ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries
  net.ipv4.tcp_ehash_entries = 128  # own a per-netns ehash with 2^n buckets

When more than two processes in the same netns create per-netns ehash concurrently with different sizes, we need to guarantee the size in one of the following ways:

  1) Share the global ehash and create per-netns ehash

     First, unshare() with tcp_child_ehash_entries==0. It creates dedicated netns sysctl knobs where we can safely change tcp_child_ehash_entries and clone()/unshare() to create a per-netns ehash.

  2) Control write on sysctl by BPF

     We can use BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny read/write on sysctl knobs.

Note that the global ehash allocated at the boot time is spread over available NUMA nodes, but inet_pernet_hashinfo_alloc() will allocate pages for each per-netns ehash depending on the current process's NUMA policy. By default, the allocation is done in the local node only, so the per-netns hash table could fully reside on a random node. Thus, depending on the NUMA policy the netns is created with and the CPU the current thread is running on, we could see some performance differences for highly optimised networking applications.
Note also that the default values of two sysctl knobs depend on the ehash size and should be tuned carefully:

  tcp_max_tw_buckets  : tcp_child_ehash_entries / 2
  tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128)

As a bonus, we can dismantle netns faster. Currently, while destroying netns, we call inet_twsk_purge(), which walks through the global ehash. It can be potentially big because it can have many sockets other than TIME_WAIT in all netns. Splitting ehash changes that situation, where it's only necessary for inet_twsk_purge() to clean up TIME_WAIT sockets in each netns. With regard to this, we do not free the per-netns ehash in inet_twsk_kill() to avoid UAF while iterating the per-netns ehash in inet_twsk_purge(). Instead, we do it in tcp_sk_exit_batch() after calling tcp_twsk_purge() to keep it protocol-family-independent. In the future, we could optimise ehash lookup/iteration further by removing netns comparison for the per-netns ehash. Signed-off-by:
Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
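The figures quoted in this commit message follow from two small formulas, sketched here for illustration. The helper names are hypothetical; the formulas themselves (sockets divided by ehash entries, and the two dependent defaults) are taken from the message, with integer division assumed for the defaults.

```python
from typing import Tuple

MI = 1 << 20  # "Mi" as used in the commit message


def average_list_length(sockets: int, ehash_entries: int) -> float:
    """Average sockets per ehash bucket, the "length" column in the tables."""
    return sockets / ehash_entries


def dependent_defaults(child_ehash_entries: int) -> Tuple[int, int]:
    """Return (tcp_max_tw_buckets, tcp_max_syn_backlog) defaults for a per-netns ehash."""
    max_tw_buckets = child_ehash_entries // 2
    max_syn_backlog = max(128, child_ehash_entries // 128)
    return max_tw_buckets, max_syn_backlog
```

For example, `average_list_length(48 * MI, 16 * MI)` gives the average length of 3 cited for the 16Mi maximum, and `dependent_defaults(128)` shows how small both dependent defaults become for a 128-bucket per-netns ehash.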
-
Vladimir Oltean authored
DSA now supports multiple CPU ports, explain the use cases that are covered, the new UAPI, the permitted degrees of freedom, the driver API, and remove some old "hanging fruits". Signed-off-by:
Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by:
Paolo Abeni <pabeni@redhat.com>
-
- Sep 13, 2022
-
-
Matthieu Baerts authored
When looking at the rendered HTML version, we can see 'pm_type' is not displayed with a bold font: https://docs.kernel.org/5.19/networking/mptcp-sysctl.html The empty line under 'pm_type' is then removed to have the same style as the others. Fixes: 6bb63ccc ("mptcp: Add a per-namespace sysctl to set the default path manager type") Signed-off-by:
Matthieu Baerts <matthieu.baerts@tessares.net> Link: https://lore.kernel.org/r/20220906180404.1255873-2-matthieu.baerts@tessares.net Signed-off-by:
Paolo Abeni <pabeni@redhat.com>
-
- Sep 06, 2022
-
-
Dario Binacchi authored
The Amarula contact info email address is wrong, so fix it up to use the correct one. Signed-off-by:
Dario Binacchi <dario.binacchi@amarulasolutions.com> Link: https://lore.kernel.org/all/20220828134442.794990-1-dario.binacchi@amarulasolutions.com Signed-off-by:
Marc Kleine-Budde <mkl@pengutronix.de>
-
- Sep 05, 2022
-
-
Sean Anderson authored
Add 1000BASE-KX interface mode. This is 1G backplane Ethernet as described in clause 70. Clause 73 autonegotiation is mandatory, and only full duplex operation is supported. Although at the PMA level this interface mode is identical to 1000BASE-X, it uses a different form of in-band autonegotiation. This justifies a separate interface mode, since the interface mode (along with the MLO_AN_* autonegotiation mode) sets the type of autonegotiation which will be used on a link. This results in more than just electrical differences between the link modes. With regard to 1000BASE-X, 1000BASE-KX holds a similar position to SGMII: same signaling, but different autonegotiation. PCS drivers (which typically handle in-band autonegotiation) may only support 1000BASE-X, and not 1000BASE-KX. Similarly, the phy mode is used to configure serdes phys with phy_set_mode_ext. Due to the different electrical standards (SFI or XFI vs Clause 70), they will likely want to use different configuration. Adding a phy interface mode for 1000BASE-KX helps simplify configuration in these areas. Signed-off-by:
Sean Anderson <sean.anderson@seco.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Sep 01, 2022
-
-
David Howells authored
Remove rxrpc_get_reply_time() as that is no longer used now that the call issue time is used instead of the reply time. Signed-off-by:
David Howells <dhowells@redhat.com>
-
Eric Dumazet authored
Because per-host rate limiting has proven problematic (side channel attacks can be based on it), per-host rate limiting of challenge acks should ideally be per netns and turned off by default. This is a long-overdue follow-up to the following commits: 083ae308 ("tcp: enable per-socket rate limiting of all 'challenge acks'") f2b2c582 ("tcp: mitigate ACK loops for connections as tcp_sock") 75ff39cc ("tcp: make challenge acks less predictable") Signed-off-by:
Eric Dumazet <edumazet@google.com> Cc: Jason Baron <jbaron@akamai.com> Acked-by:
Neal Cardwell <ncardwell@google.com> Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
- Aug 31, 2022
-
-
Randy Dunlap authored
Change occurrences of "it's" that are possessive to "its" so that they don't read as "it is". Signed-off-by:
Randy Dunlap <rdunlap@infradead.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20220829235414.17110-1-rdunlap@infradead.org Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
Fernando Fernandez Mancera authored
The tlb_dynamic_lb bonding option is compatible with both balance-tlb and balance-alb modes. To be consistent with the documentation of other options, it should mention both modes, not only balance-tlb. Signed-off-by:
Fernando Fernandez Mancera <ffmancera@riseup.net> Acked-by:
Jay Vosburgh <jay.vosburgh@canonical.com> Link: https://lore.kernel.org/r/20220826154738.4039-1-ffmancera@riseup.net Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-
- Aug 30, 2022
-
-
Mengyuan Lou authored
Add build options and guidance doc. Initialize pci device access for Wangxun Gigabit Ethernet devices. Reviewed-by:
Andrew Lunn <andrew@lunn.ch> Signed-off-by:
Mengyuan Lou <mengyuanlou@net-swift.com> Link: https://lore.kernel.org/r/20220826034609.51854-1-mengyuanlou@net-swift.com Signed-off-by:
Paolo Abeni <pabeni@redhat.com>
-
- Aug 25, 2022
-
-
Jiri Pirko authored
As all callbacks are converted now, fix the text reflecting that change. Suggested-by:
Jakub Kicinski <kuba@kernel.org> Signed-off-by:
Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20220823070213.1008956-1-jiri@resnulli.us Signed-off-by:
Jakub Kicinski <kuba@kernel.org>
-