  1. Dec 06, 2022
  2. Dec 02, 2022
  3. Dec 01, 2022
  4. Nov 30, 2022
  5. Nov 24, 2022
  6. Nov 23, 2022
  7. Nov 18, 2022
  8. Nov 16, 2022
    • udp: Introduce optional per-netns hash table. · 9804985b
      Kuniyuki Iwashima authored
      The maximum hash table size is 64K due to the nature of the protocol. [0]
      It's smaller than TCP, and fewer sockets can cause a performance drop.
      
      On an EC2 c5.24xlarge instance (192 GiB memory), after running iperf3 in
      different netns, creating 32Mi sockets without data transfer in the root
      netns causes regression for the iperf3's connection.
      
        uhash_entries		sockets		length		Gbps
      	    64K		      1		     1		5.69
      			    1Mi		    16		5.27
      			    2Mi		    32		4.90
      			    4Mi		    64		4.09
      			    8Mi		   128		2.96
      			   16Mi		   256		2.06
      			   32Mi		   512		1.12
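
      The "length" column above is consistent with the average number of
      sockets per bucket. A minimal sketch of that arithmetic, assuming
      sockets spread evenly over the buckets (the function name is
      hypothetical, not a kernel symbol):

```python
KI = 1 << 10
MI = 1 << 20


def average_chain_length(sockets: int, buckets: int) -> float:
    """Average number of sockets per bucket, assuming an even spread."""
    return sockets / buckets


if __name__ == "__main__":
    # Reproduce the "length" column for a 64K-bucket table.
    for sockets in (1 * MI, 2 * MI, 4 * MI, 8 * MI, 16 * MI, 32 * MI):
        length = average_chain_length(sockets, 64 * KI)
        print(f"{sockets // MI:>3}Mi sockets -> length {length:.0f}")
```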
      
      The per-netns hash table breaks the lengthy lists into shorter ones.  It is
      useful on a multi-tenant system with thousands of netns.  With smaller hash
      tables, we can look up sockets faster, isolate noisy neighbours, and reduce
      lock contention.
      
      The maximum size of the per-netns table is also 64K.  Within a single
      netns, the hashes produced by udp_hashfn() always fall within a range of
      64K values, so a table with more than 64K buckets could never use all
      of them.
      
        /* 0 < num < 64K  ->  X < hash < X + 64K */
        (num + net_hash_mix(net)) & mask;
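
      The bound above can be checked with a small model of the expression.
      This is only a sketch: hash_mix is a hypothetical stand-in for
      net_hash_mix(net), and the port range is taken as every 16-bit value.

```python
PORT_SPACE = 1 << 16  # UDP port numbers are 16 bits wide


def used_buckets(table_size: int, hash_mix: int) -> int:
    """Count the distinct buckets hit by every port number in one netns.

    table_size must be a power of two; hash_mix models the per-netns
    constant added before masking.
    """
    mask = table_size - 1
    return len({(num + hash_mix) & mask for num in range(PORT_SPACE)})


if __name__ == "__main__":
    mix = 0x12345678  # arbitrary example value
    for size in (128, 1 << 16, 1 << 17):
        print(f"table size {size:>6}: {used_buckets(size, mix)} buckets used")
```

      In this model a 128K-bucket table still touches only 64K buckets,
      which matches the reasoning for capping the table at 64K.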
      
      Also, the min size is 128.  We use a bitmap to search for an available
      port in udp_lib_get_port().  To keep the bitmap on the stack and not
      fire the CONFIG_FRAME_WARN error at build time, we round up the table
      size to 128.
      
      The sysctl usage is the same as for TCP:
      
        $ dmesg | cut -d ' ' -f 6- | grep "UDP hash"
        UDP hash table entries: 65536 (order: 9, 2097152 bytes, vmalloc)
      
        # sysctl net.ipv4.udp_hash_entries
        net.ipv4.udp_hash_entries = 65536  # can be changed by uhash_entries
      
        # sysctl net.ipv4.udp_child_hash_entries
        net.ipv4.udp_child_hash_entries = 0  # disabled by default
      
        # ip netns add test1
        # ip netns exec test1 sysctl net.ipv4.udp_hash_entries
        net.ipv4.udp_hash_entries = -65536  # share the global table
      
        # sysctl -w net.ipv4.udp_child_hash_entries=100
        net.ipv4.udp_child_hash_entries = 100
      
        # ip netns add test2
        # ip netns exec test2 sysctl net.ipv4.udp_hash_entries
        net.ipv4.udp_hash_entries = 128  # own a per-netns table with 2^n buckets
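
      The 100 -> 128 result above follows from the power-of-two bucket
      count and the 128 minimum. A rough model, assuming the requested value
      is rounded up and then limited to the stated bounds (the function name
      and the clamping of oversized values are assumptions; the kernel may
      reject out-of-range writes instead):

```python
UDP_TABLE_MIN = 128        # minimum per-netns size stated in the message
UDP_TABLE_MAX = 64 * 1024  # maximum size stated in the message


def per_netns_table_size(requested: int) -> int:
    """Return 0 when the feature is disabled, else a power-of-two size."""
    if requested <= 0:
        return 0  # models "share the global table"
    size = 1 << (requested - 1).bit_length()  # round up to a power of two
    return max(UDP_TABLE_MIN, min(size, UDP_TABLE_MAX))


if __name__ == "__main__":
    for requested in (0, 1, 100, 129, 65536):
        print(requested, "->", per_netns_table_size(requested))
```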
      
      We could optimise the hash table lookup/iteration further by removing
      the netns comparison for the per-netns one in the future.  Also, we
      could optimise the sparse udp_hslot layout by putting it in udp_table.
      
      [0]: https://lore.kernel.org/netdev/4ACC2815.7010101@gmail.com/
      
      
      
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Documentation: nfp: update documentation · 1ec6360d
      Walter Heymans authored
      
      
      The NFP documentation is updated to include information about Corigine,
      and the new NFP3800 chips. The 'Acquiring Firmware' section is updated
      with new information about where to find firmware.
      
      Two new sections are added to expand the coverage of the documentation.
      The new sections include:
      - Devlink Info
      - Configure Device
      
      Signed-off-by: Walter Heymans <walter.heymans@corigine.com>
      Reviewed-by: Niklas Söderlund <niklas.soderlund@corigine.com>
      Reviewed-by: Louis Peens <louis.peens@corigine.com>
      Signed-off-by: Simon Horman <simon.horman@corigine.com>
      Link: https://lore.kernel.org/r/20221115090834.738645-1-simon.horman@corigine.com
      
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  9. Nov 10, 2022
  10. Nov 08, 2022
  11. Nov 05, 2022
  12. Oct 28, 2022
  13. Oct 25, 2022
  14. Oct 19, 2022
  15. Oct 11, 2022
  16. Oct 06, 2022
  17. Oct 04, 2022
    • ethtool: add interface to interact with Ethernet Power Equipment · 18ff0bcd
      Oleksij Rempel authored
      
      
      Add an interface to support Power Sourcing Equipment. At this stage it
      provides a generic way to address all variants of PSE devices defined
      in IEEE 802.3-2018, but supports only the objects specified for IEEE
      802.3-2018 104.4 PoDL Power Sourcing Equipment (PSE).
      
      Currently supported and mandatory objects are:
      IEEE 802.3-2018 30.15.1.1.3 aPoDLPSEPowerDetectionStatus
      IEEE 802.3-2018 30.15.1.1.2 aPoDLPSEAdminState
      IEEE 802.3-2018 30.15.1.2.1 acPoDLPSEAdminControl
      
      This is the minimal interface needed to control PSE on each separate
      Ethernet port, but it does not provide all mandatory objects specified
      in IEEE 802.3-2018.
      
      Since "PoDL PSE" and "PSE" have similar names but some different values,
      I decided not to merge them and to keep separate naming schemes. This
      should allow us to stay as close to the IEEE 802.3 spec as possible and
      avoid name conflicts in the future.
      
      This implementation is connected to PHYs instead of MACs because PSE
      auto classification can potentially interfere with PHY auto negotiation,
      so some extra PHY-related initialization may be needed.
      
      With a WIP version of ethtool, interaction with a PSE-capable link looks
      as follows:
      
      $ ip l
      ...
      5: t1l1@eth0: <BROADCAST,MULTICAST> ..
      ...
      
      $ ethtool --show-pse t1l1
      PSE attributs for t1l1:
      PoDL PSE Admin State: disabled
      PoDL PSE Power Detection Status: disabled
      
      $ ethtool --set-pse t1l1 podl-pse-admin-control enable
      $ ethtool --show-pse t1l1
      PSE attributs for t1l1:
      PoDL PSE Admin State: enabled
      PoDL PSE Power Detection Status: delivering power
      
      Signed-off-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
      Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  18. Sep 23, 2022
    • net: phy: Add support for rate matching · 0c3e10cb
      Sean Anderson authored
      
      
      This adds support for rate matching (also known as rate adaptation) to
      the phy subsystem. The general idea is that the phy interface runs at
      one speed, and the MAC throttles the rate at which it sends packets to
      the link speed. There's a good overview of several techniques for
      achieving this at [1]. This patch adds support for three: pause-frame
      based (such as in Aquantia phys), CRS-based (such as in 10PASS-TS and
      2BASE-TL), and open-loop-based (such as in 10GBASE-W).
      
      This patch makes a few assumptions and a few non assumptions about the
      types of rate matching available. First, it assumes that different phys
      may use different forms of rate matching. Second, it assumes that phys
      can use rate matching for any of their supported link speeds (e.g. if a
      phy supports 10BASE-T and XGMII, then it can adapt XGMII to 10BASE-T).
      Third, it does not assume that all interface modes will use the same
      form of rate matching. Fourth, it does not assume that all phy devices
      will support rate matching (even if some do). Relaxing or strengthening
      these (non-)assumptions could result in a different API. For example, if
      all interface modes were assumed to use the same form of rate matching,
      then a bitmask of interface modes supporting rate matching would
      suffice.
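
      The third (non-)assumption can be pictured as a per-interface-mode
      lookup rather than a single bitmask. The following is a toy model
      only: the enum, the example PHY table and the function are
      hypothetical illustrations, not the API this patch adds.

```python
from enum import Enum


class RateMatch(Enum):
    """The forms of rate matching named in the message, plus 'none'."""
    NONE = "none"
    PAUSE = "pause"
    CRS = "crs"
    OPEN_LOOP = "open-loop"


# Hypothetical PHY: pause-based matching on XGMII, nothing on SGMII.
EXAMPLE_PHY = {
    "xgmii": RateMatch.PAUSE,
    "sgmii": RateMatch.NONE,
}


def get_rate_matching(phy: dict, interface: str) -> RateMatch:
    """Look up the form of rate matching a PHY uses for one interface mode."""
    return phy.get(interface, RateMatch.NONE)


if __name__ == "__main__":
    for mode in ("xgmii", "sgmii", "rgmii"):
        print(mode, "->", get_rate_matching(EXAMPLE_PHY, mode).value)
```

      A plain bitmask of interface modes could only answer "does this mode
      use rate matching"; the mapping also records which form applies.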
      
      For some better visibility into the process, the current rate matching
      mode is exposed as part of the ethtool ksettings. For the moment, only
      read access is supported. I'm not sure what userspace might want to
      configure yet (disable it altogether, disable just one mode, specify the
      mode to use, etc.). For the moment, since only pause-based rate
      adaptation support is added in the next few commits, rate matching can
      be disabled altogether by adjusting the advertisement.
      
      802.3 calls this feature "rate adaptation" in clause 49 (10GBASE-R) and
      "rate matching" in clause 61 (10PASS-TS and 2BASE-TL). Aquantia also calls
      this feature "rate adaptation". I chose "rate matching" because it is
      shorter, and because Russell doesn't think "adaptation" is correct in this
      context.
      
      Signed-off-by: Sean Anderson <sean.anderson@seco.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. Sep 22, 2022
    • net/smc: Unbind r/w buffer size from clcsock and make them tunable · 0227f058
      Tony Lu authored
      
      
      Currently, SMC uses smc->sk.sk_{rcv|snd}buf to size the send buffer
      and the RMB, and those values come from tcp_{w|r}mem in clcsock.

      The buffer sizes of the TCP socket do not fit SMC well. SMC-R/-D
      generally needs larger buffers than TCP to get higher performance,
      because it uses different underlying devices and paths.
      
      So this patch unbinds buffer size from TCP, and introduces two sysctl
      knobs to tune them independently. Also, these knobs are per net
      namespace and work for containers.
      
      Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • net/smc: Introduce a specific sysctl for TEST_LINK time · 77eee325
      Wen Gu authored
      SMC-R tests the viability of a link by sending TEST_LINK LLC
      messages over the RoCE fabric when connections on the link have been
      idle for longer than the keepalive interval (testlink time).

      But tcp_keepalive_time may not be suitable as the testlink time
      because its default is no less than two hours [1], which is too
      long for a single link to detect a dead peer. The active host
      will keep sending messages on the peer-dead link (QP) and cannot
      find out until it gets IB_WC_RETRY_EXC_ERR error CQEs, which normally
      takes more time than the TEST_LINK timeout (SMC_LLC_WAIT_TIME).

      So this patch introduces an independent sysctl for SMC-R to set the
      link keepalive time, in order to detect link down in time. The
      default value is 30 seconds.
      
      [1] https://www.rfc-editor.org/rfc/rfc1122#page-101
      
      
      
      Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  20. Sep 21, 2022
  21. Sep 20, 2022
    • netfilter: conntrack: remove nf_conntrack_helper documentation · 76b907ee
      Pablo Neira Ayuso authored
      
      
      This toggle has already been removed by b1185090 ("netfilter: remove
      nf_conntrack_helper sysctl and modparam toggles").
      
      Remove the documentation entry for this toggle too.
      
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Florian Westphal <fw@strlen.de>
    • tcp: Introduce optional per-netns ehash. · d1e5e640
      Kuniyuki Iwashima authored
      
      
      The more sockets we have in the hash table, the longer we spend looking
      up the socket.  While running a number of small workloads on the same
      host, they penalise each other and cause performance degradation.
      
      The root cause might be a single workload that consumes much more
      resources than the others.  It often happens on a cloud service where
      different workloads share the same computing resource.
      
      On an EC2 c5.24xlarge instance (192 GiB memory and 524288 (1Mi / 2) ehash
      entries), after running iperf3 in different netns, creating 24Mi sockets
      without data transfer in the root netns causes about 10% performance
      regression for the iperf3's connection.
      
       thash_entries		sockets		length		Gbps
      	524288		      1		     1		50.7
      			   24Mi		    48		45.1
      
      It is basically related to the length of the list of each hash bucket.
      For testing purposes to see how performance drops along the length,
      I set 131072 (1Mi / 8) to thash_entries, and here's the result.
      
       thash_entries		sockets		length		Gbps
              131072		      1		     1		50.7
      			    1Mi		     8		49.9
      			    2Mi		    16		48.9
      			    4Mi		    32		47.3
      			    8Mi		    64		44.6
      			   16Mi		   128		40.6
      			   24Mi		   192		36.3
      			   32Mi		   256		32.5
      			   40Mi		   320		27.0
      			   48Mi		   384		25.0
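
      The regression figures can be read off the table as a relative drop
      against the single-socket baseline. A small sketch of that arithmetic,
      using only the numbers reported above (the function name is
      hypothetical):

```python
BASELINE_GBPS = 50.7  # throughput with a single socket in the table

# (average list length, measured Gbps) with thash_entries = 131072
RESULTS = [
    (8, 49.9), (16, 48.9), (32, 47.3), (64, 44.6), (128, 40.6),
    (192, 36.3), (256, 32.5), (320, 27.0), (384, 25.0),
]


def relative_drop(baseline_gbps: float, measured_gbps: float) -> float:
    """Fraction of throughput lost compared with the baseline."""
    return (baseline_gbps - measured_gbps) / baseline_gbps


if __name__ == "__main__":
    for length, gbps in RESULTS:
        drop = relative_drop(BASELINE_GBPS, gbps)
        print(f"length {length:>3}: {drop:.1%} slower")
```

      Applied to the first table (50.7 versus 45.1 Gbps), the same formula
      gives roughly 11%, in line with the "about 10%" stated earlier.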
      
      To resolve the socket lookup degradation, we introduce an optional
      per-netns hash table for TCP, but it's just ehash, and we still share
      the global bhash, bhash2 and lhash2.
      
      With a smaller ehash, we can look up non-listener sockets faster and
      isolate such noisy neighbours.  In addition, we can reduce lock contention.
      
      We can control the ehash size by a new sysctl knob.  However, depending
      on workloads, it will require very sensitive tuning, so we disable the
      feature by default (net.ipv4.tcp_child_ehash_entries == 0).  Moreover,
      we can fall back to using the global ehash in case we fail to allocate
      enough memory for a new ehash.  The maximum size is 16Mi, which is large
      enough that even if we have 48Mi sockets, the average list length is 3,
      and regression would be less than 1%.
      
      We can check the current ehash size by another read-only sysctl knob,
      net.ipv4.tcp_ehash_entries.  A negative value means the netns shares
      the global ehash (per-netns ehash is disabled or failed to allocate
      memory).
      
        # dmesg | cut -d ' ' -f 5- | grep "established hash"
        TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage)
      
        # sysctl net.ipv4.tcp_ehash_entries
        net.ipv4.tcp_ehash_entries = 524288  # can be changed by thash_entries
      
        # sysctl net.ipv4.tcp_child_ehash_entries
        net.ipv4.tcp_child_ehash_entries = 0  # disabled by default
      
        # ip netns add test1
        # ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries
        net.ipv4.tcp_ehash_entries = -524288  # share the global ehash
      
        # sysctl -w net.ipv4.tcp_child_ehash_entries=100
        net.ipv4.tcp_child_ehash_entries = 100
      
        # ip netns add test2
        # ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries
        net.ipv4.tcp_ehash_entries = 128  # own a per-netns ehash with 2^n buckets
      
      When more than two processes in the same netns create per-netns ehash
      concurrently with different sizes, we need to guarantee the size in
      one of the following ways:
      
        1) Share the global ehash and create per-netns ehash
      
        First, unshare() with tcp_child_ehash_entries==0.  It creates dedicated
        netns sysctl knobs where we can safely change tcp_child_ehash_entries
        and clone()/unshare() to create a per-netns ehash.
      
        2) Control write on sysctl by BPF
      
        We can use BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny read/write on
        sysctl knobs.
      
      Note that the global ehash allocated at the boot time is spread over
      available NUMA nodes, but inet_pernet_hashinfo_alloc() will allocate
      pages for each per-netns ehash depending on the current process's NUMA
      policy.  By default, the allocation is done in the local node only, so
      the per-netns hash table could fully reside on a random node.  Thus,
      depending on the NUMA policy the netns is created with and the CPU the
      current thread is running on, we could see some performance differences
      for highly optimised networking applications.
      
      Note also that the default values of two sysctl knobs depend on the ehash
      size and should be tuned carefully:
      
        tcp_max_tw_buckets  : tcp_child_ehash_entries / 2
        tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128)
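
      As a worked example of the two formulas above, assuming integer
      division and taking the ehash size as the input (the function names
      are hypothetical, not kernel symbols):

```python
MI = 1 << 20


def default_max_tw_buckets(ehash_entries: int) -> int:
    """Default tcp_max_tw_buckets for a given ehash size."""
    return ehash_entries // 2


def default_max_syn_backlog(ehash_entries: int) -> int:
    """Default tcp_max_syn_backlog for a given ehash size."""
    return max(128, ehash_entries // 128)


if __name__ == "__main__":
    for entries in (128, 524288, 16 * MI):
        print(
            f"ehash {entries:>8}: "
            f"tw_buckets={default_max_tw_buckets(entries)}, "
            f"syn_backlog={default_max_syn_backlog(entries)}"
        )
```

      With the smallest ehash shown earlier (128 buckets) the TIME_WAIT limit
      would be only 64, which illustrates why these knobs need careful tuning.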
      
      As a bonus, we can dismantle netns faster.  Currently, while destroying
      netns, we call inet_twsk_purge(), which walks through the global ehash.
      It can be potentially big because it can have many sockets other than
      TIME_WAIT in all netns.  Splitting ehash changes that situation, where
      it's only necessary for inet_twsk_purge() to clean up TIME_WAIT sockets
      in each netns.
      
      With regard to this, we do not free the per-netns ehash in inet_twsk_kill()
      to avoid UAF while iterating the per-netns ehash in inet_twsk_purge().
      Instead, we do it in tcp_sk_exit_batch() after calling tcp_twsk_purge() to
      keep it protocol-family-independent.
      
      In the future, we could optimise ehash lookup/iteration further by removing
      netns comparison for the per-netns ehash.
      
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • docs: net: dsa: update information about multiple CPU ports · 0773e3a8
      Vladimir Oltean authored
      
      
      DSA now supports multiple CPU ports. Explain the use cases that are
      covered, the new UAPI, the permitted degrees of freedom, the driver API,
      and remove some old "hanging fruits".
      
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
  22. Sep 13, 2022
  23. Sep 06, 2022
  24. Sep 05, 2022
    • net: phy: Add 1000BASE-KX interface mode · 05ad5d45
      Sean Anderson authored
      
      
      Add 1000BASE-KX interface mode. This is 1G backplane Ethernet as
      described in clause 70. Clause 73 autonegotiation is mandatory, and only
      full duplex operation is supported.
      
      Although at the PMA level this interface mode is identical to
      1000BASE-X, it uses a different form of in-band autonegotiation. This
      justifies a separate interface mode, since the interface mode (along
      with the MLO_AN_* autonegotiation mode) sets the type of autonegotiation
      which will be used on a link. This results in more than just electrical
      differences between the link modes.
      
      With regard to 1000BASE-X, 1000BASE-KX holds a similar position to
      SGMII: same signaling, but different autonegotiation. PCS drivers
      (which typically handle in-band autonegotiation) may only support
      1000BASE-X, and not 1000BASE-KX. Similarly, the phy mode is used to
      configure serdes phys with phy_set_mode_ext. Due to the different
      electrical standards (SFI or XFI vs Clause 70), they will likely want to
      use different configuration. Adding a phy interface mode for
      1000BASE-KX helps simplify configuration in these areas.
      
      Signed-off-by: Sean Anderson <sean.anderson@seco.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  25. Sep 01, 2022
  26. Aug 31, 2022
  27. Aug 30, 2022
  28. Aug 25, 2022