  1. Nov 30, 2020
    • net: Introduce preferred busy-polling · 7fd3253a
      Björn Töpel authored
      
      
      The existing busy-polling mode, enabled by the SO_BUSY_POLL socket
      option or system-wide using the /proc/sys/net/core/busy_read knob, is
      opportunistic. That means that if the NAPI context is not
      scheduled, busy-polling will poll it. If, after busy-polling, the budget
      is exceeded, the busy-polling logic will schedule the NAPI onto the
      regular softirq handling.
      
      One implication of the behavior above is that a busy or heavily loaded
      NAPI context will never enter or allow busy-polling. Some applications
      prefer that most NAPI processing be done by busy-polling.
      
      This series adds a new socket option, SO_PREFER_BUSY_POLL, that works
      in concert with the napi_defer_hard_irqs and gro_flush_timeout
      knobs. The napi_defer_hard_irqs and gro_flush_timeout knobs were
      introduced in commit 6f8b12d6 ("net: napi: add hard irqs deferral
      feature"), and allow a user to defer re-enabling interrupts and
      instead schedule the NAPI context from a watchdog timer. When a user
      enables SO_PREFER_BUSY_POLL, again with the other knobs enabled,
      and the NAPI context is being processed by a softirq, the softirq NAPI
      processing will exit early to allow the busy-polling to be performed.
      
      If the application stops performing busy-polling via a system call,
      the watchdog timer defined by gro_flush_timeout will time out, and
      regular softirq handling will resume.
      
      In summary: heavy-traffic applications that prefer busy-polling over
      softirq processing should use this option.
      
      Example usage:
      
        $ echo 2 | sudo tee /sys/class/net/ens785f1/napi_defer_hard_irqs
        $ echo 200000 | sudo tee /sys/class/net/ens785f1/gro_flush_timeout
      
      Note that the timeout should be larger than the userspace processing
      window, otherwise the watchdog will time out and fall back to regular
      softirq processing.
      
      Enable the SO_BUSY_POLL/SO_PREFER_BUSY_POLL options on your socket.
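
      A minimal sketch of that last step, assuming a Linux system. The
      helper name enable_preferred_busy_poll() and the fallback option
      numbers are illustrative assumptions: the numbers mirror the
      asm-generic socket header and differ on some architectures, so
      prefer the values from your own headers when they are defined.

```c
#include <stdio.h>
#include <sys/socket.h>

/* Fallbacks taken from the asm-generic socket header; assumption: your
 * architecture uses the generic numbering (some do not). */
#ifndef SO_BUSY_POLL
#define SO_BUSY_POLL 46
#endif
#ifndef SO_PREFER_BUSY_POLL
#define SO_PREFER_BUSY_POLL 69
#endif

/*
 * Hypothetical helper: request busy-polling and preferred busy-polling
 * on an existing socket. busy_poll_usecs is the busy-poll duration in
 * microseconds. Returns 0 on success, -1 if either setsockopt() fails
 * (errno is left as set by setsockopt()).
 */
int enable_preferred_busy_poll(int fd, int busy_poll_usecs)
{
    int prefer = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
                   &busy_poll_usecs, sizeof(busy_poll_usecs)) < 0) {
        perror("setsockopt(SO_BUSY_POLL)");
        return -1;
    }
    if (setsockopt(fd, SOL_SOCKET, SO_PREFER_BUSY_POLL,
                   &prefer, sizeof(prefer)) < 0) {
        perror("setsockopt(SO_PREFER_BUSY_POLL)");
        return -1;
    }
    return 0;
}
```

      Either call may fail on kernels without this series or for callers
      lacking sufficient privileges, so the return value should be checked.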
      
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/bpf/20201130185205.196029-2-bjorn.topel@gmail.com
  2. Nov 25, 2020
  3. Nov 18, 2020
  4. Nov 11, 2020
  5. Nov 10, 2020
  6. Nov 09, 2020
  7. Nov 06, 2020
  8. Nov 05, 2020
  9. Oct 31, 2020
  10. Oct 30, 2020
    • sctp: add SCTP_REMOTE_UDP_ENCAPS_PORT sockopt · 8dba2960
      Xin Long authored
      
      
      This patch implements:
      
        rfc6951#section-6.1: Get or Set the Remote UDP Encapsulation Port Number
      
      with the following struct as its parameter:
      
        struct sctp_udpencaps {
          sctp_assoc_t sue_assoc_id;
          struct sockaddr_storage sue_address;
          uint16_t sue_port;
        };
      
      The encap_port of a sock, assoc or transport can be changed by users,
      which also means that different transports of the same assoc can
      have different encap_port values.
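
      A small sketch of preparing such a parameter from user space. To
      stay buildable without SCTP development headers it uses a local
      mirror of the struct; the sketch_* names and the 32-bit association
      id type are assumptions for illustration. RFC 6951 describes
      sue_port as being in network byte order, hence the htons().

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Local mirror of the struct from the commit message. Assumption:
 * sctp_assoc_t is a 32-bit signed integer. */
typedef int32_t sketch_assoc_t;

struct sketch_udpencaps {
    sketch_assoc_t sue_assoc_id;
    struct sockaddr_storage sue_address;
    uint16_t sue_port;
};

/*
 * Fill the parameter for one peer address (transport) of an association.
 * host_port is given in host byte order and stored in network byte order.
 * A NULL peer leaves sue_address zeroed.
 */
void sketch_udpencaps_fill(struct sketch_udpencaps *enc,
                           sketch_assoc_t assoc_id,
                           const struct sockaddr *peer, socklen_t peer_len,
                           uint16_t host_port)
{
    memset(enc, 0, sizeof(*enc));
    enc->sue_assoc_id = assoc_id;
    if (peer != NULL && peer_len <= sizeof(enc->sue_address))
        memcpy(&enc->sue_address, peer, peer_len);
    enc->sue_port = htons(host_port);
}
```

      With the real headers, the filled struct would then be passed to
      setsockopt() with the SCTP_REMOTE_UDP_ENCAPS_PORT option named in
      the commit title.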
      
      v1->v2:
        - no change.
      v2->v3:
        - fix the endian warning when setting values between encap_port and
          sue_port.
      
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bridge: cfm: Netlink GET status Interface. · e77824d8
      Henrik Bjoernlund authored
      
      
      This implements the CFM netlink interface for getting status
      information.
      
      Add new nested netlink attributes. These attributes are used by
      user space to get status information.
      
      GETLINK:
          Request filter RTEXT_FILTER_CFM_STATUS:
          Indicating that CFM status information must be delivered.
      
          IFLA_BRIDGE_CFM:
              Points to the CFM information.
      
          IFLA_BRIDGE_CFM_MEP_STATUS_INFO:
              Indicates that the MEP instance status follows.
          IFLA_BRIDGE_CFM_CC_PEER_STATUS_INFO:
              Indicates that the peer MEP status follows.
      
      The CFM nested attribute has the following attributes at the next level.
      
      GETLINK RTEXT_FILTER_CFM_STATUS:
          IFLA_BRIDGE_CFM_MEP_STATUS_INSTANCE:
              The MEP instance number of the delivered status.
              The type is u32.
          IFLA_BRIDGE_CFM_MEP_STATUS_OPCODE_UNEXP_SEEN:
              The MEP instance received CFM PDU with unexpected Opcode.
              The type is u32 (bool).
          IFLA_BRIDGE_CFM_MEP_STATUS_VERSION_UNEXP_SEEN:
              The MEP instance received CFM PDU with unexpected version.
              The type is u32 (bool).
          IFLA_BRIDGE_CFM_MEP_STATUS_RX_LEVEL_LOW_SEEN:
              The MEP instance received CCM PDU with MD level lower than
              configured level. This frame is discarded.
              The type is u32 (bool).
      
          IFLA_BRIDGE_CFM_CC_PEER_STATUS_INSTANCE:
              The MEP instance number of the delivered status.
              The type is u32.
          IFLA_BRIDGE_CFM_CC_PEER_STATUS_PEER_MEPID:
              The added Peer MEP ID of the delivered status.
              The type is u32.
          IFLA_BRIDGE_CFM_CC_PEER_STATUS_CCM_DEFECT:
              The CCM defect status.
              The type is u32 (bool).
              True means no CCM frame has been received for 3.25 intervals
              of IFLA_BRIDGE_CFM_CC_CONFIG_EXP_INTERVAL.
          IFLA_BRIDGE_CFM_CC_PEER_STATUS_RDI:
              The last received CCM PDU RDI.
              The type is u32 (bool).
          IFLA_BRIDGE_CFM_CC_PEER_STATUS_PORT_TLV_VALUE:
              The last received CCM PDU Port Status TLV value field.
              The type is u8.
          IFLA_BRIDGE_CFM_CC_PEER_STATUS_IF_TLV_VALUE:
              The last received CCM PDU Interface Status TLV value field.
              The type is u8.
          IFLA_BRIDGE_CFM_CC_PEER_STATUS_SEEN:
              A CCM frame has been received from Peer MEP.
              The type is u32 (bool).
              This is cleared after GETLINK IFLA_BRIDGE_CFM_CC_PEER_STATUS_INFO.
          IFLA_BRIDGE_CFM_CC_PEER_STATUS_TLV_SEEN:
              A CCM frame with TLV has been received from Peer MEP.
              The type is u32 (bool).
              This is cleared after GETLINK IFLA_BRIDGE_CFM_CC_PEER_STATUS_INFO.
          IFLA_BRIDGE_CFM_CC_PEER_STATUS_SEQ_UNEXP_SEEN:
              A CCM frame with unexpected sequence number has been received
              from Peer MEP.
              The type is u32 (bool).
              A sequence number is unexpected when it is not exactly one
              higher than the previously received one.
              This is cleared after GETLINK IFLA_BRIDGE_CFM_CC_PEER_STATUS_INFO.
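
      The two numeric rules in the peer status list (the 3.25-interval
      CCM defect window and the "one higher" sequence rule) can be
      sketched as below. The function names, the microsecond unit and the
      treatment of 32-bit wrap-around are assumptions for illustration,
      not the kernel's implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Time without a received CCM frame after which the CCM defect would be
 * reported: 3.25 expected intervals, computed as 13/4 to stay in
 * integer arithmetic. Unit (microseconds) is an assumption.
 */
uint64_t ccm_defect_timeout_us(uint64_t exp_interval_us)
{
    return exp_interval_us * 13 / 4;
}

/*
 * A received sequence number is unexpected unless it is exactly one
 * higher than the previous one. Assumption: numbers wrap at 2^32, so
 * 0 following UINT32_MAX counts as expected.
 */
bool ccm_seq_unexpected(uint32_t prev_seq, uint32_t rx_seq)
{
    return rx_seq != (uint32_t)(prev_seq + 1u);
}
```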
      
      Signed-off-by: Henrik Bjoernlund <henrik.bjoernlund@microchip.com>
      Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
      Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bridge: cfm: Netlink GET configuration Interface. · 5e312fc0
      Henrik Bjoernlund authored
      
      
      This implements the CFM netlink interface for getting configuration
      information.
      
      Add new nested netlink attributes. These attributes are used by
      user space to get configuration information.
      
      GETLINK:
          Request filter RTEXT_FILTER_CFM_CONFIG:
          Indicating that CFM configuration information must be delivered.
      
          IFLA_BRIDGE_CFM:
              Points to the CFM information.
      
          IFLA_BRIDGE_CFM_MEP_CREATE_INFO:
              Indicates that MEP instance create parameters follow.
          IFLA_BRIDGE_CFM_MEP_CONFIG_INFO:
              Indicates that MEP instance config parameters follow.
          IFLA_BRIDGE_CFM_CC_CONFIG_INFO:
              Indicates that MEP instance CC functionality
              parameters follow.
          IFLA_BRIDGE_CFM_CC_RDI_INFO:
              Indicates that CC transmitted CCM PDU RDI
              parameters follow.
          IFLA_BRIDGE_CFM_CC_CCM_TX_INFO:
              Indicates that CC transmitted CCM PDU parameters
              follow.
          IFLA_BRIDGE_CFM_CC_PEER_MEP_INFO:
              Indicates that the added peer MEP IDs follow.
      
      The CFM nested attribute has the following attributes at the next level.
      
      GETLINK RTEXT_FILTER_CFM_CONFIG:
          IFLA_BRIDGE_CFM_MEP_CREATE_INSTANCE:
              The created MEP instance number.
              The type is u32.
          IFLA_BRIDGE_CFM_MEP_CREATE_DOMAIN:
              The created MEP domain.
              The type is u32 (br_cfm_domain).
              It must be BR_CFM_PORT.
              This means that CFM frames are transmitted and received
              directly on the port - untagged. Not in a VLAN.
          IFLA_BRIDGE_CFM_MEP_CREATE_DIRECTION:
              The created MEP direction.
              The type is u32 (br_cfm_mep_direction).
              It must be BR_CFM_MEP_DIRECTION_DOWN.
              This means that CFM frames are transmitted and received on
              the port. Not in the bridge.
          IFLA_BRIDGE_CFM_MEP_CREATE_IFINDEX:
              The created MEP residence port ifindex.
              The type is u32 (ifindex).
      
          IFLA_BRIDGE_CFM_MEP_DELETE_INSTANCE:
              The deleted MEP instance number.
              The type is u32.
      
          IFLA_BRIDGE_CFM_MEP_CONFIG_INSTANCE:
              The configured MEP instance number.
              The type is u32.
          IFLA_BRIDGE_CFM_MEP_CONFIG_UNICAST_MAC:
              The configured MEP unicast MAC address.
              The type is 6*u8 (array).
              This is used as SMAC in all transmitted CFM frames.
          IFLA_BRIDGE_CFM_MEP_CONFIG_MDLEVEL:
              The configured MEP unicast MD level.
              The type is u32.
              It must be in the range 1-7.
              No CFM frames are passing through this MEP on lower levels.
          IFLA_BRIDGE_CFM_MEP_CONFIG_MEPID:
              The configured MEP ID.
              The type is u32.
              It must be in the range 0-0x1FFF.
              This MEP ID is inserted in any transmitted CCM frame.
      
          IFLA_BRIDGE_CFM_CC_CONFIG_INSTANCE:
              The configured MEP instance number.
              The type is u32.
          IFLA_BRIDGE_CFM_CC_CONFIG_ENABLE:
              The Continuity Check (CC) functionality is enabled or disabled.
              The type is u32 (bool).
          IFLA_BRIDGE_CFM_CC_CONFIG_EXP_INTERVAL:
              The CC expected receive interval of CCM frames.
              The type is u32 (br_cfm_ccm_interval).
              This is also the transmission interval of CCM frames when enabled.
          IFLA_BRIDGE_CFM_CC_CONFIG_EXP_MAID:
              The CC expected receive MAID in CCM frames.
              The type is CFM_MAID_LENGTH*u8.
              This MAID is also inserted in transmitted CCM frames.
      
          IFLA_BRIDGE_CFM_CC_PEER_MEP_INSTANCE:
              The configured MEP instance number.
              The type is u32.
          IFLA_BRIDGE_CFM_CC_PEER_MEPID:
              The CC Peer MEP ID added.
              The type is u32.
              When a Peer MEP ID is added and CC is enabled it is expected to
              receive CCM frames from that Peer MEP.
      
          IFLA_BRIDGE_CFM_CC_RDI_INSTANCE:
              The configured MEP instance number.
              The type is u32.
          IFLA_BRIDGE_CFM_CC_RDI_RDI:
              The RDI that is inserted in transmitted CCM PDU.
              The type is u32 (bool).
      
          IFLA_BRIDGE_CFM_CC_CCM_TX_INSTANCE:
              The configured MEP instance number.
              The type is u32.
          IFLA_BRIDGE_CFM_CC_CCM_TX_DMAC:
              The transmitted CCM frame destination MAC address.
              The type is 6*u8 (array).
              This is used as DMAC in all transmitted CFM frames.
          IFLA_BRIDGE_CFM_CC_CCM_TX_SEQ_NO_UPDATE:
              The transmitted CCM frame update (increment) of sequence
              number is enabled or disabled.
              The type is u32 (bool).
          IFLA_BRIDGE_CFM_CC_CCM_TX_PERIOD:
              The period of time during which CCM frames are transmitted.
              The type is u32.
              The time is given in seconds. SETLINK IFLA_BRIDGE_CFM_CC_CCM_TX
              must be done before timeout to keep transmission alive.
              When period is zero any ongoing CCM frame transmission
              will be stopped.
          IFLA_BRIDGE_CFM_CC_CCM_TX_IF_TLV:
              The transmitted CCM frame update with Interface Status TLV
              is enabled or disabled.
              The type is u32 (bool).
          IFLA_BRIDGE_CFM_CC_CCM_TX_IF_TLV_VALUE:
              The transmitted Interface Status TLV value field.
              The type is u8.
          IFLA_BRIDGE_CFM_CC_CCM_TX_PORT_TLV:
              The transmitted CCM frame update with Port Status TLV is enabled
              or disabled.
              The type is u32 (bool).
          IFLA_BRIDGE_CFM_CC_CCM_TX_PORT_TLV_VALUE:
              The transmitted Port Status TLV value field.
              The type is u8.
      
      Signed-off-by: Henrik Bjoernlund <henrik.bjoernlund@microchip.com>
      Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
      Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bridge: cfm: Netlink SET configuration Interface. · 2be665c3
      Henrik Bjoernlund authored
      
      
      This implements the CFM netlink interface for setting
      configuration.
      
      Add new nested netlink attributes. These attributes are used by
      user space to create/delete/configure CFM instances.
      
      SETLINK:
          IFLA_BRIDGE_CFM:
              Indicate that the following attributes are CFM.
      
          IFLA_BRIDGE_CFM_MEP_CREATE:
              Indicates that a MEP instance must be created.
          IFLA_BRIDGE_CFM_MEP_DELETE:
              Indicates that a MEP instance must be deleted.
          IFLA_BRIDGE_CFM_MEP_CONFIG:
              Indicates that a MEP instance must be configured.
          IFLA_BRIDGE_CFM_CC_CONFIG:
              Indicates that a MEP instance Continuity Check (CC)
              functionality must be configured.
          IFLA_BRIDGE_CFM_CC_PEER_MEP_ADD:
              Indicates that a CC Peer MEP must be added.
          IFLA_BRIDGE_CFM_CC_PEER_MEP_REMOVE:
              Indicates that a CC Peer MEP must be removed.
          IFLA_BRIDGE_CFM_CC_CCM_TX:
              Indicates that the CC transmitted CCM PDU must be configured.
          IFLA_BRIDGE_CFM_CC_RDI:
              Indicates that the CC transmitted CCM PDU RDI must be
              configured.
      
      The CFM nested attribute has the following attributes at the next level.
      
      SETLINK RTEXT_FILTER_CFM_CONFIG:
          IFLA_BRIDGE_CFM_MEP_CREATE_INSTANCE:
              The created MEP instance number.
              The type is u32.
          IFLA_BRIDGE_CFM_MEP_CREATE_DOMAIN:
              The created MEP domain.
              The type is u32 (br_cfm_domain).
              It must be BR_CFM_PORT.
              This means that CFM frames are transmitted and received
              directly on the port - untagged. Not in a VLAN.
          IFLA_BRIDGE_CFM_MEP_CREATE_DIRECTION:
              The created MEP direction.
              The type is u32 (br_cfm_mep_direction).
              It must be BR_CFM_MEP_DIRECTION_DOWN.
              This means that CFM frames are transmitted and received on
              the port. Not in the bridge.
          IFLA_BRIDGE_CFM_MEP_CREATE_IFINDEX:
              The created MEP residence port ifindex.
              The type is u32 (ifindex).
      
          IFLA_BRIDGE_CFM_MEP_DELETE_INSTANCE:
              The deleted MEP instance number.
              The type is u32.
      
          IFLA_BRIDGE_CFM_MEP_CONFIG_INSTANCE:
              The configured MEP instance number.
              The type is u32.
          IFLA_BRIDGE_CFM_MEP_CONFIG_UNICAST_MAC:
              The configured MEP unicast MAC address.
              The type is 6*u8 (array).
              This is used as SMAC in all transmitted CFM frames.
          IFLA_BRIDGE_CFM_MEP_CONFIG_MDLEVEL:
              The configured MEP unicast MD level.
              The type is u32.
              It must be in the range 1-7.
              No CFM frames are passing through this MEP on lower levels.
          IFLA_BRIDGE_CFM_MEP_CONFIG_MEPID:
              The configured MEP ID.
              The type is u32.
              It must be in the range 0-0x1FFF.
              This MEP ID is inserted in any transmitted CCM frame.
      
          IFLA_BRIDGE_CFM_CC_CONFIG_INSTANCE:
              The configured MEP instance number.
              The type is u32.
          IFLA_BRIDGE_CFM_CC_CONFIG_ENABLE:
              The Continuity Check (CC) functionality is enabled or disabled.
              The type is u32 (bool).
          IFLA_BRIDGE_CFM_CC_CONFIG_EXP_INTERVAL:
              The CC expected receive interval of CCM frames.
              The type is u32 (br_cfm_ccm_interval).
              This is also the transmission interval of CCM frames when enabled.
          IFLA_BRIDGE_CFM_CC_CONFIG_EXP_MAID:
              The CC expected receive MAID in CCM frames.
              The type is CFM_MAID_LENGTH*u8.
              This MAID is also inserted in transmitted CCM frames.
      
          IFLA_BRIDGE_CFM_CC_PEER_MEP_INSTANCE:
              The configured MEP instance number.
              The type is u32.
          IFLA_BRIDGE_CFM_CC_PEER_MEPID:
              The CC Peer MEP ID added.
              The type is u32.
              When a Peer MEP ID is added and CC is enabled it is expected to
              receive CCM frames from that Peer MEP.
      
          IFLA_BRIDGE_CFM_CC_RDI_INSTANCE:
              The configured MEP instance number.
              The type is u32.
          IFLA_BRIDGE_CFM_CC_RDI_RDI:
              The RDI that is inserted in transmitted CCM PDU.
              The type is u32 (bool).
      
          IFLA_BRIDGE_CFM_CC_CCM_TX_INSTANCE:
              The configured MEP instance number.
              The type is u32.
          IFLA_BRIDGE_CFM_CC_CCM_TX_DMAC:
              The transmitted CCM frame destination MAC address.
              The type is 6*u8 (array).
              This is used as DMAC in all transmitted CFM frames.
          IFLA_BRIDGE_CFM_CC_CCM_TX_SEQ_NO_UPDATE:
              The transmitted CCM frame update (increment) of sequence
              number is enabled or disabled.
              The type is u32 (bool).
          IFLA_BRIDGE_CFM_CC_CCM_TX_PERIOD:
              The period of time during which CCM frames are transmitted.
              The type is u32.
              The time is given in seconds. SETLINK IFLA_BRIDGE_CFM_CC_CCM_TX
              must be done before timeout to keep transmission alive.
              When period is zero any ongoing CCM frame transmission
              will be stopped.
          IFLA_BRIDGE_CFM_CC_CCM_TX_IF_TLV:
              The transmitted CCM frame update with Interface Status TLV
              is enabled or disabled.
              The type is u32 (bool).
          IFLA_BRIDGE_CFM_CC_CCM_TX_IF_TLV_VALUE:
              The transmitted Interface Status TLV value field.
              The type is u8.
          IFLA_BRIDGE_CFM_CC_CCM_TX_PORT_TLV:
              The transmitted CCM frame update with Port Status TLV is enabled
              or disabled.
              The type is u32 (bool).
          IFLA_BRIDGE_CFM_CC_CCM_TX_PORT_TLV_VALUE:
              The transmitted Port Status TLV value field.
              The type is u8.
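
      As a rough illustration of the range rules stated in the list above
      (MD level 1-7, MEP ID 0-0x1FFF), a user-space tool might validate
      values before building the SETLINK message. The names below are
      hypothetical, and the limits are copied from this text rather than
      from the kernel's netlink policy, which may differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Limits as stated in the commit message (assumption: they match what
 * the kernel actually enforces). */
#define SKETCH_CFM_MDLEVEL_MIN 1u
#define SKETCH_CFM_MDLEVEL_MAX 7u
#define SKETCH_CFM_MEPID_MAX   0x1FFFu

/* Returns true when both values fall inside the documented ranges. */
bool sketch_cfm_mep_config_valid(uint32_t mdlevel, uint32_t mepid)
{
    if (mdlevel < SKETCH_CFM_MDLEVEL_MIN || mdlevel > SKETCH_CFM_MDLEVEL_MAX)
        return false;
    return mepid <= SKETCH_CFM_MEPID_MAX;
}
```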
      
      Signed-off-by: Henrik Bjoernlund <henrik.bjoernlund@microchip.com>
      Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
      Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bridge: cfm: Kernel space implementation of CFM. CCM frame RX added. · dc32cbb3
      Henrik Bjoernlund authored
      
      
      This is the third commit of the implementation of the CFM protocol
      according to 802.1Q section 12.14.
      
      Functionality is extended with CCM frame reception.
      The MEP instance now contains CCM based status information.
      Most important is the CCM defect status, which indicates whether correct
      CCM frames are received at the expected interval.
      
      Signed-off-by: Henrik Bjoernlund <henrik.bjoernlund@microchip.com>
      Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
      Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bridge: cfm: Kernel space implementation of CFM. CCM frame TX added. · a806ad8e
      Henrik Bjoernlund authored
      
      
      This is the second commit of the implementation of the CFM protocol
      according to 802.1Q section 12.14.
      
      Functionality is extended with CCM frame transmission.
      
      Interface is extended with these functions:
      br_cfm_cc_rdi_set()
      br_cfm_cc_ccm_tx()
      br_cfm_cc_config_set()
      
      A MEP Continuity Check feature can be configured by
      br_cfm_cc_config_set()
          The Continuity Check parameters can be configured to be used when
          transmitting CCM.
      
      A MEP can be configured to start or stop transmission of CCM frames by
      br_cfm_cc_ccm_tx()
          The CCM will be transmitted for a selected period in seconds.
          Must call this function before timeout to keep transmission alive.
      
      A MEP transmitting CCM can be configured with inserted RDI in PDU by
      br_cfm_cc_rdi_set()
      
      Signed-off-by: Henrik Bjoernlund <henrik.bjoernlund@microchip.com>
      Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
      Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bridge: cfm: Kernel space implementation of CFM. MEP create/delete. · 86a14b79
      Henrik Bjoernlund authored
      
      
      This is the first commit of the implementation of the CFM protocol
      according to 802.1Q section 12.14.
      
      It contains MEP instance create, delete and configuration.
      
      Connectivity Fault Management (CFM) comprises capabilities for
      detecting, verifying, and isolating connectivity failures in
      Virtual Bridged Networks. These capabilities can be used in
      networks operated by multiple independent organizations, each
      with restricted management access to each other's equipment.
      
      CFM functions are partitioned as follows:
          - Path discovery
          - Fault detection
          - Fault verification and isolation
          - Fault notification
          - Fault recovery
      
      Interface consists of these functions:
      br_cfm_mep_create()
      br_cfm_mep_delete()
      br_cfm_mep_config_set()
      br_cfm_cc_config_set()
      br_cfm_cc_peer_mep_add()
      br_cfm_cc_peer_mep_remove()
      
      A MEP instance is created by br_cfm_mep_create()
          - It is the Maintenance association End Point
            described in 802.1Q section 19.2.
          - It is created on a specific level (1-7) and ensures
            that no CFM frames pass through this MEP on lower levels.
          - It initiates and validates CFM frames on its level.
          - It can only exist on a port that is related to a bridge.
          - Attributes given cannot be changed until the instance is
            deleted.
      
      A MEP instance can be deleted by br_cfm_mep_delete().
      
      A created MEP instance has attributes that can be
      configured by br_cfm_mep_config_set().
      
      A MEP Continuity Check feature can be configured by
      br_cfm_cc_config_set()
          The Continuity Check Receiver state machine can be
          enabled and disabled.
          According to 802.1Q section 19.2.8
      
      A MEP can have Peer MEPs added and removed by
      br_cfm_cc_peer_mep_add() and br_cfm_cc_peer_mep_remove()
          The Continuity Check feature can maintain connectivity
          status on each added Peer MEP.
      
      Signed-off-by: Henrik Bjoernlund <henrik.bjoernlund@microchip.com>
      Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
      Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • bridge: uapi: cfm: Added EtherType used by the CFM protocol. · fbaedb41
      Henrik Bjoernlund authored
      
      
      This EtherType is used by all CFM protocol frames transmitted
      according to 802.1Q section 12.14.
      
      Signed-off-by: Henrik Bjoernlund <henrik.bjoernlund@microchip.com>
      Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
      Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  11. Oct 29, 2020
  12. Oct 28, 2020
  13. Oct 26, 2020
  14. Oct 23, 2020
  15. Oct 21, 2020
  16. Oct 19, 2020
  17. Oct 18, 2020
    • mm/madvise: introduce process_madvise() syscall: an external memory hinting API · ecb8ac8b
      Minchan Kim authored
      There is a use case in which System Management Software (SMS) wants to
      give a memory hint like MADV_[COLD|PAGEOUT] to other processes; in the
      case of Android, it is the ActivityManagerService.
      
      The information required to make the reclaim decision is not known to the
      app.  Instead, it is known to the centralized userspace
      daemon(ActivityManagerService), and that daemon must be able to initiate
      reclaim on its own without any app involvement.
      
      To solve the issue, this patch introduces a new syscall,
      process_madvise(2).  It uses the pidfd of an external process to give the
      hint.  It also supports vectored address ranges, because an Android app
      has thousands of vmas due to zygote, so it is a waste of CPU and power to
      call the syscall once for each vma.  (In testing, 2000 per-vma
      syscalls vs a single vectored syscall showed a 15% performance
      improvement.  I think it would be bigger in real practice because the
      test ran in a very cache-friendly environment.)
      
      Another potential use case for the vector range is to amortize the cost
      of TLB shootdowns for multiple ranges when using MADV_DONTNEED; this could
      benefit users like TCP receive zerocopy and malloc implementations.  In
      the future, we could find more use cases for other advice values, so
      let's make this part of the API now, since we are introducing a new
      syscall anyway.  With that, existing madvise(2) users could replace it
      with process_madvise(2) using their own pid if they want batched address
      range support.
      
      Since it could affect another process's address range, only a privileged
      process (PTRACE_MODE_ATTACH_FSCREDS) or one that otherwise has the right
      to ptrace the target (e.g., by having the same UID) can use it
      successfully.  The flag argument is reserved for future use if we need to
      extend the API.
      
      I think supporting every hint that madvise supports, or will support, in
      process_madvise is rather risky, because we are not sure all hints make
      sense from an external process, and the implementation of a hint may rely
      on the caller being in the current context, so it could be error-prone.
      Thus, I limited the hints to MADV_[COLD|PAGEOUT] in this patch.
      
      If someone wants to add other hints, we can hear the use case and review
      it for each hint.  That is safer for maintenance than introducing a
      buggy syscall that is hard to fix later.
      
      So finally, the API is as follows,
      
            ssize_t process_madvise(int pidfd, const struct iovec *iovec,
                      unsigned long vlen, int advice, unsigned int flags);
      
          DESCRIPTION
            The process_madvise() system call is used to give advice or directions
            to the kernel about the address ranges of an external process as well
            as of the local process. It applies the advice to the address ranges
            of the process described by iovec and vlen. The goal of such advice is
            to improve system or application performance.
      
            The pidfd selects the process referred to by the PID file descriptor
            specified in pidfd. (See pidfd_open(2) for further information.)
      
            The pointer iovec points to an array of iovec structures, defined in
            <sys/uio.h> as:
      
              struct iovec {
                  void *iov_base;         /* starting address */
                  size_t iov_len;         /* number of bytes to be advised */
              };
      
            The iovec describes address ranges beginning at address iov_base
            and with a length of iov_len bytes.
      
            The vlen represents the number of elements in iovec.
      
            The advice is indicated in the advice argument, which is one of the
            following at this moment if the target process specified by pidfd is
            external.
      
              MADV_COLD
              MADV_PAGEOUT
      
            Permission to provide a hint to an external process is governed by a
            ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).
      
            process_madvise() supports every advice value madvise(2) has if the
            target process is in the same thread group as the calling process, so
            users could use process_madvise(2) to extend existing madvise(2) to
            support vector address ranges.
      
          RETURN VALUE
            On success, process_madvise() returns the number of bytes advised.
            This return value may be less than the total number of requested
            bytes if an error occurred. The caller should check the return value
            to determine whether partial advice occurred.
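
      A hedged sketch of calling the new syscall and checking for partial
      advice as the RETURN VALUE section suggests. It goes through
      syscall(2) directly; the sketch_* names are hypothetical, and the
      fallback syscall number (440) and MADV_COLD value (20) are
      assumptions that hold on most, but not all, architectures.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Fallbacks for older headers; assumption: generic numbering applies. */
#ifndef SYS_process_madvise
#define SYS_process_madvise 440
#endif
#ifndef MADV_COLD
#define MADV_COLD 20
#endif

/* Thin wrapper with the prototype given in the commit message. */
ssize_t sketch_process_madvise(int pidfd, const struct iovec *iov,
                               unsigned long vlen, int advice,
                               unsigned int flags)
{
    return syscall(SYS_process_madvise, pidfd, iov, vlen, advice, flags);
}

/* Total number of bytes described by the vector. */
size_t sketch_iov_total(const struct iovec *iov, size_t vlen)
{
    size_t total = 0;

    for (size_t i = 0; i < vlen; i++)
        total += iov[i].iov_len;
    return total;
}

/*
 * Hint MADV_COLD for every range in iov.
 * Returns 1 if all bytes were advised, 0 on partial advice,
 * -1 on error (errno set by the syscall).
 */
int sketch_advise_cold(int pidfd, const struct iovec *iov, size_t vlen)
{
    ssize_t done = sketch_process_madvise(pidfd, iov, vlen, MADV_COLD, 0);

    if (done < 0)
        return -1;
    return (size_t)done == sketch_iov_total(iov, vlen) ? 1 : 0;
}
```

      The pidfd would typically come from pidfd_open(2); the flags
      argument is passed as 0 because it is reserved for future use.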
      
      FAQ:
      
      Q.1 - Why does any external entity have better knowledge?
      
      Quote from Sandeep
      
      "For Android, every application (including the special SystemServer)
      are forked from Zygote.  The reason of course is to share as many
      libraries and classes between the two as possible to benefit from the
      preloading during boot.
      
      After applications start, (almost) all of the APIs end up calling into
      this SystemServer process over IPC (binder) and back to the
      application.
      
      In a fully running system, the SystemServer monitors every single
      process periodically to calculate their PSS / RSS and also decides
      which process is "important" to the user for interactivity.
      
      So, because of how these processes start _and_ the fact that the
      SystemServer is looping to monitor each process, it does tend to *know*
      which address range of the application is not used / useful.
      
      Besides, we can never rely on applications to clean things up
      themselves.  We've had the "hey app1, the system is low on memory,
      please trim your memory usage down" notifications for a long time[1].
      They rely on applications honoring the broadcasts and very few do.
      
      So, if we want to avoid the inevitable killing of the application and
      restarting it, some way to be able to tell the OS about unimportant
      memory in these applications will be useful.
      
      - ssp
      
      Q.2 - How is the race (i.e., object validation) handled between an
      external process giving a hint and the target process receiving that
      hint?
      
      process_madvise operates on the target process's address space as it
      exists at the instant that process_madvise is called.  If the
      target process can run between the time the calling process
      inspects the target process address space and the time that
      process_madvise is actually called, process_madvise may operate on
      memory regions that the calling process does not expect.  It's the
      responsibility of the process calling process_madvise to close this
      race condition.  For example, the calling process can suspend the
      target process with ptrace, SIGSTOP, or the freezer cgroup so that it
      doesn't have an opportunity to change its own address space before
      process_madvise is called.  Another option is to operate on memory
      regions that the caller knows a priori will be unchanged in the target
      process.  Yet another option is to accept the race for certain
      process_madvise calls after reasoning that mistargeting will do no
      harm.  The suggested API itself does not provide synchronization.  The
      same applies to other APIs such as move_pages and process_vm_writev.
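
      The SIGSTOP approach mentioned above can be sketched as follows.  This
      is only a sketch under a simplifying assumption: the target is a child
      of the caller, so waitpid(2) with WUNTRACED can confirm that it has
      actually stopped (for an unrelated process the caller would need
      another way to observe the stop).  The names with_child_stopped and
      note_called are hypothetical; note_called merely stands in for the
      real work, such as inspecting the address space and then calling
      process_madvise.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Work to perform while the target is guaranteed not to run. */
typedef int (*stopped_fn)(pid_t target, void *arg);

/*
 * Stop a child process, run fn while it is stopped, then resume it.
 * Returns fn's result, or -1 if the child could not be stopped.
 * Assumption: `child` is a direct child of the caller.
 */
int with_child_stopped(pid_t child, stopped_fn fn, void *arg)
{
    int status;
    int ret;

    if (kill(child, SIGSTOP) == -1)
        return -1;

    /* WUNTRACED makes waitpid() report the stop, not only termination. */
    if (waitpid(child, &status, WUNTRACED) == -1)
        return -1;
    if (!WIFSTOPPED(status))
        return -1;      /* the child terminated instead of stopping */

    /* The target cannot change its address space while stopped. */
    ret = fn(child, arg);

    kill(child, SIGCONT);
    return ret;
}

/* Placeholder for the real work: only records that it was invoked. */
int note_called(pid_t target, void *arg)
{
    (void)target;
    *(int *)arg += 1;
    return 0;
}
```

      Running the callback in the caller's context while the target is
      stopped closes the window between inspecting the address space and
      giving the hint, at the cost of briefly pausing the target.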
      
      The race isn't really a problem though.  Why is it so wrong to require
      that callers do their own synchronization in some manner?  Nobody
      objects to write(2) merely because it's possible for two processes to
      open the same file and clobber each other's writes --- instead, we tell
      people to use flock or something.  Think about mmap.  It never
      guarantees newly allocated address space is still valid when the user
      tries to access it because other threads could unmap the memory right
      before.  That's where we need synchronization through another API or
      a userspace design.  It shouldn't be part of the API itself.  If
      someone needs synchronization more fine-grained than process level,
      two ideas were suggested - cookie[2] and anon-fd[3].  Both could be
      implemented using the last, reserved argument of the API, but I don't
      think that is necessary right now: we already have ways to prevent
      the race, so I don't want to add complexity with a more fine-grained
      model.
      
      To keep the API extensible, an unsigned long is reserved as the last
      argument so that such support can be added in the future if someone
      really needs it.
      
      Q.3 - Why doesn't ptrace work?
      
      Injecting an madvise in the target process using ptrace would not work
      for us because such injected madvise would have to be executed by the
      target process, which means that process would have to be runnable and
      that creates the risk of the abovementioned race and of hinting the
      wrong VMA.  Furthermore, we want to act on the hint in the caller's
      context, not the callee's, because the callee is usually limited by
      cpuset/cgroups or is even in a frozen state, so it can't act by itself
      quickly enough, which causes more thrashing/kills.  Nor does ptrace
      work if the target process is already being ptraced (e.g., by strace,
      a debugger, or a minidump tool), because a process can have at most
      one ptracer.
      
      [1] https://developer.android.com/topic/performance/memory"
      
      [2] process_getinfo for getting the cookie, which is updated whenever
          the VMA layout of the process address space changes - Daniel Colascione -
          https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224
      
      [3] anonymous fd which is used for the object(i.e., address range)
          validation - Michal Hocko -
          https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/
      
      [minchan@kernel.org: fix process_madvise build break for arm64]
        Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com
      [minchan@kernel.org: fix build error for mips of process_madvise]
        Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com
      [akpm@linux-foundation.org: fix patch ordering issue]
      [akpm@linux-foundation.org: fix arm64 whoops]
      [minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian]
      [akpm@linux-foundation.org: fix i386 build]
      [sfr@canb.auug.org.au: fix syscall numbering]
        Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au
      [sfr@canb.auug.org.au: madvise.c needs compat.h]
        Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au
      [minchan@kernel.org: fix mips build]
        Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com
      [yuehaibing@huawei.com: remove duplicate header which is included twice]
        Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com
      [minchan@kernel.org: do not use helper functions for process_madvise]
        Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com
      [akpm@linux-foundation.org: pidfd_get_pid() gained an argument]
      [sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"]
        Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.au
      
      
      
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Dias <joaodias@google.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Sandeep Patil <sspatil@google.com>
      Cc: SeongJae Park <sj38.park@gmail.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Florian Weimer <fw@deneb.enyo.de>
      Cc: <linux-man@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com
      Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org
      Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.org
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ecb8ac8b
  18. Oct 16, 2020
  19. Oct 15, 2020