Skip to content
  1. Oct 21, 2013
  2. Oct 19, 2013
  3. Oct 18, 2013
  4. Oct 17, 2013
  5. Oct 14, 2013
    • Pablo Neira Ayuso's avatar
      netfilter: nf_tables: add ARP filtering support · ed683f13
      Pablo Neira Ayuso authored
      
      
      This patch registers the ARP family and he filter chain type
      for this family.
      
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      ed683f13
    • Pablo Neira Ayuso's avatar
      netfilter: nf_tables: add trace support · b5bc89bf
      Pablo Neira Ayuso authored
      
      
      This patch adds support for tracing the packet travel through
      the ruleset, in a similar fashion to x_tables.
      
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      b5bc89bf
    • Pablo Neira Ayuso's avatar
      netfilter: nfnetlink: add batch support and use it from nf_tables · 0628b123
      Pablo Neira Ayuso authored
      
      
      This patch adds a batch support to nfnetlink. Basically, it adds
      two new control messages:
      
      * NFNL_MSG_BATCH_BEGIN, that indicates the beginning of a batch,
        the nfgenmsg->res_id indicates the nfnetlink subsystem ID.
      
      * NFNL_MSG_BATCH_END, that results in the invocation of the
        ss->commit callback function. If not specified or an error
        ocurred in the batch, the ss->abort function is invoked
        instead.
      
      The end message represents the commit operation in nftables, the
      lack of end message results in an abort. This patch also adds the
      .call_batch function that is only called from the batch receival
      path.
      
      This patch adds atomic rule updates and dumps based on
      bitmask generations. This allows to atomically commit a set of
      rule-set updates incrementally without altering the internal
      state of existing nf_tables expressions/matches/targets.
      
      The idea consists of using a generation cursor of 1 bit and
      a bitmask of 2 bits per rule. Assuming the gencursor is 0,
      then the genmask (expressed as a bitmask) can be interpreted
      as:
      
      00 active in the present, will be active in the next generation.
      01 inactive in the present, will be active in the next generation.
      10 active in the present, will be deleted in the next generation.
       ^
       gencursor
      
      Once you invoke the transition to the next generation, the global
      gencursor is updated:
      
      00 active in the present, will be active in the next generation.
      01 active in the present, needs to zero its future, it becomes 00.
      10 inactive in the present, delete now.
      ^
      gencursor
      
      If a dump is in progress and nf_tables enters a new generation,
      the dump will stop and return -EBUSY to let userspace know that
      it has to retry again. In order to invalidate dumps, a global
      genctr counter is increased everytime nf_tables enters a new
      generation.
      
      This new operation can be used from the user-space utility
      that controls the firewall, eg.
      
      nft -f restore
      
      The rule updates contained in `file' will be applied atomically.
      
      cat file
      -----
      add filter INPUT ip saddr 1.1.1.1 counter accept #1
      del filter INPUT ip daddr 2.2.2.2 counter drop   #2
      -EOF-
      
      Note that the rule 1 will be inactive until the transition to the
      next generation, the rule 2 will be evicted in the next generation.
      
      There is a penalty during the rule update due to the branch
      misprediction in the packet matching framework. But that should be
      quickly resolved once the iteration over the commit list that
      contain rules that require updates is finished.
      
      Event notification happens once the rule-set update has been
      committed. So we skip notifications is case the rule-set update
      is aborted, which can happen in case that the rule-set is tested
      to apply correctly.
      
      This patch squashed the following patches from Pablo:
      
      * nf_tables: atomic rule updates and dumps
      * nf_tables: get rid of per rule list_head for commits
      * nf_tables: use per netns commit list
      * nfnetlink: add batch support and use it from nf_tables
      * nf_tables: all rule updates are transactional
      * nf_tables: attach replacement rule after stale one
      * nf_tables: do not allow deletion/replacement of stale rules
      * nf_tables: remove unused NFTA_RULE_FLAGS
      
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      0628b123
    • Pablo Neira Ayuso's avatar
      netfilter: nf_tables: complete net namespace support · 99633ab2
      Pablo Neira Ayuso authored
      
      
      Register family per netnamespace to ensure that sets are
      only visible in its approapriate namespace.
      
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      99633ab2
    • Pablo Neira Ayuso's avatar
      netfilter: nf_tables: add compatibility layer for x_tables · 0ca743a5
      Pablo Neira Ayuso authored
      
      
      This patch adds the x_tables compatibility layer. This allows you
      to use existing x_tables matches and targets from nf_tables.
      
      This compatibility later allows us to use existing matches/targets
      for features that are still missing in nf_tables. We can progressively
      replace them with native nf_tables extensions. It also provides the
      userspace compatibility software that allows you to express the
      rule-set using the iptables syntax but using the nf_tables kernel
      components.
      
      In order to get this compatibility layer working, I've done the
      following things:
      
      * add NFNL_SUBSYS_NFT_COMPAT: this new nfnetlink subsystem is used
      to query the x_tables match/target revision, so we don't need to
      use the native x_table getsockopt interface.
      
      * emulate xt structures: this required extending the struct nft_pktinfo
      to include the fragment offset, which is already obtained from
      ip[6]_tables and that is used by some matches/targets.
      
      * add support for default policy to base chains, required to emulate
        x_tables.
      
      * add NFTA_CHAIN_USE attribute to obtain the number of references to
        chains, required by x_tables emulation.
      
      * add chain packet/byte counters using per-cpu.
      
      * support 32-64 bits compat.
      
      For historical reasons, this patch includes the following patches
      that were posted in the netfilter-devel mailing list.
      
      From Pablo Neira Ayuso:
      * nf_tables: add default policy to base chains
      * netfilter: nf_tables: add NFTA_CHAIN_USE attribute
      * nf_tables: nft_compat: private data of target and matches in contiguous area
      * nf_tables: validate hooks for compat match/target
      * nf_tables: nft_compat: release cached matches/targets
      * nf_tables: x_tables support as a compile time option
      * nf_tables: fix alias for xtables over nftables module
      * nf_tables: add packet and byte counters per chain
      * nf_tables: fix per-chain counter stats if no counters are passed
      * nf_tables: don't bump chain stats
      * nf_tables: add protocol and flags for xtables over nf_tables
      * nf_tables: add ip[6]t_entry emulation
      * nf_tables: move specific layer 3 compat code to nf_tables_ipv[4|6]
      * nf_tables: support 32bits-64bits x_tables compat
      * nf_tables: fix compilation if CONFIG_COMPAT is disabled
      
      From Patrick McHardy:
      * nf_tables: move policy to struct nft_base_chain
      * nf_tables: send notifications for base chain policy changes
      
      From Alexander Primak:
      * nf_tables: remove the duplicate NF_INET_LOCAL_OUT
      
      From Nicolas Dichtel:
      * nf_tables: fix compilation when nf-netlink is a module
      
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      0ca743a5
    • Pablo Neira Ayuso's avatar
      netfilter: nf_tables: convert built-in tables/chains to chain types · 9370761c
      Pablo Neira Ayuso authored
      
      
      This patch converts built-in tables/chains to chain types that
      allows you to deploy customized table and chain configurations from
      userspace.
      
      After this patch, you have to specify the chain type when
      creating a new chain:
      
       add chain ip filter output { type filter hook input priority 0; }
                                    ^^^^ ------
      
      The existing chain types after this patch are: filter, route and
      nat. Note that tables are just containers of chains with no specific
      semantics, which is a significant change with regards to iptables.
      
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      9370761c
    • Patrick McHardy's avatar
      netfilter: nft_payload: add optimized payload implementation for small loads · c29b72e0
      Patrick McHardy authored
      
      
      Add an optimized payload expression implementation for small (up to 4 bytes)
      aligned data loads from the linear packet area.
      
      This patch also includes original Patrick McHardy's entitled (nf_tables:
      inline nft_payload_fast_eval() into main evaluation loop).
      
      Signed-off-by: default avatarPatrick McHardy <kaber@trash.net>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      c29b72e0
    • Patrick McHardy's avatar
      netfilter: nf_tables: add optimized data comparison for small values · cb7dbfd0
      Patrick McHardy authored
      
      
      Add an optimized version of nft_data_cmp() that only handles values of to
      4 bytes length.
      
      This patch includes original Patrick McHardy's patch entitled (nf_tables:
      inline nft_cmp_fast_eval() into main evaluation loop).
      
      Signed-off-by: default avatarPatrick McHardy <kaber@trash.net>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      cb7dbfd0
    • Patrick McHardy's avatar
      netfilter: nf_tables: expression ops overloading · ef1f7df9
      Patrick McHardy authored
      
      
      Split the expression ops into two parts and support overloading of
      the runtime expression ops based on the requested function through
      a ->select_ops() callback.
      
      This can be used to provide optimized implementations, for instance
      for loading small aligned amounts of data from the packet or inlining
      frequently used operations into the main evaluation loop.
      
      Signed-off-by: default avatarPatrick McHardy <kaber@trash.net>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      ef1f7df9
    • Patrick McHardy's avatar
      netfilter: nf_tables: add netlink set API · 20a69341
      Patrick McHardy authored
      
      
      This patch adds the new netlink API for maintaining nf_tables sets
      independently of the ruleset. The API supports the following operations:
      
      - creation of sets
      - deletion of sets
      - querying of specific sets
      - dumping of all sets
      
      - addition of set elements
      - removal of set elements
      - dumping of all set elements
      
      Sets are identified by name, each table defines an individual namespace.
      The name of a set may be allocated automatically, this is mostly useful
      in combination with the NFT_SET_ANONYMOUS flag, which destroys a set
      automatically once the last reference has been released.
      
      Sets can be marked constant, meaning they're not allowed to change while
      linked to a rule. This allows to perform lockless operation for set
      types that would otherwise require locking.
      
      Additionally, if the implementation supports it, sets can (as before) be
      used as maps, associating a data value with each key (or range), by
      specifying the NFT_SET_MAP flag and can be used for interval queries by
      specifying the NFT_SET_INTERVAL flag.
      
      Set elements are added and removed incrementally. All element operations
      support batching, reducing netlink message and set lookup overhead.
      
      The old "set" and "hash" expressions are replaced by a generic "lookup"
      expression, which binds to the specified set. Userspace is not aware
      of the actual set implementation used by the kernel anymore, all
      configuration options are generic.
      
      Currently the implementation selection logic is largely missing and the
      kernel will simply use the first registered implementation supporting the
      requested operation. Eventually, the plan is to have userspace supply a
      description of the data characteristics and select the implementation
      based on expected performance and memory use.
      
      This patch includes the new 'lookup' expression to look up for element
      matching in the set.
      
      This patch includes kernel-doc descriptions for this set API and it
      also includes the following fixes.
      
      From Patrick McHardy:
      * netfilter: nf_tables: fix set element data type in dumps
      * netfilter: nf_tables: fix indentation of struct nft_set_elem comments
      * netfilter: nf_tables: fix oops in nft_validate_data_load()
      * netfilter: nf_tables: fix oops while listing sets of built-in tables
      * netfilter: nf_tables: destroy anonymous sets immediately if binding fails
      * netfilter: nf_tables: propagate context to set iter callback
      * netfilter: nf_tables: add loop detection
      
      From Pablo Neira Ayuso:
      * netfilter: nf_tables: allow to dump all existing sets
      * netfilter: nf_tables: fix wrong type for flags variable in newelem
      
      Signed-off-by: default avatarPatrick McHardy <kaber@trash.net>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      20a69341
    • Patrick McHardy's avatar
      netfilter: add nftables · 96518518
      Patrick McHardy authored
      
      
      This patch adds nftables which is the intended successor of iptables.
      This packet filtering framework reuses the existing netfilter hooks,
      the connection tracking system, the NAT subsystem, the transparent
      proxying engine, the logging infrastructure and the userspace packet
      queueing facilities.
      
      In a nutshell, nftables provides a pseudo-state machine with 4 general
      purpose registers of 128 bits and 1 specific purpose register to store
      verdicts. This pseudo-machine comes with an extensible instruction set,
      a.k.a. "expressions" in the nftables jargon. The expressions included
      in this patch provide the basic functionality, they are:
      
      * bitwise: to perform bitwise operations.
      * byteorder: to change from host/network endianess.
      * cmp: to compare data with the content of the registers.
      * counter: to enable counters on rules.
      * ct: to store conntrack keys into register.
      * exthdr: to match IPv6 extension headers.
      * immediate: to load data into registers.
      * limit: to limit matching based on packet rate.
      * log: to log packets.
      * meta: to match metainformation that usually comes with the skbuff.
      * nat: to perform Network Address Translation.
      * payload: to fetch data from the packet payload and store it into
        registers.
      * reject (IPv4 only): to explicitly close connection, eg. TCP RST.
      
      Using this instruction-set, the userspace utility 'nft' can transform
      the rules expressed in human-readable text representation (using a
      new syntax, inspired by tcpdump) to nftables bytecode.
      
      nftables also inherits the table, chain and rule objects from
      iptables, but in a more configurable way, and it also includes the
      original datatype-agnostic set infrastructure with mapping support.
      This set infrastructure is enhanced in the follow up patch (netfilter:
      nf_tables: add netlink set API).
      
      This patch includes the following components:
      
      * the netlink API: net/netfilter/nf_tables_api.c and
        include/uapi/netfilter/nf_tables.h
      * the packet filter core: net/netfilter/nf_tables_core.c
      * the expressions (described above): net/netfilter/nft_*.c
      * the filter tables: arp, IPv4, IPv6 and bridge:
        net/ipv4/netfilter/nf_tables_ipv4.c
        net/ipv6/netfilter/nf_tables_ipv6.c
        net/ipv4/netfilter/nf_tables_arp.c
        net/bridge/netfilter/nf_tables_bridge.c
      * the NAT table (IPv4 only):
        net/ipv4/netfilter/nf_table_nat_ipv4.c
      * the route table (similar to mangle):
        net/ipv4/netfilter/nf_table_route_ipv4.c
        net/ipv6/netfilter/nf_table_route_ipv6.c
      * internal definitions under:
        include/net/netfilter/nf_tables.h
        include/net/netfilter/nf_tables_core.h
      * It also includes an skeleton expression:
        net/netfilter/nft_expr_template.c
        and the preliminary implementation of the meta target
        net/netfilter/nft_meta_target.c
      
      It also includes a change in struct nf_hook_ops to add a new
      pointer to store private data to the hook, that is used to store
      the rule list per chain.
      
      This patch is based on the patch from Patrick McHardy, plus merged
      accumulated cleanups, fixes and small enhancements to the nftables
      code that has been done since 2009, which are:
      
      From Patrick McHardy:
      * nf_tables: adjust netlink handler function signatures
      * nf_tables: only retry table lookup after successful table module load
      * nf_tables: fix event notification echo and avoid unnecessary messages
      * nft_ct: add l3proto support
      * nf_tables: pass expression context to nft_validate_data_load()
      * nf_tables: remove redundant definition
      * nft_ct: fix maxattr initialization
      * nf_tables: fix invalid event type in nf_tables_getrule()
      * nf_tables: simplify nft_data_init() usage
      * nf_tables: build in more core modules
      * nf_tables: fix double lookup expression unregistation
      * nf_tables: move expression initialization to nf_tables_core.c
      * nf_tables: build in payload module
      * nf_tables: use NFPROTO constants
      * nf_tables: rename pid variables to portid
      * nf_tables: save 48 bits per rule
      * nf_tables: introduce chain rename
      * nf_tables: check for duplicate names on chain rename
      * nf_tables: remove ability to specify handles for new rules
      * nf_tables: return error for rule change request
      * nf_tables: return error for NLM_F_REPLACE without rule handle
      * nf_tables: include NLM_F_APPEND/NLM_F_REPLACE flags in rule notification
      * nf_tables: fix NLM_F_MULTI usage in netlink notifications
      * nf_tables: include NLM_F_APPEND in rule dumps
      
      From Pablo Neira Ayuso:
      * nf_tables: fix stack overflow in nf_tables_newrule
      * nf_tables: nft_ct: fix compilation warning
      * nf_tables: nft_ct: fix crash with invalid packets
      * nft_log: group and qthreshold are 2^16
      * nf_tables: nft_meta: fix socket uid,gid handling
      * nft_counter: allow to restore counters
      * nf_tables: fix module autoload
      * nf_tables: allow to remove all rules placed in one chain
      * nf_tables: use 64-bits rule handle instead of 16-bits
      * nf_tables: fix chain after rule deletion
      * nf_tables: improve deletion performance
      * nf_tables: add missing code in route chain type
      * nf_tables: rise maximum number of expressions from 12 to 128
      * nf_tables: don't delete table if in use
      * nf_tables: fix basechain release
      
      From Tomasz Bursztyka:
      * nf_tables: Add support for changing users chain's name
      * nf_tables: Change chain's name to be fixed sized
      * nf_tables: Add support for replacing a rule by another one
      * nf_tables: Update uapi nftables netlink header documentation
      
      From Florian Westphal:
      * nft_log: group is u16, snaplen u32
      
      From Phil Oester:
      * nf_tables: operational limit match
      
      Signed-off-by: default avatarPatrick McHardy <kaber@trash.net>
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      96518518
    • Pablo Neira Ayuso's avatar
      netfilter: nf_nat: move alloc_null_binding to nf_nat_core.c · f59cb045
      Pablo Neira Ayuso authored
      
      
      Similar to nat_decode_session, alloc_null_binding is needed for both
      ip_tables and nf_tables, so move it to nf_nat_core.c. This change
      is required by nf_tables.
      
      This is an adapted version of the original patch from Patrick McHardy.
      
      Signed-off-by: default avatarPablo Neira Ayuso <pablo@netfilter.org>
      f59cb045
  6. Oct 10, 2013
    • Eric Dumazet's avatar
      inet: rename ir_loc_port to ir_num · b44084c2
      Eric Dumazet authored
      
      
      In commit 634fb979 ("inet: includes a sock_common in request_sock")
      I forgot that the two ports in sock_common do not have same byte order :
      
      skc_dport is __be16 (network order), but skc_num is __u16 (host order)
      
      So sparse complains because ir_loc_port (mapped into skc_num) is
      considered as __u16 while it should be __be16
      
      Let rename ir_loc_port to ireq->ir_num (analogy with inet->inet_num),
      and perform appropriate htons/ntohs conversions.
      
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Reported-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      b44084c2
    • Eric Dumazet's avatar
      inet: includes a sock_common in request_sock · 634fb979
      Eric Dumazet authored
      
      
      TCP listener refactoring, part 5 :
      
      We want to be able to insert request sockets (SYN_RECV) into main
      ehash table instead of the per listener hash table to allow RCU
      lookups and remove listener lock contention.
      
      This patch includes the needed struct sock_common in front
      of struct request_sock
      
      This means there is no more inet6_request_sock IPv6 specific
      structure.
      
      Following inet_request_sock fields were renamed as they became
      macros to reference fields from struct sock_common.
      Prefix ir_ was chosen to avoid name collisions.
      
      loc_port   -> ir_loc_port
      loc_addr   -> ir_loc_addr
      rmt_addr   -> ir_rmt_addr
      rmt_port   -> ir_rmt_port
      iif        -> ir_iif
      
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      634fb979
  7. Oct 09, 2013
    • Eric Dumazet's avatar
      net: fix build errors if ipv6 is disabled · c2bb06db
      Eric Dumazet authored
      
      
      CONFIG_IPV6=n is still a valid choice ;)
      
      It appears we can remove dead code.
      
      Reported-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c2bb06db
    • Steffen Klassert's avatar
      ipv6: Add a receive path hook for vti6 in xfrm6_mode_tunnel. · 212e5601
      Steffen Klassert authored
      
      
      Add a receive path hook for the IPsec vritual tunnel interface.
      
      Signed-off-by: default avatarSteffen Klassert <steffen.klassert@secunet.com>
      212e5601
    • Eric Dumazet's avatar
      ipv6: make lookups simpler and faster · efe4208f
      Eric Dumazet authored
      
      
      TCP listener refactoring, part 4 :
      
      To speed up inet lookups, we moved IPv4 addresses from inet to struct
      sock_common
      
      Now is time to do the same for IPv6, because it permits us to have fast
      lookups for all kind of sockets, including upcoming SYN_RECV.
      
      Getting IPv6 addresses in TCP lookups currently requires two extra cache
      lines, plus a dereference (and memory stall).
      
      inet6_sk(sk) does the dereference of inet_sk(__sk)->pinet6
      
      This patch is way bigger than its IPv4 counter part, because for IPv4,
      we could add aliases (inet_daddr, inet_rcv_saddr), while on IPv6,
      it's not doable easily.
      
      inet6_sk(sk)->daddr becomes sk->sk_v6_daddr
      inet6_sk(sk)->rcv_saddr becomes sk->sk_v6_rcv_saddr
      
      And timewait socket also have tw->tw_v6_daddr & tw->tw_v6_rcv_saddr
      at the same offset.
      
      We get rid of INET6_TW_MATCH() as INET6_MATCH() is now the generic
      macro.
      
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      efe4208f
    • Eric Dumazet's avatar
      tcp/dccp: remove twchain · 05dbc7b5
      Eric Dumazet authored
      
      
      TCP listener refactoring, part 3 :
      
      Our goal is to hash SYN_RECV sockets into main ehash for fast lookup,
      and parallel SYN processing.
      
      Current inet_ehash_bucket contains two chains, one for ESTABLISH (and
      friend states) sockets, another for TIME_WAIT sockets only.
      
      As the hash table is sized to get at most one socket per bucket, it
      makes little sense to have separate twchain, as it makes the lookup
      slightly more complicated, and doubles hash table memory usage.
      
      If we make sure all socket types have the lookup keys at the same
      offsets, we can use a generic and faster lookup. It turns out TIME_WAIT
      and ESTABLISHED sockets already have common lookup fields for IPv4.
      
      [ INET_TW_MATCH() is no longer needed ]
      
      I'll provide a follow-up to factorize IPv6 lookup as well, to remove
      INET6_TW_MATCH()
      
      This way, SYN_RECV pseudo sockets will be supported the same.
      
      A new sock_gen_put() helper is added, doing either a sock_put() or
      inet_twsk_put() [ and will support SYN_RECV later ].
      
      Note this helper should only be called in real slow path, when rcu
      lookup found a socket that was moved to another identity (freed/reused
      immediately), but could eventually be used in other contexts, like
      sock_edemux()
      
      Before patch :
      
      dmesg | grep "TCP established"
      
      TCP established hash table entries: 524288 (order: 11, 8388608 bytes)
      
      After patch :
      
      TCP established hash table entries: 524288 (order: 10, 4194304 bytes)
      
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      05dbc7b5
  8. Oct 08, 2013
    • Shawn Bohrer's avatar
      net: ipv4 only populate IP_PKTINFO when needed · fbf8866d
      Shawn Bohrer authored
      
      
      The since the removal of the routing cache computing
      fib_compute_spec_dst() does a fib_table lookup for each UDP multicast
      packet received.  This has introduced a performance regression for some
      UDP workloads.
      
      This change skips populating the packet info for sockets that do not have
      IP_PKTINFO set.
      
      Benchmark results from a netperf UDP_RR test:
      Before 89789.68 transactions/s
      After  90587.62 transactions/s
      
      Benchmark results from a fio 1 byte UDP multicast pingpong test
      (Multicast one way unicast response):
      Before 12.63us RTT
      After  12.48us RTT
      
      Signed-off-by: default avatarShawn Bohrer <sbohrer@rgmadvisors.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      fbf8866d
    • Shawn Bohrer's avatar
      udp: ipv4: Add udp early demux · 421b3885
      Shawn Bohrer authored
      
      
      The removal of the routing cache introduced a performance regression for
      some UDP workloads since a dst lookup must be done for each packet.
      This change caches the dst per socket in a similar manner to what we do
      for TCP by implementing early_demux.
      
      For UDP multicast we can only cache the dst if there is only one
      receiving socket on the host.  Since caching only works when there is
      one receiving socket we do the multicast socket lookup using RCU.
      
      For UDP unicast we only demux sockets with an exact match in order to
      not break forwarding setups.  Additionally since the hash chains may be
      long we only check the first socket to see if it is a match and not
      waste extra time searching the whole chain when we might not find an
      exact match.
      
      Benchmark results from a netperf UDP_RR test:
      Before 87961.22 transactions/s
      After  89789.68 transactions/s
      
      Benchmark results from a fio 1 byte UDP multicast pingpong test
      (Multicast one way unicast response):
      Before 12.97us RTT
      After  12.63us RTT
      
      Signed-off-by: default avatarShawn Bohrer <sbohrer@rgmadvisors.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      421b3885
  9. Oct 07, 2013
    • Alexei Starovoitov's avatar
      net: fix unsafe set_memory_rw from softirq · d45ed4a4
      Alexei Starovoitov authored
      
      
      on x86 system with net.core.bpf_jit_enable = 1
      
      sudo tcpdump -i eth1 'tcp port 22'
      
      causes the warning:
      [   56.766097]  Possible unsafe locking scenario:
      [   56.766097]
      [   56.780146]        CPU0
      [   56.786807]        ----
      [   56.793188]   lock(&(&vb->lock)->rlock);
      [   56.799593]   <Interrupt>
      [   56.805889]     lock(&(&vb->lock)->rlock);
      [   56.812266]
      [   56.812266]  *** DEADLOCK ***
      [   56.812266]
      [   56.830670] 1 lock held by ksoftirqd/1/13:
      [   56.836838]  #0:  (rcu_read_lock){.+.+..}, at: [<ffffffff8118f44c>] vm_unmap_aliases+0x8c/0x380
      [   56.849757]
      [   56.849757] stack backtrace:
      [   56.862194] CPU: 1 PID: 13 Comm: ksoftirqd/1 Not tainted 3.12.0-rc3+ #45
      [   56.868721] Hardware name: System manufacturer System Product Name/P8Z77 WS, BIOS 3007 07/26/2012
      [   56.882004]  ffffffff821944c0 ffff88080bbdb8c8 ffffffff8175a145 0000000000000007
      [   56.895630]  ffff88080bbd5f40 ffff88080bbdb928 ffffffff81755b14 0000000000000001
      [   56.909313]  ffff880800000001 ffff880800000000 ffffffff8101178f 0000000000000001
      [   56.923006] Call Trace:
      [   56.929532]  [<ffffffff8175a145>] dump_stack+0x55/0x76
      [   56.936067]  [<ffffffff81755b14>] print_usage_bug+0x1f7/0x208
      [   56.942445]  [<ffffffff8101178f>] ? save_stack_trace+0x2f/0x50
      [   56.948932]  [<ffffffff810cc0a0>] ? check_usage_backwards+0x150/0x150
      [   56.955470]  [<ffffffff810ccb52>] mark_lock+0x282/0x2c0
      [   56.961945]  [<ffffffff810ccfed>] __lock_acquire+0x45d/0x1d50
      [   56.968474]  [<ffffffff810cce6e>] ? __lock_acquire+0x2de/0x1d50
      [   56.975140]  [<ffffffff81393bf5>] ? cpumask_next_and+0x55/0x90
      [   56.981942]  [<ffffffff810cef72>] lock_acquire+0x92/0x1d0
      [   56.988745]  [<ffffffff8118f52a>] ? vm_unmap_aliases+0x16a/0x380
      [   56.995619]  [<ffffffff817628f1>] _raw_spin_lock+0x41/0x50
      [   57.002493]  [<ffffffff8118f52a>] ? vm_unmap_aliases+0x16a/0x380
      [   57.009447]  [<ffffffff8118f52a>] vm_unmap_aliases+0x16a/0x380
      [   57.016477]  [<ffffffff8118f44c>] ? vm_unmap_aliases+0x8c/0x380
      [   57.023607]  [<ffffffff810436b0>] change_page_attr_set_clr+0xc0/0x460
      [   57.030818]  [<ffffffff810cfb8d>] ? trace_hardirqs_on+0xd/0x10
      [   57.037896]  [<ffffffff811a8330>] ? kmem_cache_free+0xb0/0x2b0
      [   57.044789]  [<ffffffff811b59c3>] ? free_object_rcu+0x93/0xa0
      [   57.051720]  [<ffffffff81043d9f>] set_memory_rw+0x2f/0x40
      [   57.058727]  [<ffffffff8104e17c>] bpf_jit_free+0x2c/0x40
      [   57.065577]  [<ffffffff81642cba>] sk_filter_release_rcu+0x1a/0x30
      [   57.072338]  [<ffffffff811108e2>] rcu_process_callbacks+0x202/0x7c0
      [   57.078962]  [<ffffffff81057f17>] __do_softirq+0xf7/0x3f0
      [   57.085373]  [<ffffffff81058245>] run_ksoftirqd+0x35/0x70
      
      cannot reuse jited filter memory, since it's readonly,
      so use original bpf insns memory to hold work_struct
      
      defer kfree of sk_filter until jit completed freeing
      
      tested on x86_64 and i386
      
      Signed-off-by: default avatarAlexei Starovoitov <ast@plumgrid.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      d45ed4a4
  10. Oct 03, 2013
  11. Oct 02, 2013
Loading