Commits · eb8637cd4a0d651cf4fcc1559231facee829a0ac · jan.koester / Linux

Jul 18, 2012

net/ipv4: VTI support rx-path hook in xfrm4_mode_tunnel. · eb8637cd

Saurabh authored Jul 17, 2012



Incorporated David and Steffen's comments.
Add hook for rx-path xfrm4_mode_tunnel for VTI tunnel module.

Signed-off-by: Saurabh Mohan <saurabh.mohan@vyatta.com>
Reviewed-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

eb8637cd

tcp: refine SYN handling in tcp_validate_incoming · e3715899

Eric Dumazet authored Jul 17, 2012



Followup of commit 0c24604b (tcp: implement RFC 5961 4.2)

As reported by Vijay Subramanian, we should send a challenge ACK
instead of a dup ack if a SYN flag is set on a packet received out of
window.

This permits the ratelimiting to work as intended, and to increment
the correct SNMP counters.

Suggested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Vijay Subramanian <subramanian.vijay@gmail.com>
Cc: Kiran Kumar Kella <kkiran@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e3715899
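The out-of-window handling described in the commit above can be sketched as a small decision function. This is a simplified userspace model, not the kernel's tcp_validate_incoming(); the enum and function names are hypothetical. In every case the segment itself is discarded, and the return value only says which reply, if any, is sent.

```c
#include <stdbool.h>

/* Reply chosen for a segment whose sequence number is outside the window.
 * Names are illustrative assumptions, not kernel identifiers. */
enum oow_reply {
    OOW_NO_REPLY,       /* out-of-window RST: silently dropped */
    OOW_DUP_ACK,        /* ordinary out-of-window segment */
    OOW_CHALLENGE_ACK   /* out-of-window SYN: rate-limited challenge ACK */
};

enum oow_reply out_of_window_reply(bool rst, bool syn)
{
    if (rst)
        return OOW_NO_REPLY;
    if (syn)
        return OOW_CHALLENGE_ACK;  /* previously this case got a dup ACK */
    return OOW_DUP_ACK;
}
```

Routing the SYN case through the challenge-ACK path is what lets the rate limit and the SNMP counter apply to it.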

Jul 17, 2012

ipv4: fix rcu splat · 5abf7f7e

Eric Dumazet authored Jul 17, 2012



free_nh_exceptions() should use rcu_dereference_protected(..., 1)
since it's called after one RCU grace period.

Also add some const-ification in recent code.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5abf7f7e

ipv4: Fix nexthop exception hash computation. · d3a25c98

David S. Miller authored Jul 17, 2012



Need to mask it with (FNHE_HASH_SIZE - 1).

Signed-off-by: David S. Miller <davem@davemloft.net>

d3a25c98
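Why the mask in the commit above matters can be shown with a short sketch. It assumes a table size of 2048 (taken from the "Add FIB nexthop exceptions" commit message) and uses a made-up mixing function, mix32(), which is not the kernel's hash. Because the size is a power of two, ANDing with (FNHE_HASH_SIZE - 1) keeps the low 11 bits and always yields a valid bucket index.

```c
#include <stdint.h>

/* Assumed table size; must be a power of two for the mask to work. */
#define FNHE_HASH_SIZE 2048u

/* Hypothetical mixing step for illustration only. */
uint32_t mix32(uint32_t x)
{
    x ^= x >> 16;
    x *= 0x45d9f3bu;
    x ^= x >> 16;
    return x;
}

/* Reduce the mixed value to an index in [0, FNHE_HASH_SIZE). */
uint32_t fnhe_bucket(uint32_t daddr)
{
    return mix32(daddr) & (FNHE_HASH_SIZE - 1);
}
```

Without the mask, the raw 32-bit value would index far past the end of the table.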

ipv4: Add FIB nexthop exceptions. · 4895c771

David S. Miller authored Jul 17, 2012



In a regime where we have subnetted route entries, we need a way to
keep persistent storage for destination-specific learned values
such as redirects and PMTU values.

This is implemented here via nexthop exceptions.

The initial implementation is a 2048 entry hash table with reclaiming
starting at chain length 5.  A more sophisticated scheme can be
devised if that proves necessary.

Signed-off-by: David S. Miller <davem@davemloft.net>

4895c771
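The reclaim policy described in the commit above can be sketched as follows. This is a simplified userspace model under stated assumptions: the struct layout, the stamp field, and the function name fnhe_pick_slot() are hypothetical, and the exact comparison against the depth threshold is a guess from the phrase "starting at chain length 5".

```c
#include <stddef.h>

/* Chain length at which entries are recycled (from the commit text). */
#define FNHE_RECLAIM_DEPTH 5

/* Hypothetical exception entry, one per destination in a hash chain. */
struct fnhe {
    struct fnhe *next;
    unsigned int daddr;   /* destination the exception applies to */
    unsigned int pmtu;    /* learned path MTU, 0 if none */
    unsigned long stamp;  /* last-update time, arbitrary units */
};

/* Pick the entry to write for daddr within one chain:
 * - an existing entry for daddr, if present;
 * - otherwise the oldest entry, once the chain is long enough;
 * - otherwise NULL, meaning the caller should allocate a new entry. */
struct fnhe *fnhe_pick_slot(struct fnhe *head, unsigned int daddr)
{
    struct fnhe *oldest = NULL;
    int depth = 0;

    for (struct fnhe *e = head; e != NULL; e = e->next) {
        if (e->daddr == daddr)
            return e;
        if (oldest == NULL || e->stamp < oldest->stamp)
            oldest = e;
        depth++;
    }
    if (depth >= FNHE_RECLAIM_DEPTH)
        return oldest;
    return NULL;
}
```

Recycling the least recently updated entry bounds the chain length without needing a separate garbage collector.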

tcp: implement RFC 5961 4.2 · 0c24604b

Eric Dumazet authored Jul 17, 2012



Implement the RFC 5961 mitigation against Blind
Reset attack using SYN bit.

Section 4.2 of RFC 5961 advises to send a Challenge ACK and drop
incoming packet, instead of resetting the session.

Add a new SNMP counter to count number of challenge acks sent
in response to SYN packets.
(netstat -s | grep TCPSYNChallenge)

Remove obsolete TCPAbortOnSyn, since we no longer abort a TCP session
because of a SYN flag.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kiran Kumar Kella <kkiran@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0c24604b

net: Pass optional SKB and SK arguments to dst_ops->{update_pmtu,redirect}() · 6700c270

David S. Miller authored Jul 17, 2012

This will be used so that we can compose a full flow key.

Even though we have a route in this context, we need more. In the
future the routes will be without destination address, source address,
etc. keying. One ipv4 route will cover entire subnets, etc.

In this environment we have to have a way to possess persistent storage
for redirects and PMTU information. This persistent storage will exist
in the FIB tables, and that's why we'll need to be able to rebuild a
full lookup flow key here. Using that flow key will do a fib_lookup()
and create/update the persistent entry.

Signed-off-by: David S. Miller <davem@davemloft.net>

6700c270

tcp: implement RFC 5961 3.2 · 282f23c6

Eric Dumazet authored Jul 17, 2012



Implement the RFC 5961 mitigation against Blind
Reset attack using RST bit.

The idea is to validate the incoming RST sequence number
against the exact RCV.NXT value, instead of the previously accepted
window: (RCV.NXT <= SEG.SEQ < RCV.NXT+RCV.WND)

If sequence is in window but not an exact match, send
a "challenge ACK", so that the other party can resend an
RST with the appropriate sequence.

Add a new sysctl, tcp_challenge_ack_limit, to limit the
number of challenge ACKs sent per second.

Add a new SNMP counter to count number of challenge acks sent.
(netstat -s | grep TCPChallengeACK)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kiran Kumar Kella <kkiran@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

282f23c6
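The RST check and the per-second limit from the commit above can be sketched in a few lines. This is a userspace model, not the kernel code: the enum, struct, and function names are hypothetical, and the limiter is one plausible way to implement a per-second cap rather than the actual implementation.

```c
#include <stdbool.h>
#include <stdint.h>

enum rst_action { RST_DISCARD, RST_CHALLENGE_ACK, RST_RESET };

/* Classify an incoming RST by its sequence number (RFC 5961, section 3.2). */
enum rst_action classify_rst(uint32_t seg_seq, uint32_t rcv_nxt, uint32_t rcv_wnd)
{
    if (seg_seq == rcv_nxt)
        return RST_RESET;            /* exact match: accept the reset */
    /* Unsigned subtraction keeps the window test correct across wraparound. */
    if ((uint32_t)(seg_seq - rcv_nxt) < rcv_wnd)
        return RST_CHALLENGE_ACK;    /* in window, but not exact */
    return RST_DISCARD;              /* out of window */
}

/* Hypothetical limiter: at most `limit` challenge ACKs per wall-clock second. */
struct challenge_limiter {
    uint32_t limit;
    uint64_t second;   /* second the current count belongs to */
    uint32_t count;    /* challenge ACKs sent in that second */
};

bool challenge_ack_allowed(struct challenge_limiter *cl, uint64_t now_sec)
{
    if (cl->second != now_sec) {
        cl->second = now_sec;
        cl->count = 0;
    }
    if (cl->count >= cl->limit)
        return false;
    cl->count++;
    return true;
}
```

A blind attacker must now guess RCV.NXT exactly instead of landing anywhere in the receive window, and the limiter keeps the challenge ACKs themselves from becoming an amplification tool.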

net: make sock diag per-namespace · 51d7cccf

Andrey Vagin authored Jul 16, 2012



Before this patch sock_diag works for init_net only and dumps
information about sockets from all namespaces.

This patch expands sock_diag for all name-spaces.
It creates a netlink kernel socket for each netns and filters
data during dumping.

v2: filter according to netns in all places
    remove an unused variable.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Pavel Emelyanov <xemul@parallels.com>
CC: Eric Dumazet <eric.dumazet@gmail.com>
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

51d7cccf

tcp: add OFO snmp counters · a6df1ae9

Eric Dumazet authored Jul 16, 2012



Add three SNMP TCP counters, to better track TCP behavior
at global stage (netstat -s), when packets are received
Out Of Order (OFO)

TCPOFOQueue : Number of packets queued in OFO queue

TCPOFODrop  : Number of packets meant to be queued in OFO
              but dropped because socket rcvbuf limit hit.

TCPOFOMerge : Number of packets in OFO that were merged with
              other packets.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a6df1ae9

Jul 16, 2012

ipv4: Add helper inet_csk_update_pmtu(). · 80d0a69f

David S. Miller authored Jul 16, 2012



This abstracts away the call to dst_ops->update_pmtu() so that we can
transparently handle the fact that, in the future, the dst itself can
be invalidated by the PMTU update (when we have non-host routes cached
in sockets).

So we try to rebuild the socket cached route after the method
invocation if necessary.

This isn't used by SCTP because it needs to cache dsts per-transport,
and thus will need its own local version of this helper.

Signed-off-by: David S. Miller <davem@davemloft.net>

80d0a69f

Jul 13, 2012

ipv4: Don't store a rule pointer in fib_result. · 85b91b03

David S. Miller authored Jul 13, 2012



We only use it to fetch the rule's tclassid, so just store the
tclassid there instead.

This also decreases the size of fib_result by a full 8 bytes on
64-bit.  On 32-bits it's a wash.

Signed-off-by: David S. Miller <davem@davemloft.net>

85b91b03

tcp: add LAST_ACK as a valid state for TSQ · d01cb207

Eric Dumazet authored Jul 12, 2012

Socket state LAST_ACK should allow TSQ to send additional frames,
or else we rely on incoming ACKS or timers to send them.

Reported-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d01cb207

Jul 12, 2012

ipv4: Remove tb_peers from fib_table. · 391e5c22
David S. Miller authored Jul 12, 2012
```
No longer used.

Signed-off-by: David S. Miller <davem@davemloft.net>
```
391e5c22

ipv4: Put proper checks into icmp_socket_deliver(). · f0a70e90

David S. Miller authored Jul 12, 2012



All handler->err() routines expect that we've done a pskb_may_pull()
test to make sure that IP header length + 8 bytes can be safely
pulled.

Reported-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f0a70e90
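The length requirement in the commit above can be illustrated with a bounds check on a plain byte buffer. This is a userspace analogue, not the kernel's pskb_may_pull(); the function name is hypothetical, and reading the header length from the embedded IPv4 header's IHL field is an assumption about what "IP header length" means here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if buf holds a complete embedded IPv4 header plus the first
 * 8 bytes of its payload, which error handlers read to find ports. */
bool icmp_inner_ok(const uint8_t *buf, size_t len)
{
    size_t ihl;

    if (len < 20)                       /* shorter than a minimal IPv4 header */
        return false;
    ihl = (size_t)(buf[0] & 0x0f) * 4;  /* IHL is counted in 32-bit words */
    if (ihl < 20)
        return false;
    return len >= ihl + 8;
}
```

Doing the check once before dispatch means each handler->err() routine can read the header and the first 8 payload bytes without repeating it.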

ipv4: Fix warnings in ip_do_redirect() for some configurations. · 99ee038d
David S. Miller authored Jul 12, 2012
```
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
```
99ee038d
net: Remove checks for dst_ops->redirect being NULL. · 1ed5c48f
David S. Miller authored Jul 12, 2012
```
No longer necessary.

Signed-off-by: David S. Miller <davem@davemloft.net>
```
1ed5c48f
net: Add dummy dst_ops->redirect method where needed. · b587ee3b
David S. Miller authored Jul 12, 2012
```
Signed-off-by: David S. Miller <davem@davemloft.net>
```
b587ee3b

ipv4: Kill ip_rt_redirect(). · 1f42539d

David S. Miller authored Jul 11, 2012



No longer needed, as the protocol handlers now all properly
propagate the redirect back into the routing code.

Signed-off-by: David S. Miller <davem@davemloft.net>

1f42539d

ipv4: Add redirect support to all protocol icmp error handlers. · 55be7a9c
David S. Miller authored Jul 11, 2012
```
Signed-off-by: David S. Miller <davem@davemloft.net>
```
55be7a9c
ipv4: Add ipv4_redirect() and ipv4_sk_redirect() helper functions. · b42597e2
David S. Miller authored Jul 11, 2012
```
Signed-off-by: David S. Miller <davem@davemloft.net>
```
b42597e2
ipv4: Generalize ip_do_redirect() and hook into new dst_ops->redirect. · e47a185b
David S. Miller authored Jul 11, 2012
```
All of the redirect acceptance policy is now contained within.

Signed-off-by: David S. Miller <davem@davemloft.net>
```
e47a185b

ipv4: Rearrange arguments to ip_rt_redirect() · 94206125

David S. Miller authored Jul 11, 2012



Pass in the SKB rather than just the IP addresses, so that policy
and other aspects can reside in ip_rt_redirect() rather than
icmp_redirect().

Signed-off-by: David S. Miller <davem@davemloft.net>

94206125

ipv4: Pull redirect instantiation out into a helper function. · d0da720f
David S. Miller authored Jul 11, 2012
```
Signed-off-by: David S. Miller <davem@davemloft.net>
```
d0da720f

ipv4: Deliver ICMP redirects to sockets too. · d3351b75

David S. Miller authored Jul 11, 2012



And thus, we can remove the ping_err() hack.

Signed-off-by: David S. Miller <davem@davemloft.net>

d3351b75

ipv4: Pull icmp socket delivery out into a helper function. · 1de9243b
David S. Miller authored Jul 11, 2012
```
Signed-off-by: David S. Miller <davem@davemloft.net>
```
1de9243b

tcp: TCP Small Queues · 46d3ceab

Eric Dumazet authored Jul 11, 2012



This introduces TSQ (TCP Small Queues).

The goal of TSQ is to reduce the number of TCP packets in xmit queues
(qdisc & device queues), to reduce RTT and cwnd bias, part of the
bufferbloat problem.

sk->sk_wmem_alloc is not allowed to grow above a given limit,
allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a
given time.

TSO packets are sized/capped to half the limit, so that we have two
TSO packets in flight, allowing better bandwidth use.

As a side effect, setting the limit to 40000 automatically reduces the
standard gso max limit (65536) to 40000/2 : It can help to reduce
latencies of high prio packets, having smaller TSO packets.

This means we divert sock_wfree() to a tcp_wfree() handler, to
queue/send following frames when skb_orphan() [2] is called for the
already queued skbs.

Results on my dev machines (tg3/ixgbe nics) are really impressive,
using standard pfifo_fast, and with or without TSO/GSO.

Without reduction of nominal bandwidth, we have reduction of buffering
per bulk sender :
< 1ms on Gbit (instead of 50ms with TSO)
< 8ms on 100Mbit (instead of 132 ms)

I no longer have 4 MBytes backlogged in qdisc by a single netperf
session, and both side socket autotuning no longer use 4 Mbytes.

As skb destructor cannot restart xmit itself ( as qdisc lock might be
taken at this point ), we delegate the work to a tasklet. We use one
tasklet per cpu for performance reasons.

If tasklet finds a socket owned by the user, it sets TSQ_OWNED flag.
This flag is tested in a new protocol method called from release_sock(),
to eventually send new segments.

[1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
[2] skb_orphan() is usually called at TX completion time,
  but some drivers call it in their start_xmit() handler.
  These drivers should at least use BQL, or else a single TCP
  session can still fill the whole NIC TX ring, since TSQ will
  have no effect.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Dave Taht <dave.taht@bufferbloat.net>
Cc: Tom Herbert <therbert@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

46d3ceab
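The throttling rule in the commit above can be modelled in a few functions. This is a simplified userspace sketch under stated assumptions: struct tsq_sock and the function names are hypothetical, the tasklet and locking are left out, and wmem_alloc stands in for sk->sk_wmem_alloc.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-socket state for the sketch. */
struct tsq_sock {
    uint32_t wmem_alloc;  /* bytes currently queued in qdisc/device layers */
    bool throttled;       /* set when a send was held back by the limit */
};

/* May one more packet be handed to the lower layers? */
bool tsq_may_xmit(struct tsq_sock *sk, uint32_t limit_output_bytes)
{
    if (sk->wmem_alloc >= limit_output_bytes) {
        sk->throttled = true;  /* resume later, from the free callback */
        return false;
    }
    return true;
}

/* Called when a queued packet is freed; assumes bytes <= wmem_alloc.
 * Returns true if the caller should schedule more sending. */
bool tsq_on_free(struct tsq_sock *sk, uint32_t bytes)
{
    sk->wmem_alloc -= bytes;
    if (sk->throttled) {
        sk->throttled = false;
        return true;
    }
    return false;
}

/* TSO packets are capped to half the limit, so two can be in flight. */
uint32_t tsq_tso_cap(uint32_t gso_max_size, uint32_t limit_output_bytes)
{
    uint32_t half = limit_output_bytes / 2;

    return half < gso_max_size ? half : gso_max_size;
}
```

With the limit set to 40000, tsq_tso_cap(65536, 40000) gives 20000, which matches the 40000/2 figure in the commit message.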

tcp: Fix out of bounds access to tcpm_vals · 2100844c

Alexander Duyck authored Jul 11, 2012



The recent patch "tcp: Maintain dynamic metrics in local cache." introduced
an out of bounds access due to what appears to be a typo.   I believe this
change should resolve the issue by replacing the access to RTAX_CWND with
TCP_METRIC_CWND.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2100844c

Jul 11, 2012

net: Fix non-kernel-doc comments with kernel-doc start marker · ae86b9e3

Ben Hutchings authored Jul 10, 2012



Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ae86b9e3

net: Fix (nearly-)kernel-doc comments for various functions · 2c53040f

Ben Hutchings authored Jul 10, 2012



Fix incorrect start markers, wrapped summary lines, missing section
breaks, incorrect separators, and some name mismatches.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2c53040f

ipv4: Remove inetpeer from routes. · f185071d
David S. Miller authored Jul 10, 2012
```
No longer used.

Signed-off-by: David S. Miller <davem@davemloft.net>
```
f185071d

ipv4: Calling ->cow_metrics() now is a bug. · 31248731

David S. Miller authored Jul 10, 2012



Nothing ever writes to ipv4 metrics any longer.

PMTU is stored in rt->rt_pmtu.

Dynamic TCP metrics are stored in a special TCP metrics cache,
completely outside of the routes.

Therefore ->cow_metrics() can simply be nothing more than a WARN_ON
trigger so we can catch anyone who tries to add new writes to
ipv4 route metrics.

Signed-off-by: David S. Miller <davem@davemloft.net>

31248731

ipv4: Kill dst_copy_metrics() call from ipv4_blackhole_route(). · 2db2d67e

David S. Miller authored Jul 10, 2012



Blackhole routes have a COW metrics operation that returns NULL
always, therefore this dst_copy_metrics() call did absolutely
nothing.

Signed-off-by: David S. Miller <davem@davemloft.net>

2db2d67e

ipv4: Enforce max MTU metric at route insertion time. · 710ab6c0
David S. Miller authored Jul 10, 2012
```
Rather than at every struct rtable creation.

Signed-off-by: David S. Miller <davem@davemloft.net>
```
710ab6c0

ipv4: Maintain redirect and PMTU info in struct rtable again. · 5943634f

David S. Miller authored Jul 10, 2012



Maintaining this in the inetpeer entries was not the right way to do
this at all.

Signed-off-by: David S. Miller <davem@davemloft.net>

5943634f

rtnetlink: Remove ts/tsage args to rtnl_put_cacheinfo(). · 87a50699
David S. Miller authored Jul 10, 2012
```
Nobody provides non-zero values any longer.

Signed-off-by: David S. Miller <davem@davemloft.net>
```
87a50699

inet: Kill FLOWI_FLAG_PRECOW_METRICS. · 3e12939a

David S. Miller authored Jul 10, 2012



No longer needed.  TCP writes metrics, but now in its own special
cache that does not dirty the route metrics.  Therefore there is no
longer any reason to pre-cow metrics in this way.

Signed-off-by: David S. Miller <davem@davemloft.net>

3e12939a

inet: Minimize use of cached route inetpeer. · 1d861aa4

David S. Miller authored Jul 10, 2012



Only use it in the absolutely required cases:

1) COW'ing metrics

2) ipv4 PMTU

3) ipv4 redirects

Signed-off-by: David S. Miller <davem@davemloft.net>

1d861aa4

inet: Remove ->get_peer() method. · 16d18399
David S. Miller authored Jul 10, 2012
```
No longer used.

Signed-off-by: David S. Miller <davem@davemloft.net>
```
16d18399
tcp: Remove tw->tw_peer · b6242b9b
David S. Miller authored Jul 10, 2012
```
No longer used.

Signed-off-by: David S. Miller <davem@davemloft.net>
```
b6242b9b