- Apr 12, 2014
-
-
Nicolas Dichtel authored
Before the patch, it was possible to add the same tunnel twice:

  ip l a vti1 type vti remote 10.16.0.121 local 10.16.0.249 key 41
  ip l a vti2 type vti remote 10.16.0.121 local 10.16.0.249 key 41

This was possible because ip_tunnel_newlink() calls ip_tunnel_find() with the argument dev->type, which was only set later (when calling the ndo_init handler in register_netdevice()). Let's set this type in the setup handler, which is called before the newlink handler. Introduced by commit b9959fd3 ("vti: switch to new ip tunnel code"). CC: Cong Wang <amwang@redhat.com> CC: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by:
Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Nicolas Dichtel authored
Before the patch, it was possible to add the same tunnel twice:

  ip l a gre1 type gre remote 10.16.0.121 local 10.16.0.249
  ip l a gre2 type gre remote 10.16.0.121 local 10.16.0.249

This was possible because ip_tunnel_newlink() calls ip_tunnel_find() with the argument dev->type, which was only set later (when calling the ndo_init handler in register_netdevice()). Let's set this type in the setup handler, which is called before the newlink handler. Introduced by commit c5441932 ("GRE: Refactor GRE tunneling code."). CC: Pravin B Shelar <pshelar@nicira.com> Signed-off-by:
Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
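Both entries above share the same shape of fix: assign dev->type in the rtnl setup handler so ip_tunnel_find() already sees it from ip_tunnel_newlink(). A minimal sketch under that assumption (the ARPHRD constants and ops names follow the usual ip_vti.c/ip_gre.c code, not the literal upstream hunks):

    static void vti_tunnel_setup(struct net_device *dev)
    {
            dev->netdev_ops = &vti_netdev_ops;
            dev->type       = ARPHRD_TUNNEL;   /* known before register_netdevice()/newlink */
            ip_tunnel_setup(dev, vti_net_id);
    }

    static void ipgre_tunnel_setup(struct net_device *dev)
    {
            dev->netdev_ops = &ipgre_netdev_ops;
            dev->type       = ARPHRD_IPGRE;    /* same idea for GRE */
            ip_tunnel_setup(dev, ipgre_net_id);
    }

With dev->type populated at setup time, the ip_tunnel_find() lookup can match an already existing tunnel with identical parameters and reject the duplicate.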
-
Daniel Borkmann authored
Similarly to commit 43279500 ("packet: respect devices with LLTX flag in direct xmit"), we can basically apply the very same to pktgen. This will help testing against LLTX devices such as the dummy driver (or others), which only have a single netdevice txq and would otherwise require taking their txq lock from the pktgen side, even though, e.g. in the dummy case, no locking is needed. Fix this by making use of the HARD_TX_{UN,}LOCK API, so that NETIF_F_LLTX will be respected. Signed-off-by:
Daniel Borkmann <dborkman@redhat.com> Signed-off-by:
Jesper Dangaard Brouer <brouer@redhat.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
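A hedged sketch of the locking shape in pktgen_xmit() after this change (local variable names are assumptions and handling of the transmit return code is elided; HARD_TX_LOCK/HARD_TX_UNLOCK skip taking the queue lock when the driver advertises NETIF_F_LLTX):

    local_bh_disable();

    HARD_TX_LOCK(odev, txq, smp_processor_id());

    if (unlikely(netif_xmit_frozen_or_drv_stopped(txq))) {
            ret = NETDEV_TX_BUSY;
            pkt_dev->last_ok = 0;
            goto unlock;
    }

    atomic_inc(&pkt_dev->skb->users);
    ret = odev->netdev_ops->ndo_start_xmit(pkt_dev->skb, odev);
    /* ... account for ret (NETDEV_TX_OK, NETDEV_TX_BUSY, ...) ... */

unlock:
    HARD_TX_UNLOCK(odev, txq);

    local_bh_enable();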
-
- Apr 11, 2014
-
-
Lorenzo Colitti authored
net-next commit 9c76a114 ("ipv6: tcp_ipv6 policy route issue") had a boolean logic error that caused incorrect behaviour for TCP SYN+ACK when oif-based rules are in use. Specifically:

1. If a SYN comes in from a global address, and sk_bound_dev_if is not set, the routing lookup has oif set to the interface the SYN came in on. Instead, it should have oif unset, because for global addresses, the incoming interface doesn't necessarily have any bearing on the interface the SYN+ACK is sent out on.

2. If a SYN comes in from a link-local address, and sk_bound_dev_if is set, the routing lookup has oif set to the interface the SYN came in on. Instead, it should have oif set to sk_bound_dev_if, because that's what the application requested. Signed-off-by:
Lorenzo Colitti <lorenzo@google.com> Acked-by:
Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by:
David S. Miller <davem@davemloft.net>
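A hedged sketch of the intended interface selection for the SYN+ACK route lookup, derived from the two cases above (illustrative only, not the literal upstream hunk; the ireq/fl6 naming follows the usual request-sock and flow structures):

    if (sk->sk_bound_dev_if) {
            /* The application bound the listener to a device: honour it. */
            fl6->flowi6_oif = sk->sk_bound_dev_if;
    } else if (ipv6_addr_type(&fl6->daddr) & IPV6_ADDR_LINKLOCAL) {
            /* Link-local peer: the reply must leave via the ingress device. */
            fl6->flowi6_oif = ireq->ir_iif;
    } else {
            /* Global peer, unbound socket: leave oif unset and let routing
             * (including oif-based rules) pick the output interface. */
            fl6->flowi6_oif = 0;
    }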
-
David S. Miller authored
Several spots in the kernel perform a sequence like:

  skb_queue_tail(&sk->sk_receive_queue, skb);
  sk->sk_data_ready(sk, skb->len);

But at the moment we place the SKB onto the socket receive queue it can be consumed and freed up. So this skb->len access is potentially to freed up memory. Furthermore, the skb->len can be modified by the consumer, so it is possible that the value isn't accurate. And finally, no actual implementation of this callback actually uses the length argument. And since nobody actually cared about its value, lots of call sites pass arbitrary values in, such as '0' and even '1'. So just remove the length argument from the callback; that way there is no confusion whatsoever and all of these use-after-free cases get fixed as a side effect. Based upon a patch by Eric Dumazet and his suggestion to audit this issue tree-wide. Signed-off-by:
David S. Miller <davem@davemloft.net>
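A sketch of the resulting callback shape (the member lives in struct sock; the call-site pattern is the generic one quoted above, not one specific subsystem):

    /* include/net/sock.h: the callback loses its length argument */
    void (*sk_data_ready)(struct sock *sk);

    /* call sites then simply become: */
    skb_queue_tail(&sk->sk_receive_queue, skb);
    sk->sk_data_ready(sk);      /* nothing read back from the queued skb */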
-
Toshiaki Makita authored
br_allowed_ingress() has two problems:

1. If br_allowed_ingress() is called by br_handle_frame_finish() and vlan_untag() in br_allowed_ingress() fails, skb will be freed by both vlan_untag() and br_handle_frame_finish().

2. If br_allowed_ingress() is called by br_dev_xmit() and br_allowed_ingress() fails, the skb will not be freed.

Fix these two problems by freeing the skb in br_allowed_ingress() if it fails. Signed-off-by:
Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by:
David S. Miller <davem@davemloft.net>
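A sketch of the ownership rule after the fix (assuming the usual bridge VLAN ingress helper shape of that era; the error label is illustrative):

    static bool br_allowed_ingress(struct net_bridge *br,
                                   struct net_port_vlans *v,
                                   struct sk_buff *skb, u16 *vid)
    {
            /* ... untagging / VLAN filtering ... */

    drop:
            kfree_skb(skb);   /* consume on failure: callers must not free again */
            return false;
    }

Both callers, br_handle_frame_finish() and br_dev_xmit(), then just bail out on a false return without touching the skb themselves.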
-
Florian Westphal authored
In case of tcp, gso_size contains the tcpmss. For UFO (udp fragmentation offloading) skbs, gso_size is the fragment payload size, i.e. we must not account for udp header size. Otherwise, when using virtio drivers, a to-be-forwarded UFO GSO packet will be needlessly fragmented in the forward path, because we think its individual segments are too large for the outgoing link. Fixes: fe6cc55f ("net: ip, ipv6: handle gso skbs in forwarding path") Cc: Eric Dumazet <eric.dumazet@gmail.com> Reported-by:
Tobias Brunner <tobias@strongswan.org> Signed-off-by:
Florian Westphal <fw@strlen.de> Signed-off-by:
David S. Miller <davem@davemloft.net>
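A hedged sketch of the segment-length computation after the fix (roughly the shape of skb_gso_transport_seglen(); treat it as illustrative rather than the literal hunk):

    static unsigned int skb_gso_transport_seglen(const struct sk_buff *skb)
    {
            const struct skb_shared_info *shinfo = skb_shinfo(skb);

            if (likely(shinfo->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6)))
                    return tcp_hdrlen(skb) + shinfo->gso_size;

            /* UFO: gso_size already is the fragment payload size, so do not
             * add sizeof(struct udphdr) on top of it. */
            return shinfo->gso_size;
    }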
-
- Apr 09, 2014
-
-
Dmitry Petukhov authored
When l2tp driver tries to get PMTU for the tunnel destination, it uses the pointer to struct sock that represents PPPoX socket, while it should use the pointer that represents UDP socket of the tunnel. Signed-off-by:
Dmitry Petukhov <dmgenp@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
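A minimal sketch of the change in pppol2tp_connect() (variable names are assumptions based on the surrounding code):

    /* before: 'sk' is the PPPoX socket, not the socket that carries the
     * tunnel's route */
    u32 pmtu = dst_mtu(__sk_dst_get(sk));

    /* after: use the tunnel's UDP socket, which holds the dst entry */
    u32 pmtu = dst_mtu(__sk_dst_get(tunnel->sock));

    if (pmtu != 0)
            session->mtu = session->mru = pmtu - PPPOL2TP_HEADER_OVERHEAD;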
-
Daniel Borkmann authored
In sctp_wake_up_waiters(), we need to test whether the association is declared dead. If so, we don't have any reference to a possible sibling association anymore and need to invoke sctp_write_space() instead, which normally walks the socket's associations and notifies them of new wmem space. The reason for special casing is that otherwise, we could run into the following issue when a sctp_primitive_SEND() call from sctp_sendmsg() fails, and tries to flush an association's outq, i.e. in the following way:

  sctp_association_free()
  `-> list_del(&asoc->asocs)          <-- poisons list pointer
      asoc->base.dead = true
      sctp_outq_free(&asoc->outqueue)
      `-> __sctp_outq_teardown()
       `-> sctp_chunk_free()
        `-> consume_skb()
         `-> sctp_wfree()
          `-> sctp_wake_up_waiters()  <-- dereferences poisoned pointers
                                          if asoc->ep->sndbuf_policy=0

Therefore, only walk the list in an 'optimized' way if we find that the current association is still active. We could also use list_del_init() in addition when we call sctp_association_free(), but as Vlad suggests, we want to trap such bugs and thus leave it poisoned as is.

Why is it safe to resolve the issue by testing for asoc->base.dead? Parallel calls to sctp_sendmsg() are protected under the socket lock, that is lock_sock()/release_sock(). Only within that path, under the lock held, do we set the skb/chunk owner via sctp_set_owner_w(). Eventually, chunks are freed directly by an association still under that lock. So when traversing the association list on destruction time from sctp_wake_up_waiters() via sctp_wfree(), a different CPU can't be running sctp_wfree() while another one calls sctp_association_free(), as both happen under the same lock. Therefore, this can also not race with setting/testing against asoc->base.dead, as we are guaranteed for this to happen in order, under lock.

Further, Vlad says: the times we check asoc->base.dead is when we've cached an association pointer for later processing. In between cache and processing, the association may have been freed and is simply still around due to reference counts. We check asoc->base.dead under a lock, so it should always be safe to check and not race against sctp_association_free(). Stress-testing seems fine now, too.

Fixes: cd253f9f357d ("net: sctp: wake up all assocs if sndbuf policy is per socket") Signed-off-by:
Daniel Borkmann <dborkman@redhat.com> Cc: Vlad Yasevich <vyasevic@redhat.com> Acked-by:
Neil Horman <nhorman@tuxdriver.com> Acked-by:
Vlad Yasevich <vyasevic@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
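A sketch of the test described above, at the top of sctp_wake_up_waiters() (illustrative; the list walk itself is shown with the next entry's sketch):

    static void sctp_wake_up_waiters(struct sock *sk,
                                     struct sctp_association *asoc)
    {
            /* Per-association accounting: only our own waiters matter. */
            if (asoc->ep->sndbuf_policy) {
                    __sctp_write_space(asoc);
                    return;
            }

            /* The association is being torn down; its asocs list pointers
             * are already poisoned, so notify everyone on the socket the
             * conventional way instead of walking the list. */
            if (asoc->base.dead) {
                    sctp_write_space(sk);
                    return;
            }

            /* ... otherwise walk the socket's association list ... */
    }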
-
- Apr 08, 2014
-
-
Daniel Borkmann authored
SCTP charges chunks for wmem accounting via skb->truesize in sctp_set_owner_w(), and sctp_wfree() respectively as the reverse operation. If a sender runs out of wmem, it needs to wait via sctp_wait_for_sndbuf(), and gets woken up by a call to __sctp_write_space(), mostly via sctp_wfree(). __sctp_write_space() is being called per association. Although we assign sk->sk_write_space() to sctp_write_space(), which is then being done per socket, it is only used if send space is increased per socket option (SO_SNDBUF), as SOCK_USE_WRITE_QUEUE is set and it is therefore not invoked in sock_wfree().

Commit 4c3a5bda ("sctp: Don't charge for data in sndbuf again when transmitting packet") fixed an issue where, in case sctp_packet_transmit() manages to queue up more than sndbuf bytes, sctp_wait_for_sndbuf() will never be woken up again unless it is interrupted by a signal. However, a still remaining issue is that if net.sctp.sndbuf_policy=0, that is, accounting per socket, and one-to-many sockets are in use, the reclaimed write space from sctp_wfree() is 'unfairly' handed back on the server to the association that is the lucky one to be woken up again via __sctp_write_space(), while the remaining associations are never woken up again (unless by a signal). The effect disappears with net.sctp.sndbuf_policy=1, that is, wmem accounting per association, as it guarantees a fair share of wmem among associations.

Therefore, if we have reclaimed memory in case of per socket accounting, wake all related associations to a socket in a fair manner, that is, traverse the socket's association list starting from the current neighbour of the association and issue a __sctp_write_space() to everyone until we end up waking ourselves. This guarantees that no association is preferred over another, and even if more associations are taken into the one-to-many session, all receivers will get messages from the server and are not stalled forever on high load. This setting still leaves the advantage of per socket accounting intact, as an association can still use up global limits if unused by others.

Fixes: 4eb701df ("[SCTP] Fix SCTP sendbuffer accouting.") Signed-off-by:
Daniel Borkmann <dborkman@redhat.com> Cc: Thomas Graf <tgraf@suug.ch> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Vlad Yasevich <vyasevic@redhat.com> Acked-by:
Vlad Yasevich <vyasevic@redhat.com> Acked-by:
Neil Horman <nhorman@tuxdriver.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
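A hedged sketch of the fair wake-up walk (close to the shape described above, with tmp starting at the current association; details such as the head-element skip are assumptions):

    struct sctp_association *tmp = asoc;

    /* Per-socket accounting: hand the reclaimed space around. Start at our
     * neighbour and stop once we have woken ourselves, so no association is
     * systematically preferred. */
    for (tmp = list_next_entry(tmp, asocs); 1;
         tmp = list_next_entry(tmp, asocs)) {
            /* Manually skip the list head itself. */
            if (&tmp->asocs == &((sctp_sk(sk))->ep->asocs))
                    continue;
            __sctp_write_space(tmp);
            if (tmp == asoc)
                    break;
    }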
-
- Apr 07, 2014
-
-
Christoph Lameter authored
The RT_CACHE_STAT_INC macro triggers the new preemption checks for __this_cpu ops. I do not see any other synchronization that would allow the use of a __this_cpu operation here; however, in commit dbd2915c ("[IPV4]: RT_CACHE_STAT_INC() warning fix") Andrew justifies the use of raw_smp_processor_id() here because "we do not care" about races. In the past we agreed that the price of disabling interrupts here to get consistent counters would be too high. These counters may be inaccurate due to race conditions. The use of the __this_cpu op already improves the situation over what commit dbd2915c did, since the single instruction emitted on x86 does not allow the race to occur anymore. However, non-x86 platforms could still experience a race here.

Trace:

 __this_cpu_add operation in preemptible [00000000] code: avahi-daemon/1193
 caller is __this_cpu_preempt_check+0x38/0x60
 CPU: 1 PID: 1193 Comm: avahi-daemon Tainted: GF 3.12.0-rc4+ #187
 Call Trace:
  check_preemption_disabled+0xec/0x110
  __this_cpu_preempt_check+0x38/0x60
  __ip_route_output_key+0x575/0x8c0
  ip_route_output_flow+0x27/0x70
  udp_sendmsg+0x825/0xa20
  inet_sendmsg+0x85/0xc0
  sock_sendmsg+0x9c/0xd0
  ___sys_sendmsg+0x37c/0x390
  __sys_sendmsg+0x49/0x90
  SyS_sendmsg+0x12/0x20
  tracesys+0xe1/0xe6

Signed-off-by:
Christoph Lameter <cl@linux.com> Acked-by:
David S. Miller <davem@davemloft.net> Acked-by:
Ingo Molnar <mingo@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
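The entry does not spell out the resolution, but one way to keep the single-instruction fast path while opting out of the new preemption check is the raw_cpu_* variant of the accessor; whether the upstream patch uses exactly this spelling is an assumption. Sketch:

    /* raw_cpu_inc() documents that the caller knowingly accepts the small
     * cross-cpu race instead of paying for disabling preemption. */
    #define RT_CACHE_STAT_INC(field) raw_cpu_inc(rt_cache_stat.field)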
-
Veaceslav Falico authored
Currently we're checking a variable for != NULL after actually dereferencing it, in netdev_lower_get_next_private*(). It's counter-intuitive at best, and can lead to faulty usage (as it implies that the variable can be NULL), so fix it by removing the useless checks. Reported-by:
Daniel Borkmann <dborkman@redhat.com> CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Nicolas Dichtel <nicolas.dichtel@6wind.com> CC: Jiri Pirko <jiri@resnulli.us> CC: stephen hemminger <stephen@networkplumber.org> CC: Jerry Chu <hkchu@google.com> Signed-off-by:
Veaceslav Falico <vfalico@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
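A sketch of the cleaned-up iterator (close to the upstream helper, but treat the exact body as illustrative):

    void *netdev_lower_get_next_private(struct net_device *dev,
                                        struct list_head **iter)
    {
            struct netdev_adjacent *lower;

            lower = list_entry(*iter, struct netdev_adjacent, list);

            if (&lower->list == &dev->adj_list.lower)
                    return NULL;

            /* no more "if (iter)" here: *iter has already been dereferenced,
             * so the check could never have helped */
            *iter = lower->list.next;

            return lower->private;
    }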
-
Daniel Borkmann authored
Similarly as in commit 8e2f1a63 ("packet: fix packet_direct_xmit for BQL enabled drivers"), we test for the __QUEUE_STATE_STACK_XOFF bit in pktgen's xmit, which means we would not fully fill the device's TX ring for BQL drivers that use netdev_tx_sent_queue(). Fix is to use, similarly as we do in packet sockets, the netif_xmit_frozen_or_drv_stopped() test. Signed-off-by:
Daniel Borkmann <dborkman@redhat.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
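A rough sketch of the before/after condition (the exact surrounding expression is an assumption):

    /* before: also bails out on __QUEUE_STATE_STACK_XOFF, the bit BQL sets
     * via netdev_tx_sent_queue() */
    if (netif_xmit_frozen_or_stopped(txq))
            goto unlock;

    /* after: only the driver's own stopped/frozen state matters for pktgen,
     * so the TX ring can actually be filled on BQL-enabled drivers */
    if (netif_xmit_frozen_or_drv_stopped(txq))
            goto unlock;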
-
Geert Uytterhoeven authored
net/tipc/socket.c: In function ‘tipc_release’:
net/tipc/socket.c:352: warning: ‘res’ is used uninitialized in this function

Introduced by commit 24be34b5 ("tipc: eliminate upcall function pointers between port and socket"), which removed the sole initializer of "res". Just return 0 to fix it. Signed-off-by:
Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Jean Sacren authored
The commit e6278d92 ("mac802154: use header operations to create/parse headers") included the header net/ieee802154_netdev.h, which had already been included by commit b70ab2e8 ("ieee802154: enforce consistent endianness in the 802.15.4 stack"). Fix this duplicate #include by deleting the latter one, as the required header is already in place. Signed-off-by:
Jean Sacren <sakiwit@gmail.com> Cc: Alexander Smirnov <alex.bluesman.smirnov@gmail.com> Cc: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com> Cc: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de> Cc: linux-zigbee-devel@lists.sourceforge.net Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Daniel Borkmann authored
The old interpreter behaviour was that we returned with 0 whenever we found a division by 0 would take place. In the new interpreter we would currently just skip that instead and continue execution. It's true that a value of 0 as return might not be appropriate in all cases, but current users (socket filters -> drop packet, seccomp -> SECCOMP_RET_KILL, cls_bpf -> unclassified, etc) seem fine with that behaviour. Better this than undefined BPF program behaviour as it's expected that A contains the result of the division. In future, as more use cases open up, we could further adapt this return value to our needs, if necessary. So reintroduce return of 0 for division by 0 as in the old interpreter. Also in case of K which is guaranteed to be 32bit wide, sk_chk_filter() already takes care of preventing division by 0 invoked through K, so we can generally spare us these tests. Signed-off-by:
Daniel Borkmann <dborkman@redhat.com> Reviewed-by:
Alexei Starovoitov <ast@plumgrid.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
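A hedged sketch of the reinstated semantics in the interpreter's ALU handling (the opcode spellings follow the classic BPF encoding; the exact register naming in the rewritten interpreter is an assumption):

    case BPF_ALU | BPF_DIV | BPF_X:
            if (unlikely(X == 0))
                    return 0;       /* old interpreter behaviour: bail out with 0 */
            A /= X;
            break;
    case BPF_ALU | BPF_DIV | BPF_K:
            /* K == 0 is already rejected by sk_chk_filter(), so no runtime
             * test is needed for the immediate form. */
            A /= K;
            break;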
-
- Apr 05, 2014
-
-
Thomas Graf authored
All xtables variants suffer from the defect that the copy_to_user() to copy the counters to user memory may fail after the table has already been exchanged and thus exposed. Returning an error at this point would result in freeing the already exposed table. Any subsequent packet processing would then result in a kernel panic. We can't copy the counters before exposing the new tables, as we want to provide the counter state after the old table has been unhooked. Therefore convert this into a silent error. Cc: Florian Westphal <fw@strlen.de> Signed-off-by:
Thomas Graf <tgraf@suug.ch> Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org>
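A sketch of the pattern after the conversion (the warning text and variable names are illustrative; the point is that the already-exposed table is never freed on this path):

    /* The new table has been swapped in and is visible to packet
     * processing. A failed copy of the old counters to userspace must not
     * unwind that, so degrade the failure to a warning instead of an
     * error return. */
    if (copy_to_user(counters_ptr, counters,
                     sizeof(struct xt_counters) * num_counters) != 0)
            pr_warn("copy_to_user of counters failed after table replacement\n");

    vfree(counters);
    return 0;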
-
Ilya Dryomov authored
Dump pool {read,write}_tier to debugfs. While at it, fixup printk type specifiers and remove the unnecessary cast to unsigned long long. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com>
-
Ilya Dryomov authored
Similar to osd weights, output primary affinity values on incremental osdmap updates. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com>
-
Ilya Dryomov authored
Reimplement ceph_calc_pg_primary() in terms of ceph_calc_pg_acting() and get rid of the now unused calc_pg_raw(). Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Respond to non-default primary_affinity values accordingly. (Primary affinity allows the admin to shift 'primary responsibility' away from specific osds, effectively shifting around the read side of the workload and whatever overhead is incurred by peering and writes by virtue of being the primary). Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Change apply_temp() to override primary in the same way pg_temp overrides osd set. primary_temp overrides pg_temp primary too. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
In preparation for adding support for primary_temp, stop assuming primaryness: add a primary out parameter to ceph_calc_pg_acting() and change call sites accordingly. Primary is now specified separately from the order of osds in the set. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Switch ceph_calc_pg_acting() to new helpers: pg_to_raw_osds(), raw_to_up_osds() and apply_temps(). Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
apply_temp() helper for applying various temporary mappings (at this point only pg_temp mappings) to the up set, therefore transforming it into an acting set. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
pg_to_raw_osds() helper for computing a raw (crush) set, which can contain non-existent and down osds. raw_to_up_osds() helper for pruning non-existent and down osds from the raw set, therefore transforming it into an up set, and determining up primary. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Add two helpers to decode primary_affinity (full map, vector<u32>) and new_primary_affinity (inc map, map<u32, u32>) and switch to them. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Add primary_affinity infrastructure. primary_affinity values are stored in a max_osd-sized array, hanging off ceph_osdmap, similar to the osd_weight array. Introduce {get,set}_primary_affinity() helpers, primarily to return CEPH_OSD_DEFAULT_PRIMARY_AFFINITY when no affinity has been set and to abstract out osd_primary_affinity array allocation and initialization. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
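A sketch of the getter side of those helpers, following the description above (treat the exact signature and field names as assumptions):

    u32 ceph_get_primary_affinity(struct ceph_osdmap *map, int osd)
    {
            BUG_ON(osd >= map->max_osd);

            /* Lazily allocated: an absent array means every osd is still at
             * the default affinity. */
            if (!map->osd_primary_affinity)
                    return CEPH_OSD_DEFAULT_PRIMARY_AFFINITY;

            return map->osd_primary_affinity[osd];
    }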
-
Ilya Dryomov authored
Add a common helper to decode both primary_temp (full map, map<pg_t, u32>) and new_primary_temp (inc map, same) and switch to it. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Add primary_temp mappings infrastructure. struct ceph_pg_mapping is overloaded; primary_temp mappings are stored in an rb-tree, rooted at ceph_osdmap, in a manner similar to pg_temp mappings. Dump primary_temp mappings to /sys/kernel/debug/ceph/<client>/osdmap, one 'primary_temp <pgid> <osd>' per line, e.g.:

  primary_temp 2.6 4

Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
In preparation for adding support for primary_temp mappings, generalize struct ceph_pg_mapping so it can hold mappings other than pg_temp. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Full and incremental osdmaps are structured identically and have identical headers. Add a helper to decode both "old" (16-bit version, v6) and "new" (8-bit struct_v+struct_compat+struct_len, v7) osdmap encoding headers and switch to it. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Consolidate pg_temp (full map, map<pg_t, vector<u32>>) and new_pg_temp (inc map, same) decoding logic into a common helper and switch to it. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Use krealloc() instead of rolling our own. (krealloc() with a NULL first argument acts as a kmalloc().) Properly initialize the new array elements. This is needed to make future additions to osdmap easier. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
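A sketch of the shape this gives osdmap_set_max_osd() (field names follow the existing ceph_osdmap members; error handling is abbreviated and the exact initial values for new slots are assumptions):

    u8 *state;
    u32 *weight;
    int i;

    /* krealloc() with a NULL pointer behaves like kmalloc(), so the same
     * code handles both the first allocation and later growth. */
    state = krealloc(map->osd_state, max * sizeof(*state), GFP_NOFS);
    if (!state)
            return -ENOMEM;
    map->osd_state = state;

    weight = krealloc(map->osd_weight, max * sizeof(*weight), GFP_NOFS);
    if (!weight)
            return -ENOMEM;
    map->osd_weight = weight;

    /* Newly added slots must not be left uninitialized: mark the new osds
     * as nonexistent/out. */
    for (i = map->max_osd; i < max; i++) {
            map->osd_state[i] = 0;
            map->osd_weight[i] = CEPH_OSD_OUT;
    }

    map->max_osd = max;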
-
Ilya Dryomov authored
Consolidate pools (full map, map<u64, pg_pool_t>) and new_pools (inc map, same) decoding logic into a common helper and switch to it. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
To be in line with all the other osdmap decode helpers. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Sum up sizeof(...) results instead of (incorrectly) hard-coding the number of bytes, expressed in ints and longs. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Only version 6 of osdmap encoding is supported, anything other than version 6 results in an error and halts the decoding process. Checking if version is >= 5 is therefore bogus. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
The existing error handling scheme requires resetting err to -EINVAL prior to calling any ceph_decode_* macro. This is ugly and fragile, and there already are a few places where we would return 0 on error, due to a missing reset. Follow osdmap_decode() and fix this by adding a special e_inval label to be used by all ceph_decode_* macros. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
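A sketch of the pattern this converts the incremental decoder to (mirroring what osdmap_decode() already does; the decoded fields are placeholders):

    int err;
    u32 len;

    ceph_decode_32_safe(p, end, len, e_inval);   /* jumps on short input */
    ceph_decode_need(p, end, len, e_inval);
    /* ... decode the payload, goto e_inval on any malformed field ... */

    return 0;

e_inval:
    err = -EINVAL;       /* the single place where -EINVAL is set */
    /* ... common error reporting and cleanup ... */
    return err;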
-
Ilya Dryomov authored
The size of the memory area fed to crush_decode() should be limited not only by the osdmap end, but also by the crush map length. Also, drop an unnecessary dout() (a dout() in crush_decode() conveys the same info) and step past the crush map only if it is decoded successfully. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
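A sketch of the bounded call (variable names as in the surrounding decode code; the min() of "start + len" and "map end" is the point of the fix):

    u32 len;

    ceph_decode_32_safe(p, end, len, e_inval);
    map->crush = crush_decode(*p, min(*p + len, end));   /* never read past the map */
    if (IS_ERR(map->crush)) {
            err = PTR_ERR(map->crush);
            map->crush = NULL;
            goto bad;
    }
    *p += len;   /* only advance past the crush map once it decoded cleanly */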
-