- Jun 04, 2014
-
Chuck Lever authored
Clean up: rpcrdma_ep_destroy() returns a value that is used only to print a debugging message. rpcrdma_ep_destroy() already prints debugging messages in all error cases. Make rpcrdma_ep_destroy() return void instead. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Tested-by:
Steve Wise <swise@opengridcomputing.com> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
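A minimal sketch of the shape of this cleanup (signatures follow the xprtrdma verbs.c of this era; treat as illustrative):

    /* before: the int return was only ever fed to a dprintk at the call site */
    int rpcrdma_ep_destroy(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia);

    /* after: the function already logs its own errors, so callers need nothing back */
    void rpcrdma_ep_destroy(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia);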
-
Chuck Lever authored
Clean up: All remaining callers of rpcrdma_deregister_external() pass NULL as the last argument, so remove that argument. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Tested-by:
Steve Wise <swise@opengridcomputing.com> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
If the selected memory registration mode is not supported by the underlying provider/HCA, the NFS mount command reports that there was an invalid mount option, and fails. This is misleading. Reporting a problem allocating memory is a lot closer to the truth. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
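The fix amounts to an errno swap in the memory-registration setup path; a hedged sketch (the exact function and label names are assumptions):

    /* in rpcrdma_ia_open(), when the requested memreg mode can't be set up */
    default:
        rc = -ENOMEM;   /* was -EINVAL, which mount reported as a bad option */
        goto out2;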
-
Chuck Lever authored
An audit of in-kernel RDMA providers that do not support the FRMR memory registration shows that several of them support MTHCAFMR. Prefer MTHCAFMR when FRMR is not supported. If MTHCAFMR is not supported, only then choose ALLPHYSICAL. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
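A sketch of the resulting fallback ladder, modeled on rpcrdma_ia_open() of this period (the capability checks are close to, but not guaranteed to match, the committed code):

    /* prefer FRMR; fall back to MTHCAFMR, and only then to ALLPHYSICAL */
    if (memreg == RPCRDMA_FRMR) {
        if (!(devattr.device_cap_flags & IB_DEVICE_MEM_MGT_EXTENSIONS))
            memreg = RPCRDMA_MTHCAFMR;      /* HCA lacks fast registration */
    }
    if (memreg == RPCRDMA_MTHCAFMR) {
        if (!ia->ri_id->device->alloc_fmr)
            memreg = RPCRDMA_ALLPHYSICAL;   /* HCA lacks FMR support too */
    }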
-
Chuck Lever authored
All kernel RDMA providers except amso1100 support either MTHCAFMR or FRMR, both of which are faster than REGISTER. amso1100 can continue to use ALLPHYSICAL. The only other ULP consumer in the kernel that uses the reg_phys_mr verb is Lustre. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
The MEMWINDOWS and MEMWINDOWS_ASYNC memory registration modes were intended as stop-gap modes before the introduction of FRMR. They are now considered obsolete. MEMWINDOWS_ASYNC is also considered unsafe because it can leave client memory registered and exposed for an indeterminate time after each I/O. At this point, the MEMWINDOWS modes add needless complexity, so remove them. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Tested-by:
Steve Wise <swise@opengridcomputing.com> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Clean up: This memory registration mode is slow and was never meant for use in production environments. Remove it to reduce implementation complexity. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Tested-by:
Steve Wise <swise@opengridcomputing.com> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
An IB provider can invoke rpcrdma_conn_func() in an IRQ context, thus rpcrdma_conn_func() cannot be allowed to directly invoke generic RPC functions like xprt_wake_pending_tasks(). Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Tested-by:
Steve Wise <swise@opengridcomputing.com> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
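The usual remedy, and roughly what xprtrdma does here, is to bounce the wake-up into process context via a work item (names per the xprtrdma sources; verify against the actual patch):

    /* set up once, at endpoint creation */
    INIT_DELAYED_WORK(&ep->rep_connect_worker, rpcrdma_connect_worker);

    /* rpcrdma_conn_func(), possibly in IRQ context: defer the RPC wake-up */
    schedule_delayed_work(&ep->rep_connect_worker, 0);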
-
Allen Andrews authored
Two memory region leaks were found during testing:

1. rpcrdma_buffer_create: While allocating RPCRDMA_FRMRs, ib_alloc_fast_reg_mr is called and then ib_alloc_fast_reg_page_list is called. If ib_alloc_fast_reg_page_list returns an error, the routine bails out, dropping the FRMR just obtained from ib_alloc_fast_reg_mr and creating a memory leak. Added code to deregister that last FRMR if ib_alloc_fast_reg_page_list fails.

2. rpcrdma_buffer_destroy: While cleaning up, the routine frees the MRs on the rb_mws list only if rb_send_bufs are present. However, if one of the MR allocation requests in rpcrdma_buffer_create fails after some MRs have already been added to the rb_mws list, the routine never gets to create any rb_send_bufs and instead jumps to rpcrdma_buffer_destroy, which will never free the MRs on the rb_mws list because the rb_send_bufs were never created. This leaks all the MRs placed on the rb_mws list before the failing allocation.

Issue (2) was seen during testing. Our adapter had a finite number of MRs available, and we created enough connections that we saw an MR allocation failure on our Nth NFS connection request. After the kernel cleaned up the resources it had allocated for the Nth connection, we noticed that FMRs had been leaked due to the coding error described above. Issue (1) was found during a code review while debugging issue (2). Signed-off-by:
Allen Andrews <allen.andrews@emulex.com> Reviewed-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
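A sketch of the fix for leak (1), following the ib_alloc_fast_reg_* API of this era (error-path details are an assumption):

    r->r.frmr.fr_mr = ib_alloc_fast_reg_mr(ia->ri_pd, RPCRDMA_MAX_SEGS);
    if (IS_ERR(r->r.frmr.fr_mr))
        goto out;
    r->r.frmr.fr_pgl = ib_alloc_fast_reg_page_list(ia->ri_id->device,
                                                   RPCRDMA_MAX_SEGS);
    if (IS_ERR(r->r.frmr.fr_pgl)) {
        ib_dereg_mr(r->r.frmr.fr_mr);   /* new: don't leak the MR just created */
        goto out;
    }

For leak (2), the corresponding change makes rpcrdma_buffer_destroy walk and free rb_mws unconditionally, rather than only when rb_send_bufs exist.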
-
Steve Wise authored
Some RDMA devices don't support a fast register page list depth of at least RPCRDMA_MAX_DATA_SEGS, so xprtrdma needs to chunk its fast register regions according to the minimum of the device's maximum supported depth and RPCRDMA_MAX_DATA_SEGS. Signed-off-by:
Steve Wise <swise@opengridcomputing.com> Reviewed-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
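The resulting limit is a simple min against the device attribute (the field name comes from struct ib_device_attr; the variable holding the result is an assumption):

    ia->ri_max_frmr_depth = min_t(unsigned int,
                                  RPCRDMA_MAX_DATA_SEGS,
                                  devattr.max_fast_reg_page_list_len);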
-
- May 29, 2014
-
David Rientjes authored
rpc_malloc() allocates with GFP_NOWAIT without making any attempt at reclaim so it easily fails when low on memory. This ends up spamming the kernel log:

    SLAB: Unable to allocate memory on node 0 (gfp=0x4000)
      cache: kmalloc-8192, object size: 8192, order: 1
      node 0: slabs: 207/207, objs: 207/207, free: 0
    rekonq: page allocation failure: order:1, mode:0x204000
    CPU: 2 PID: 14321 Comm: rekonq Tainted: G O 3.15.0-rc3-12.gfc9498b-desktop+ #6
    Hardware name: System manufacturer System Product Name/M4A785TD-V EVO, BIOS 2105 07/23/2010
     0000000000000000 ffff880010ff17d0 ffffffff815e693c 0000000000204000
     ffff880010ff1858 ffffffff81137bd2 0000000000000000 0000001000000000
     ffff88011ffebc38 0000000000000001 0000000000204000 ffff88011ffea000
    Call Trace:
     [<ffffffff815e693c>] dump_stack+0x4d/0x6f
     [<ffffffff81137bd2>] warn_alloc_failed+0xd2/0x140
     [<ffffffff8113be19>] __alloc_pages_nodemask+0x7e9/0xa30
     [<ffffffff811824a8>] kmem_getpages+0x58/0x140
     [<ffffffff81183de6>] fallback_alloc+0x1d6/0x210
     [<ffffffff81183be3>] ____cache_alloc_node+0x123/0x150
     [<ffffffff81185953>] __kmalloc+0x203/0x490
     [<ffffffffa06b0ee2>] rpc_malloc+0x32/0xa0 [sunrpc]
     [<ffffffffa06a6999>] call_allocate+0xb9/0x170 [sunrpc]
     [<ffffffffa06b19d8>] __rpc_execute+0x88/0x460 [sunrpc]
     [<ffffffffa06b2da9>] rpc_execute+0x59/0xc0 [sunrpc]
     [<ffffffffa06a932b>] rpc_run_task+0x6b/0x90 [sunrpc]
     [<ffffffffa077b5c1>] nfs4_call_sync_sequence+0x51/0x80 [nfsv4]
     [<ffffffffa077d45d>] _nfs4_do_setattr+0x1ed/0x280 [nfsv4]
     [<ffffffffa0782a72>] nfs4_do_setattr+0x72/0x180 [nfsv4]
     [<ffffffffa078334c>] nfs4_proc_setattr+0xbc/0x140 [nfsv4]
     [<ffffffffa074a7e8>] nfs_setattr+0xd8/0x240 [nfs]
     [<ffffffff811baa71>] notify_change+0x231/0x380
     [<ffffffff8119cf5c>] chmod_common+0xfc/0x120
     [<ffffffff8119df80>] SyS_chmod+0x40/0x90
     [<ffffffff815f4cfd>] system_call_fastpath+0x1a/0x1f
    ...

If the allocation fails, simply return NULL and avoid spamming the kernel log. Reported-by:
Marc Dietrich <marvin24@gmx.de> Signed-off-by:
David Rientjes <rientjes@google.com> Signed-off-by:
Trond Myklebust <trond.myklebust@primarydata.com>
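Suppressing the log spam comes down to one gfp tweak plus tolerating failure; a minimal sketch of rpc_malloc() after the change (surrounding logic elided):

    gfp_t gfp = GFP_NOWAIT | __GFP_NOWARN;   /* fail fast and quietly */
    ...
    buf = kmalloc(size, gfp);
    if (!buf)
        return NULL;   /* caller copes; no allocation-failure splat */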
-
- May 18, 2014
-
Trond Myklebust authored
We're not taking a reference in the case where _gss_mech_get_by_pseudoflavor loops without finding the correct rpcsec_gss flavour, so why are we releasing it? Signed-off-by:
Trond Myklebust <trond.myklebust@primarydata.com>
-
- Apr 12, 2014
-
Nicolas Dichtel authored
Before the patch, it was possible to add two times the same tunnel: ip l a vti1 type vti remote 10.16.0.121 local 10.16.0.249 key 41 ip l a vti2 type vti remote 10.16.0.121 local 10.16.0.249 key 41 It was possible, because ip_tunnel_newlink() calls ip_tunnel_find() with the argument dev->type, which was set only later (when calling ndo_init handler in register_netdevice()). Let's set this type in the setup handler, which is called before newlink handler. Introduced by commit b9959fd3 ("vti: switch to new ip tunnel code"). CC: Cong Wang <amwang@redhat.com> CC: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by:
Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
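The shape of the fix, per the description above (handler names follow the vti driver; treat as a sketch):

    static void vti_tunnel_setup(struct net_device *dev)
    {
        dev->netdev_ops = &vti_netdev_ops;
        dev->type       = ARPHRD_TUNNEL;   /* now set before register_netdevice(),
                                            * so ip_tunnel_find() can match on it */
        ip_tunnel_setup(dev, vti_net_id);
    }

The GRE fix below is the same pattern with ARPHRD_IPGRE.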
-
Nicolas Dichtel authored
Before the patch, it was possible to add two times the same tunnel: ip l a gre1 type gre remote 10.16.0.121 local 10.16.0.249 ip l a gre2 type gre remote 10.16.0.121 local 10.16.0.249 It was possible, because ip_tunnel_newlink() calls ip_tunnel_find() with the argument dev->type, which was set only later (when calling ndo_init handler in register_netdevice()). Let's set this type in the setup handler, which is called before newlink handler. Introduced by commit c5441932 ("GRE: Refactor GRE tunneling code."). CC: Pravin B Shelar <pshelar@nicira.com> Signed-off-by:
Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Daniel Borkmann authored
Similarly to commit 43279500 ("packet: respect devices with LLTX flag in direct xmit"), we can basically apply the very same to pktgen. This will help testing against LLTX devices such as dummy driver (or others), which only have a single netdevice txq and would otherwise require locking their txq from pktgen side while e.g. in dummy case, we would not need any locking. Fix this by making use of HARD_TX_{UN,}LOCK API, so that NETIF_F_LLTX will be respected. Signed-off-by:
Daniel Borkmann <dborkman@redhat.com> Signed-off-by:
Jesper Dangaard Brouer <brouer@redhat.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
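A sketch of the xmit path after the change (assuming the HARD_TX_{UN,}LOCK macros are visible to pktgen; they skip the queue lock for NETIF_F_LLTX devices):

    local_bh_disable();
    HARD_TX_LOCK(odev, txq, smp_processor_id());   /* no-op for NETIF_F_LLTX */
    if (unlikely(netif_xmit_frozen_or_drv_stopped(txq)))
        ret = NETDEV_TX_BUSY;
    else
        ret = odev->netdev_ops->ndo_start_xmit(pkt_dev->skb, odev);
    HARD_TX_UNLOCK(odev, txq);
    local_bh_enable();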
-
- Apr 11, 2014
-
Lorenzo Colitti authored
net-next commit 9c76a114 ("ipv6: tcp_ipv6 policy route issue") had a boolean logic error that caused incorrect behaviour for TCP SYN+ACK when oif-based rules are in use. Specifically:

1. If a SYN comes in from a global address, and sk_bound_dev_if is not set, the routing lookup has oif set to the interface the SYN came in on. Instead, it should have oif unset, because for global addresses, the incoming interface doesn't necessarily have any bearing on the interface the SYN+ACK is sent out on.

2. If a SYN comes in from a link-local address, and sk_bound_dev_if is set, the routing lookup has oif set to the interface the SYN came in on. Instead, it should have oif set to sk_bound_dev_if, because that's what the application requested. Signed-off-by:
Lorenzo Colitti <lorenzo@google.com> Acked-by:
Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by:
David S. Miller <davem@davemloft.net>
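Pseudo-logic for the corrected lookup; this illustrates the two rules above and is not the literal patch:

    /* choosing fl6.flowi6_oif for the SYN+ACK route lookup */
    if (sk_bound_dev_if)
        fl6.flowi6_oif = sk_bound_dev_if;    /* rule 2: honor the binding */
    else if (ipv6_addr_type(&fl6.daddr) & IPV6_ADDR_LINKLOCAL)
        fl6.flowi6_oif = inet6_iif(skb);     /* link-local: ingress device */
    else
        fl6.flowi6_oif = 0;                  /* rule 1: global, let routing pick */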
-
David S. Miller authored
Several spots in the kernel perform a sequence like:

    skb_queue_tail(&sk->sk_receive_queue, skb);
    sk->sk_data_ready(sk, skb->len);

But at the moment we place the SKB onto the socket receive queue, it can be consumed and freed up, so this skb->len access is potentially to freed up memory. Furthermore, the skb->len can be modified by the consumer, so it is possible that the value isn't accurate. And finally, no actual implementation of this callback actually uses the length argument. And since nobody actually cared about its value, lots of call sites pass arbitrary values in, such as '0' and even '1'. So just remove the length argument from the callback; that way there is no confusion whatsoever and all of these use-after-free cases get fixed as a side effect. Based upon a patch by Eric Dumazet and his suggestion to audit this issue tree-wide. Signed-off-by:
David S. Miller <davem@davemloft.net>
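Concretely, the callback loses its second parameter; every implementation and call site changes like so:

    /* struct sock */
    void (*sk_data_ready)(struct sock *sk);   /* was (struct sock *sk, int bytes) */

    /* call sites */
    skb_queue_tail(&sk->sk_receive_queue, skb);
    sk->sk_data_ready(sk);                    /* was sk->sk_data_ready(sk, skb->len) */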
-
Toshiaki Makita authored
br_allowed_ingress() has two problems. 1. If br_allowed_ingress() is called by br_handle_frame_finish() and vlan_untag() in br_allowed_ingress() fails, skb will be freed by both vlan_untag() and br_handle_frame_finish(). 2. If br_allowed_ingress() is called by br_dev_xmit() and br_allowed_ingress() fails, the skb will not be freed. Fix these two problems by freeing the skb in br_allowed_ingress() if it fails. Signed-off-by:
Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by:
David S. Miller <davem@davemloft.net>
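After the change, ownership of a failed skb sits with br_allowed_ingress() itself, so both callers simply stop processing; a sketch (caller details assumed):

    if (!br_allowed_ingress(p->br, nbp_get_vlan_info(p), skb, &vid))
        goto out;   /* skb was already freed inside br_allowed_ingress() */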
-
Florian Westphal authored
In case of tcp, gso_size contains the tcpmss. For UFO (udp fragmentation offloading) skbs, gso_size is the fragment payload size, i.e. we must not account for udp header size. Otherwise, when using virtio drivers, a to-be-forwarded UFO GSO packet will be needlessly fragmented in the forward path, because we think its individual segments are too large for the outgoing link. Fixes: fe6cc55f ("net: ip, ipv6: handle gso skbs in forwarding path") Cc: Eric Dumazet <eric.dumazet@gmail.com> Reported-by:
Tobias Brunner <tobias@strongswan.org> Signed-off-by:
Florian Westphal <fw@strlen.de> Signed-off-by:
David S. Miller <davem@davemloft.net>
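The fix lands in the helper that computes the per-segment length; this matches the description above, though treat the exact helper as a sketch:

    static unsigned int skb_gso_transport_seglen(const struct sk_buff *skb)
    {
        const struct skb_shared_info *shinfo = skb_shinfo(skb);

        if (likely(shinfo->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6)))
            return tcp_hdrlen(skb) + shinfo->gso_size;

        /* UFO: gso_size is already the fragment payload size; do not
         * add the UDP header or forwarded packets look too big. */
        return shinfo->gso_size;
    }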
-
- Apr 09, 2014
-
Dmitry Petukhov authored
When l2tp driver tries to get PMTU for the tunnel destination, it uses the pointer to struct sock that represents PPPoX socket, while it should use the pointer that represents UDP socket of the tunnel. Signed-off-by:
Dmitry Petukhov <dmgenp@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
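A heavily hedged sketch of the intent (assuming the tunnel keeps its UDP socket in tunnel->sock, as l2tp_core does):

    /* query PMTU on the tunnel's UDP socket, not the PPPoX socket */
    dst = sk_dst_get(tunnel->sock);
    if (dst) {
        u32 pmtu = dst_mtu(dst);
        if (pmtu)
            session->mtu = pmtu - PPPOL2TP_HEADER_OVERHEAD;
        dst_release(dst);
    }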
-
Daniel Borkmann authored
In function sctp_wake_up_waiters(), we need to involve a test if the association is declared dead. If so, we don't have any reference to a possible sibling association anymore and need to invoke sctp_write_space() instead, and normally walk the socket's associations and notify them of new wmem space. The reason for special casing is that otherwise, we could run into the following issue when a sctp_primitive_SEND() call from sctp_sendmsg() fails, and tries to flush an association's outq:

    sctp_association_free()
    `-> list_del(&asoc->asocs)         <-- poisons list pointer
        asoc->base.dead = true
        sctp_outq_free(&asoc->outqueue)
        `-> __sctp_outq_teardown()
         `-> sctp_chunk_free()
          `-> consume_skb()
           `-> sctp_wfree()
            `-> sctp_wake_up_waiters() <-- dereferences poisoned pointers
                                           if asoc->ep->sndbuf_policy=0

Therefore, only walk the list in an 'optimized' way if we find that the current association is still active. We could also use list_del_init() in addition when we call sctp_association_free(), but as Vlad suggests, we want to trap such bugs and thus leave it poisoned as is.

Why is it safe to resolve the issue by testing for asoc->base.dead? Parallel calls to sctp_sendmsg() are protected under socket lock, that is lock_sock()/release_sock(). Only within that path under lock held, we're setting skb/chunk owner via sctp_set_owner_w(). Eventually, chunks are freed directly by an association still under that lock. So when traversing the association list on destruction time from sctp_wake_up_waiters() via sctp_wfree(), a different CPU can't be running sctp_wfree() while another one calls sctp_association_free(), as both happen under the same lock. Therefore, this can also not race with setting/testing against asoc->base.dead, as we are guaranteed for this to happen in order, under lock.

Further, Vlad says: the times we check asoc->base.dead is when we've cached an association pointer for later processing. In between cache and processing, the association may have been freed and is simply still around due to reference counts. We check asoc->base.dead under a lock, so it should always be safe to check and not race against sctp_association_free(). Stress-testing seems fine now, too. Fixes: cd253f9f357d ("net: sctp: wake up all assocs if sndbuf policy is per socket") Signed-off-by:
Daniel Borkmann <dborkman@redhat.com> Cc: Vlad Yasevich <vyasevic@redhat.com> Acked-by:
Neil Horman <nhorman@tuxdriver.com> Acked-by:
Vlad Yasevich <vyasevic@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
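The added special case sits at the top of sctp_wake_up_waiters(), before the sibling walk (a sketch consistent with the description above):

    /* The association is flushing its outq on the way down: its list
     * pointers are poisoned, so fall back to the plain per-socket
     * notification instead of walking siblings. */
    if (asoc->base.dead)
        return sctp_write_space(sk);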
-
- Apr 08, 2014
-
Daniel Borkmann authored
SCTP charges chunks for wmem accounting via skb->truesize in sctp_set_owner_w(), and sctp_wfree() respectively as the reverse operation. If a sender runs out of wmem, it needs to wait via sctp_wait_for_sndbuf(), and gets woken up by a call to __sctp_write_space() mostly via sctp_wfree(). __sctp_write_space() is being called per association. Although we assign sk->sk_write_space() to sctp_write_space(), which is then being done per socket, it is only used if send space is increased per socket option (SO_SNDBUF), as SOCK_USE_WRITE_QUEUE is set and therefore not invoked in sock_wfree().

Commit 4c3a5bda ("sctp: Don't charge for data in sndbuf again when transmitting packet") fixed an issue where, in case sctp_packet_transmit() manages to queue up more than sndbuf bytes, sctp_wait_for_sndbuf() will never be woken up again unless it is interrupted by a signal. However, a still remaining issue is that if net.sctp.sndbuf_policy=0, that is, accounting per socket, and one-to-many sockets are in use, the reclaimed write space from sctp_wfree() is 'unfairly' handed back on the server to the association that is the lucky one to be woken up again via __sctp_write_space(), while the remaining associations are never woken up again (unless by a signal). The effect disappears with net.sctp.sndbuf_policy=1, that is, wmem accounting per association, as it guarantees a fair share of wmem among associations.

Therefore, if we have reclaimed memory in case of per-socket accounting, wake all related associations to a socket in a fair manner, that is, traverse the socket association list starting from the current neighbour of the association and issue a __sctp_write_space() to everyone until we end up waking ourselves. This guarantees that no association is preferred over another, and even if more associations are taken into the one-to-many session, all receivers will get messages from the server and are not stalled forever on high load. This setting still leaves the advantage of per-socket accounting intact, as an association can still use up global limits if unused by others. Fixes: 4eb701df ("[SCTP] Fix SCTP sendbuffer accouting.") Signed-off-by:
Daniel Borkmann <dborkman@redhat.com> Cc: Thomas Graf <tgraf@suug.ch> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Vlad Yasevich <vyasevic@redhat.com> Acked-by:
Vlad Yasevich <vyasevic@redhat.com> Acked-by:
Neil Horman <nhorman@tuxdriver.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
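A sketch of the fair wake-up walk described above (the list-head test follows sctp's endpoint layout; treat as illustrative):

    /* Start at our neighbour and wrap around until we reach ourselves,
     * so no association is persistently favoured. */
    for (tmp = list_next_entry(tmp, asocs); 1;
         tmp = list_next_entry(tmp, asocs)) {
        if (&tmp->asocs == &((sctp_sk(sk))->ep->asocs))
            continue;                /* skip the list head itself */
        __sctp_write_space(tmp);     /* wake this association's waiters */
        if (tmp == asoc)
            break;                   /* back to where we started */
    }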
-
- Apr 07, 2014
-
Christoph Lameter authored
The RT_CACHE_STAT_INC macro triggers the new preemption checks for __this_cpu ops. I do not see any other synchronization that would allow the use of a __this_cpu operation here; however, in commit dbd2915c ("[IPV4]: RT_CACHE_STAT_INC() warning fix") Andrew justifies the use of raw_smp_processor_id() here because "we do not care" about races. In the past we agreed that the price of disabling interrupts here to get consistent counters would be too high. These counters may be inaccurate due to race conditions. The use of a __this_cpu op already improves the situation over what commit dbd2915c did, since the single instruction emitted on x86 does not allow the race to occur anymore. However, non-x86 platforms could still experience a race here. Trace:

    __this_cpu_add operation in preemptible [00000000] code: avahi-daemon/1193
    caller is __this_cpu_preempt_check+0x38/0x60
    CPU: 1 PID: 1193 Comm: avahi-daemon Tainted: GF 3.12.0-rc4+ #187
    Call Trace:
     check_preemption_disabled+0xec/0x110
     __this_cpu_preempt_check+0x38/0x60
     __ip_route_output_key+0x575/0x8c0
     ip_route_output_flow+0x27/0x70
     udp_sendmsg+0x825/0xa20
     inet_sendmsg+0x85/0xc0
     sock_sendmsg+0x9c/0xd0
     ___sys_sendmsg+0x37c/0x390
     __sys_sendmsg+0x49/0x90
     SyS_sendmsg+0x12/0x20
     tracesys+0xe1/0xe6

Signed-off-by:
Christoph Lameter <cl@linux.com> Acked-by:
David S. Miller <davem@davemloft.net> Acked-by:
Ingo Molnar <mingo@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
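The one-line outcome is to use the raw (non-checking) per-cpu op, keeping the old "we do not care about races" semantics without tripping the new debug check (modulo the exact macro body):

    #define RT_CACHE_STAT_INC(field) raw_cpu_inc(rt_cache_stat.field)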
-
Veaceslav Falico authored
Currently we're checking a variable for != NULL after actually dereferencing it, in netdev_lower_get_next_private*(). It's counter-intuitive at best, and can lead to faulty usage (as it implies that the variable can be NULL), so fix it by removing the useless checks. Reported-by:
Daniel Borkmann <dborkman@redhat.com> CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Nicolas Dichtel <nicolas.dichtel@6wind.com> CC: Jiri Pirko <jiri@resnulli.us> CC: stephen hemminger <stephen@networkplumber.org> CC: Jerry Chu <hkchu@google.com> Signed-off-by:
Veaceslav Falico <vfalico@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Daniel Borkmann authored
Similarly as in commit 8e2f1a63 ("packet: fix packet_direct_xmit for BQL enabled drivers"), we test for __QUEUE_STATE_STACK_XOFF bit in pktgen's xmit, which would not fully fill the device's TX ring for BQL drivers that use netdev_tx_sent_queue(). Fix is to use, similarly as we do in packet sockets, netif_xmit_frozen_or_drv_stopped() test. Signed-off-by:
Daniel Borkmann <dborkman@redhat.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
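The replacement test, mirroring what packet sockets use (the surrounding context is assumed):

    if (unlikely(netif_xmit_frozen_or_drv_stopped(txq))) {
        /* covers __QUEUE_STATE_STACK_XOFF set by BQL as well as
         * driver-stopped and frozen queue states */
        ret = NETDEV_TX_BUSY;
        goto out;
    }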
-
Geert Uytterhoeven authored
net/tipc/socket.c: In function ‘tipc_release’:
net/tipc/socket.c:352: warning: ‘res’ is used uninitialized in this function

Introduced by commit 24be34b5 ("tipc: eliminate upcall function pointers between port and socket"), which removed the sole initializer of "res". Just return 0 to fix it. Signed-off-by:
Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Jean Sacren authored
Commit e6278d92 ("mac802154: use header operations to create/parse headers") included the header net/ieee802154_netdev.h, which had already been included by commit b70ab2e8 ("ieee802154: enforce consistent endianness in the 802.15.4 stack"). Fix the duplicate #include by deleting the latter one, as the required header is already in place. Signed-off-by:
Jean Sacren <sakiwit@gmail.com> Cc: Alexander Smirnov <alex.bluesman.smirnov@gmail.com> Cc: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com> Cc: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de> Cc: linux-zigbee-devel@lists.sourceforge.net Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Daniel Borkmann authored
The old interpreter behaviour was that we returned with 0 whenever we found a division by 0 would take place. In the new interpreter we would currently just skip that instead and continue execution. It's true that a value of 0 as return might not be appropriate in all cases, but current users (socket filters -> drop packet, seccomp -> SECCOMP_RET_KILL, cls_bpf -> unclassified, etc) seem fine with that behaviour. Better this than undefined BPF program behaviour as it's expected that A contains the result of the division. In future, as more use cases open up, we could further adapt this return value to our needs, if necessary. So reintroduce return of 0 for division by 0 as in the old interpreter. Also in case of K which is guaranteed to be 32bit wide, sk_chk_filter() already takes care of preventing division by 0 invoked through K, so we can generally spare us these tests. Signed-off-by:
Daniel Borkmann <dborkman@redhat.com> Reviewed-by:
Alexei Starovoitov <ast@plumgrid.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
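Roughly, the interpreter's divide cases regain the old early return (instruction labels follow the 3.15-era internal BPF decoder; treat as a sketch):

    ALU_DIV_X:
        if (X == 0)
            return 0;            /* old behaviour: terminate, 'drop' verdict */
        A = (u32) A / (u32) X;   /* A holds the division result as expected */
        CONT;

The BPF_K divide needs no runtime test, since sk_chk_filter() already rejects a constant divisor of 0.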
-
- Apr 05, 2014
-
Thomas Graf authored
All xtables variants suffer from the defect that the copy_to_user() that copies the counters to user memory may fail after the table has already been exchanged and thus exposed. Returning an error at this point would result in freeing the already exposed table. Any subsequent packet processing would then result in a kernel panic. We can't copy the counters before exposing the new table, as we want to provide the counter state after the old table has been unhooked. Therefore, convert this into a silent error. Cc: Florian Westphal <fw@strlen.de> Signed-off-by:
Thomas Graf <tgraf@suug.ch> Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org>
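In code terms, the error path collapses to a warning (a sketch; the exact message and helper are assumptions):

    if (copy_to_user(tmp.counters, counters,
                     sizeof(struct xt_counters) * num_counters) != 0) {
        /* silent error: the new table is already hooked up, so failing
         * here would free a live table and panic on the next packet */
        net_warn_ratelimited("counters copy to user failed while replacing table\n");
    }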
-
Ilya Dryomov authored
Dump pool {read,write}_tier to debugfs. While at it, fixup printk type specifiers and remove the unnecessary cast to unsigned long long. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com>
-
Ilya Dryomov authored
Similar to osd weights, output primary affinity values on incremental osdmap updates. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com>
-
Ilya Dryomov authored
Reimplement ceph_calc_pg_primary() in terms of ceph_calc_pg_acting() and get rid of the now unused calc_pg_raw(). Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Respond to non-default primary_affinity values accordingly. (Primary affinity allows the admin to shift 'primary responsibility' away from specific osds, effectively shifting around the read side of the workload and whatever overhead is incurred by peering and writes by virtue of being the primary). Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Change apply_temps() to override the primary in the same way pg_temp overrides the osd set. primary_temp overrides the pg_temp primary too. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
In preparation for adding support for primary_temp, stop assuming primaryness: add a primary out parameter to ceph_calc_pg_acting() and change call sites accordingly. Primary is now specified separately from the order of osds in the set. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Switch ceph_calc_pg_acting() to new helpers: pg_to_raw_osds(), raw_to_up_osds() and apply_temps(). Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
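Conceptually, the acting-set computation becomes a three-stage pipeline (signatures are paraphrased, not exact):

    /* 1. crush gives the raw set, possibly with down/non-existent osds */
    len = pg_to_raw_osds(osdmap, pool, pgid, osds);
    /* 2. prune to the up set and pick the up primary */
    len = raw_to_up_osds(osdmap, pool, osds, len, &primary);
    /* 3. apply pg_temp (and, later, primary_temp) overrides */
    len = apply_temps(osdmap, pool, pgid, osds, len, &primary);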
-
Ilya Dryomov authored
apply_temps() helper for applying various temporary mappings (at this point only pg_temp mappings) to the up set, therefore transforming it into an acting set. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
pg_to_raw_osds() helper for computing a raw (crush) set, which can contain non-existent and down osds. raw_to_up_osds() helper for pruning non-existent and down osds from the raw set, therefore transforming it into an up set, and determining the up primary. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Add two helpers to decode primary_affinity (full map, vector<u32>) and new_primary_affinity (inc map, map<u32, u32>) and switch to them. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Add primary_affinity infrastructure. primary_affinity values are stored in a max_osd-sized array, hanging off ceph_osdmap, similar to the osd_weight array. Introduce {get,set}_primary_affinity() helpers, primarily to return CEPH_OSD_DEFAULT_PRIMARY_AFFINITY when no affinity has been set and to abstract out osd_primary_affinity array allocation and initialization. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
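The lazily-allocated array plus default fallback might look like this (a sketch consistent with the description above; names as stated in the commit message):

    int ceph_get_primary_affinity(struct ceph_osdmap *map, int osd)
    {
        if (!map->osd_primary_affinity)
            return CEPH_OSD_DEFAULT_PRIMARY_AFFINITY;   /* nothing set yet */
        return map->osd_primary_affinity[osd];
    }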
-