- Sep 06, 2016
-
-
Chuck Lever authored
Receive buffer exhaustion, if it were to actually occur, would be catastrophic. However, when there are no reply buffers to post, that means all of them have already been posted and are waiting for incoming replies. By design, there can never be more RPCs in flight than there are available receive buffers.
A receive buffer can be left posted after an RPC exits without a received reply; say, due to a credential problem or a soft timeout. This does not result in fewer posted receive buffers than there are pending RPCs, and there is already logic in xprtrdma to deal appropriately with this case.
It also looks like the "+ 2" that was removed was accidentally accommodating the number of extra receive buffers needed for receiving backchannel requests. That will need to be addressed by another patch.
Fixes: 3d4cf35b ("xprtrdma: Reply buffer exhaustion can be...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
- Sep 03, 2016
-
-
Paolo Abeni authored
The commit f9b2ee71 ("SUNRPC: Move UDP receive data path into a workqueue context"), as a side effect, moved the skb_free_datagram() call outside the scope of the related socket lock, but UDP sockets require that lock to be held for proper memory accounting. Fix it by replacing skb_free_datagram() with skb_free_datagram_locked().
Fixes: f9b2ee71 ("SUNRPC: Move UDP receive data path into a workqueue context")
Reported-and-tested-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Cc: stable@vger.kernel.org # 4.4+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
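A minimal sketch of the change described here; the wrapper function is illustrative, while skb_free_datagram() and skb_free_datagram_locked() are the real kernel APIs:

#include <linux/skbuff.h>
#include <net/sock.h>

/* Illustrative wrapper, not the actual xs_udp receive-path hunk. */
static void example_udp_finish_skb(struct sock *sk, struct sk_buff *skb)
{
	/* Before: skb_free_datagram(sk, skb) ran without the socket
	 * lock, so the UDP receive-memory accounting could race. */

	/* After: the locked variant performs the free with the socket
	 * lock held, keeping the memory accounting consistent. */
	skb_free_datagram_locked(sk, skb);
}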
-
- Aug 25, 2016
-
-
Chuck Lever authored
Using NFSv4.1 on RDMA should be safe, so broaden the new checks in rpc_create(). WARN_ON_ONCE is used, matching most other WARN call sites in clnt.c.
Fixes: 39a9beab ("rpc: share one xps between all backchannels")
Fixes: d50039ea ("nfsd4/rpc: move backchannel create logic...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: J. Bruce Fields <bfields@fieldses.org>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
- Aug 05, 2016
-
-
Trond Myklebust authored
We don't want to miss a lease period renewal due to the TCP connection failing to reconnect in a timely fashion. To ensure this doesn't happen, cap the reconnection timer so that we retry the connection attempt at least every 1/2 lease period.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
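A minimal sketch of the idea, assuming a half-lease cap expressed in jiffies; the helper name and parameters are assumptions for illustration, not the actual clnt.c interface:

#include <linux/jiffies.h>

/* Hypothetical helper: clamp the exponential reconnect backoff so a
 * reconnect is attempted at least once every half lease period. */
static void example_cap_reconnect_timeout(unsigned long *reestablish_timeout,
					  unsigned long lease_secs)
{
	unsigned long max_timeout = (lease_secs / 2) * HZ;

	if (*reestablish_timeout > max_timeout)
		*reestablish_timeout = max_timeout;
}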
-
Trond Myklebust authored
...and ensure that we propagate it to new transports on the same client.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Trond Myklebust authored
When the connect attempt fails and backs off, we should start the clock at the last connection attempt, not the time at which we queue up the reconnect job.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
NeilBrown authored
If the net.ipv6.conf.*.use_tempaddr sysctl is set to '2', then TCP connections over IPv6 will prefer a 'private' source address. These eventually expire and become invalid, typically after a week, but the time is configurable. When the local address becomes invalid the client will not be able to receive replies from the server. Eventually the connection will timeout or break and a new connection will be established, but this can take half an hour (typically the TCP connection-break time).
RFC 4941, which describes private IPv6 addresses, acknowledges that some applications might not work well with them and that the application may explicitly request a non-temporary (i.e. "public") address. I believe this is correct for SUNRPC clients. Without this change, a client will occasionally experience a long delay if private addresses have been enabled.
The privacy offered by private addresses is of little value for an NFS server which requires client authentication. For NFSv3 this will often not be a problem because idle connections are closed after 5 minutes. For NFSv4 connections never go idle due to the periodic RENEW (or equivalent) request.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
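A minimal sketch of requesting a public source address, assuming the kernel_setsockopt() interface of this kernel era; the wrapper name is an assumption, the socket-option constants are the real uapi ones:

#include <linux/in6.h>
#include <linux/net.h>
#include <linux/socket.h>

static void example_prefer_public_src(struct socket *sock)
{
	int val = IPV6_PREFER_SRC_PUBLIC;

	/* Ask the IPv6 stack for a non-temporary (public) source
	 * address, so it cannot expire under a long-lived connection
	 * (see RFC 4941). Return value ignored for brevity. */
	kernel_setsockopt(sock, SOL_IPV6, IPV6_ADDR_PREFERENCES,
			  (char *)&val, sizeof(val));
}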
-
Olga Kornievskaia authored
It's possible to have simultaneous upcalls for the same UID but different GSS services. In that case, we need to allow the upcall to gssd to proceed so that the same context is not used by two different GSS services. Some servers lock the use of a context to the GSS service.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Cc: stable@vger.kernel.org # v3.9+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
- Aug 02, 2016
-
-
Trond Myklebust authored
Ensure that we don't forget to set up the disconnection timer for the case when a connect request is fulfilled after the RPC request that initiated it has timed out or been interrupted.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
- Aug 01, 2016
-
-
Trond Myklebust authored
This modification is useful for debugging issues that happen while the socket is being initialised.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
Trond Myklebust authored
We're seeing traces of the following form:

[10952.396347] svc: transport ffff88042ba4a000 dequeued, inuse=2
[10952.396351] svc: tcp_accept ffff88042ba4a000 sock ffff88042a6e4c80
[10952.396362] nfsd: connect from 10.2.6.1, port=187
[10952.396364] svc: svc_setup_socket ffff8800b99bcf00
[10952.396368] setting up TCP socket for reading
[10952.396370] svc: svc_setup_socket created ffff8803eb10a000 (inet ffff88042b75b800)
[10952.396373] svc: transport ffff8803eb10a000 put into queue
[10952.396375] svc: transport ffff88042ba4a000 put into queue
[10952.396377] svc: server ffff8800bb0ec000 waiting for data (to = 3600000)
[10952.396380] svc: transport ffff8803eb10a000 dequeued, inuse=2
[10952.396381] svc_recv: found XPT_CLOSE
[10952.396397] svc: svc_delete_xprt(ffff8803eb10a000)
[10952.396398] svc: svc_tcp_sock_detach(ffff8803eb10a000)
[10952.396399] svc: svc_sock_detach(ffff8803eb10a000)
[10952.396412] svc: svc_sock_free(ffff8803eb10a000)

i.e. an immediate close of the socket after initialisation. The culprit appears to be the test at the end of svc_tcp_init, which checks if the newly created socket is in the TCP_ESTABLISHED state, and immediately closes it if not. The evidence appears to suggest that the socket might still be in the SYN_RECV state at this time.
The fix is to check for both states, and then to add a check in svc_tcp_state_change() to ensure we don't close the socket when it transitions into TCP_ESTABLISHED.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
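A minimal sketch of the broadened state test; the helper is illustrative, while the TCP state constants are the real ones from net/tcp_states.h:

#include <linux/types.h>
#include <net/sock.h>
#include <net/tcp_states.h>

static bool example_tcp_sock_usable(const struct sock *sk)
{
	switch (sk->sk_state) {
	case TCP_ESTABLISHED:
	case TCP_SYN_RECV:	/* handshake may still be completing */
		return true;
	default:
		return false;	/* only now is a close warranted */
	}
}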
-
Trond Myklebust authored
If the connect attempt immediately fails with an EADDRNOTAVAIL error, then that means our choice of source port number was bad. This error is expected when we set the SO_REUSEPORT socket option and we have 2 sockets sharing the same source and destination address and port combinations.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Fixes: 402e23b4 ("SUNRPC: Fix stupid typo in xs_sock_set_reuseport")
Cc: stable@vger.kernel.org # v4.0+
-
- Jul 24, 2016
-
-
Trond Myklebust authored
Fix the report:

net/sunrpc/clnt.c:2580:1: warning: ‘static’ is not at beginning of declaration [-Wold-style-declaration]

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
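For reference, this warning fires when a storage-class specifier such as 'static' is not the first token of the declaration; a minimal made-up example:

int static counter_bad;		/* warns: 'static' not at beginning */
static int counter_good;	/* accepted quietly */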
-
- Jul 19, 2016
-
-
kbuild test robot authored
net/sunrpc/xprtrdma/verbs.c:798:2-3: Unneeded semicolon

Remove unneeded semicolon.
Generated by: scripts/coccinelle/misc/semicolon.cocci
CC: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Frank Sorenson authored
The current min/max resvport settings are independently limited by the entire range of allowed ports, so max_resvport can be set to a port lower than min_resvport.
Prevent inversion of min/max values when set through sysfs and module parameter by setting the limits dependent on each other.
Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Frank Sorenson authored
The current min/max resvport settings are independently limited by the entire range of allowed ports, so max_resvport can be set to a port lower than min_resvport.
Prevent inversion of min/max values when set through sysctl by setting the limits dependent on each other.
Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
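A minimal sketch of the sysctl side, assuming the standard proc_dointvec_minmax() pattern: each tunable's bound points at the other tunable rather than at the absolute port limit, so the pair can never invert. The table below is illustrative, not the literal xprtsock.c hunk:

#include <linux/sysctl.h>

static unsigned int xprt_min_resvport = 665;
static unsigned int xprt_max_resvport = 1023;
static unsigned int xprt_min_resvport_limit = 1;	/* hard floor */
static unsigned int xprt_max_resvport_limit = 65535;	/* hard ceiling */

static struct ctl_table xs_tunables_example[] = {
	{
		.procname	= "min_resvport",
		.data		= &xprt_min_resvport,
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &xprt_min_resvport_limit,
		.extra2		= &xprt_max_resvport,	/* not ..._limit */
	},
	{
		.procname	= "max_resvport",
		.data		= &xprt_max_resvport,
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= &xprt_min_resvport,	/* not ..._limit */
		.extra2		= &xprt_max_resvport_limit,
	},
	{ }
};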
-
Frank Sorenson authored
The range calculation for choosing the random reserved port will panic with divide-by-zero when min_resvport == max_resvport, a range of one port, not zero.
Fix the reserved port range calculation by adding one to the difference.
Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
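A minimal sketch of the corrected calculation; the function name is illustrative of the pattern, not the exact hunk:

#include <linux/random.h>

static unsigned short example_get_random_port(unsigned short min,
					      unsigned short max)
{
	unsigned short range = max - min + 1;	/* was: max - min */

	/* With min == max the range is now 1, not 0: no div-by-zero. */
	return min + (unsigned short)(prandom_u32() % range);
}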
-
Frank Sorenson authored
sunrpc: Fix bit count when setting hashtable size to power-of-two

The hashtable size is incorrectly calculated as the next higher power-of-two when being set to a power-of-two. fls() returns the bit number of the most significant set bit, with the least significant bit being numbered '1'. For a power-of-two, fls() will return a bit number which is one higher than the number of bits required, leading to a hashtable which is twice the requested size. In addition, the value of (1 << nbits) will always be at least num, so the test will never be true.
Fix the hash table size calculation to correctly set hashtable size, and eliminate the unnecessary check.
Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
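A minimal sketch of the fls() pitfall, assuming the kernel's fls() semantics (most-significant set bit, numbered from 1); the helper is illustrative:

#include <linux/bitops.h>

static unsigned int example_hash_bits(unsigned int num)
{
	/* fls(64) == 7, so "1 << fls(num)" would build a 128-entry
	 * table for a requested size of 64, and "(1 << nbits) < num"
	 * could never be true. */
	return fls(num - 1);	/* fls(63) == 6: a 64-entry table */
}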
-
Scott Mayhew authored
A generic_cred can be used to look up a unx_cred or a gss_cred, so it's not really safe to use the generic_cred->acred->ac_flags to store the NO_CRKEY_TIMEOUT flag. A lookup for a unx_cred triggered while the KEY_EXPIRE_SOON flag is already set will cause both NO_CRKEY_TIMEOUT and KEY_EXPIRE_SOON to be set in the ac_flags, leaving the user associated with the auth_cred in a state where they're perpetually doing 4K NFS_FILE_SYNC writes.
This can be reproduced as follows:
1. Mount two NFS filesystems, one with sec=krb5 and one with sec=sys. They do not need to be the same export, nor do they even need to be from the same NFS server. Also, v3 is fine.
   $ sudo mount -o v3,sec=krb5 server1:/export /mnt/krb5
   $ sudo mount -o v3,sec=sys server2:/export /mnt/sys
2. As the normal user, before accessing the kerberized mount, kinit with a short lifetime (but not so short that renewing the ticket would leave you within the 4-minute window again by the time the original ticket expires), e.g.
   $ kinit -l 10m -r 60m
3. Do some I/O to the kerberized mount and verify that the writes are wsize, UNSTABLE:
   $ dd if=/dev/zero of=/mnt/krb5/file bs=1M count=1
4. Wait until you're within 4 minutes of key expiry, then do some more I/O to the kerberized mount to ensure that RPC_CRED_KEY_EXPIRE_SOON gets set. Verify that the writes are 4K, FILE_SYNC:
   $ dd if=/dev/zero of=/mnt/krb5/file bs=1M count=1
5. Now do some I/O to the sec=sys mount. This will cause RPC_CRED_NO_CRKEY_TIMEOUT to be set:
   $ dd if=/dev/zero of=/mnt/sys/file bs=1M count=1
6. Writes for that user will now be permanently 4K, FILE_SYNC, regardless of which mount is being written to, until you reboot the client. Renewing the kerberos ticket (assuming it hasn't already expired) will have no effect. Grabbing a new kerberos ticket at this point will have no effect either.
Move the flag to the auth->au_flags field (which is currently unused) and rename it slightly to reflect that it's no longer associated with the auth_cred->ac_flags. Add the rpc_auth to the arg list of rpcauth_cred_key_to_expire and check the au_flags there too. Finally, add the inode to the arg list of nfs_ctx_key_to_expire so we can determine the rpc_auth to pass to rpcauth_cred_key_to_expire.
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
- Jul 16, 2016
-
-
Trond Myklebust authored
If there were fewer than 2 entries in the multipath list, then xprt_iter_next_entry_multiple() would never advance beyond the first entry, which is correct for round-robin behaviour, but not for the list iteration. The end result would be infinite looping in rpc_clnt_iterate_for_each_xprt(), as we would never see the xprt == NULL condition fulfilled.
Reported-by: Oleg Drokin <green@linuxhacker.ru>
Fixes: 80b14d5e ("SUNRPC: Add a structure to track multiple transports")
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
- Jul 13, 2016
-
-
Trond Myklebust authored
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
Trond Myklebust authored
The current server rpc tcp code attempts to predict how much writeable socket space will be available to a given RPC call before accepting it for processing. On a 40GigE network, we've found this throttles individual clients long before the network or disk is saturated. The server may handle more clients easily, but the bandwidth of individual clients is still artificially limited.
Instead of trying (and failing) to predict how much writeable socket space will be available to the RPC call, just fall back to the simple model of deferring processing until the socket is uncongested. This may increase the risk of fast clients starving slower clients; in such cases, the previous patch allows setting a hard per-connection limit.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
Trond Myklebust authored
Allow the user to limit the number of requests serviced through a single connection, to help prevent faster clients from starving slower clients.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
Trond Myklebust authored
Don't call svc_xprt_enqueue() if the XPT_DATA flag is already set.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
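A minimal sketch of the pattern, using the real XPT_DATA bit and svc_xprt_enqueue(); the wrapper function is illustrative:

#include <linux/sunrpc/svc_xprt.h>

static void example_data_ready(struct svc_xprt *xprt)
{
	/* Enqueue only on the 0 -> 1 transition; if XPT_DATA was
	 * already set, an enqueue is already pending. */
	if (!test_and_set_bit(XPT_DATA, &xprt->xpt_flags))
		svc_xprt_enqueue(xprt);
}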
-
Trond Myklebust authored
Rather than code up our own versions of the socket callbacks, just call the defaults. This also allows us to merge svc_udp_data_ready() and svc_tcp_data_ready().
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
Trond Myklebust authored
Prevent callbacks from triggering while we're detaching the socket.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
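A minimal sketch of the standard idiom for this, assuming sk_callback_lock serializes against the socket callbacks; the function and parameter names are illustrative:

#include <net/sock.h>

static void example_sock_detach(struct sock *sk,
				void (*old_data_ready)(struct sock *))
{
	write_lock_bh(&sk->sk_callback_lock);
	sk->sk_user_data = NULL;	/* late callbacks see no xprt */
	sk->sk_data_ready = old_data_ready;
	write_unlock_bh(&sk->sk_callback_lock);
}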
-
Trond Myklebust authored
Dropping and/or deferring requests has an impact on performance. Let's make sure we can trace those events.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
Trond Myklebust authored
Add a tracepoint to track when the processing of incoming RPC data gets deferred due to out-of-space issues on the outgoing transport.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
Scott Mayhew authored
GSS-Proxy doesn't produce very much debug logging at all. Printing out the gss minor status will aid in troubleshooting if the GSS_Accept_sec_context upcall fails.
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
NeilBrown authored
This field is not currently in use.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
- Jul 11, 2016
-
-
Chuck Lever authored
Before commit 778be232 ("NFS do not find client in NFSv4 pg_authenticate"), the Linux callback server replied with RPC_AUTH_ERROR / RPC_AUTH_BADCRED, instead of dropping the CB request. Let's restore that behavior so the server has a chance to do something useful about it, and provide a warning that helps admins correct the problem.
Fixes: 778be232 ("NFS do not find client in NFSv4 ...")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
If an RPC program does not set vs_dispatch and pc_func() returns rpc_drop_reply, the server sends a reply anyway: a nonsense RPC message consisting of a single word containing the value RPC_DROP_REPLY (in network byte-order, of course).
Fixes: 9e701c61 ("svcrpc: simpler request dropping")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
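A minimal sketch of the server-side test, assuming names close to svc_process_common(); treat the helper as illustrative, not the literal hunk:

#include <linux/sunrpc/svc.h>

static bool example_should_drop(__be32 stat, struct svc_rqst *rqstp)
{
	/* A drop means no reply at all; never XDR-encode the
	 * rpc_drop_reply sentinel into the reply stream. */
	return stat == rpc_drop_reply ||
	       test_bit(RQ_DROPME, &rqstp->rq_flags);
}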
-
Chuck Lever authored
Direct data placement is not allowed when using flavors that guarantee integrity or privacy. When such security flavors are in effect, don't allow the use of Read and Write chunks for moving individual data items. All messages larger than the inline threshold are sent via Long Call or Long Reply.
On my systems (CX-3 Pro on FDR), for small I/O operations, the use of Long messages adds only around 5 usecs of latency in each direction.
Note that when integrity or encryption is used, the host CPU touches every byte in these messages. Even if it could be used, data movement offload doesn't buy much in this case.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
fixup_copy_count should count only the number of bytes copied to the page list. The head and tail are now always handled without a data copy.
And the debugging at the end of rpcrdma_inline_fixup() is also no longer necessary, since copy_len will be non-zero when there is reply data in the tail (a normal and valid case).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Now that rpcrdma_inline_fixup() updates only two fields in rq_rcv_buf, a full memcpy of that structure to rq_private_buf is unwarranted. Updating rq_private_buf fields only where needed also better documents what is going on.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
While trying NFSv4.0/RDMA with sec=krb5p, I noticed small NFS READ operations failed. After the client unwrapped the NFS READ reply message, the NFS READ XDR decoder was not able to decode the reply. The message was "Server cheating in reply", with the reported number of received payload bytes being zero. Applications reported a read(2) that returned -1/EIO.
The problem is rpcrdma_inline_fixup() sets the tail.iov_len to zero when the incoming reply fits entirely in the head iovec. The zero tail.iov_len confused xdr_buf_trim(), which then mangled the actual reply data instead of simply removing the trailing GSS checksum.
As near as I can tell, RPC transports are not supposed to update the head.iov_len, page_len, or tail.iov_len fields in the receive XDR buffer when handling an incoming RPC reply message. These fields contain the length of each component of the XDR buffer, and hence the maximum number of bytes of reply data that can be stored in each XDR buffer component. I've concluded this because:
- This is how xdr_partial_copy_from_skb() appears to behave
- rpcrdma_inline_fixup() already does not alter page_len
- call_decode() compares rq_private_buf and rq_rcv_buf and WARNs if they are not exactly the same
Unfortunately, as soon as I tried the simple fix to just remove the line that sets tail.iov_len to zero, I saw that the logic that appends the implicit Write chunk pad inline depends on inline_fixup setting tail.iov_len to zero. To address this, re-organize the tail iovec handling logic to use the same approach as with the head iovec: simply point tail.iov_base to the correct bytes in the receive buffer.
While I remember all this, write down the conclusion in documenting comments.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
When the remaining length of an incoming reply is longer than the XDR buf's page_len, switch over to the tail iovec instead of copying more than page_len bytes into the page list.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Currently, all three chunk list encoders each use a portion of the one rl_segments array in rpcrdma_req. This is because the MWs for each chunk list were preserved in rl_segments so that ro_unmap could find and invalidate them after the RPC was complete.
However, now that MWs are placed on a per-req linked list as they are registered, there is no longer any information in rpcrdma_mr_seg that is shared between ro_map and ro_unmap_{sync,safe}, and thus nothing in rl_segments needs to be preserved after rpcrdma_marshal_req is complete. Thus the rl_segments array can be used now just for the needs of each rpcrdma_convert_iovs call. Once each chunk list is encoded, the next chunk list encoder is free to re-use all of rl_segments.
This means all three chunk lists in one RPC request can now each encode a full size data payload with no increase in the size of rl_segments. This is a key requirement for Kerberos support, since both the Call and Reply for a single RPC transaction are conveyed via Long messages (RDMA Read/Write). Both can be large.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Instead of placing registered MWs sparsely into the rl_segments array, place these MWs on a per-req list. ro_unmap_{sync,safe} can then simply pull those MWs off the list instead of walking through the array.
This change significantly reduces the size of struct rpcrdma_req by removing nsegs and rl_mw from every array element. As an additional clean-up, chunk co-ordinates are returned in the "*mw" output argument so they are no longer needed in every array element.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
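A minimal sketch of the resulting shape, assuming a standard kernel list_head embedded in each MW; type and field names are illustrative, not the actual xprt_rdma.h definitions:

#include <linux/list.h>

struct example_mw {
	struct list_head mw_list;	/* links onto the per-req list */
	/* registration state elided */
};

static void example_unmap_all(struct list_head *registered)
{
	struct example_mw *mw, *tmp;

	/* Pull each registered MW off the per-req list; nothing is
	 * left behind in a sparse array to walk or preserve. */
	list_for_each_entry_safe(mw, tmp, registered, mw_list) {
		list_del(&mw->mw_list);
		/* ro_unmap-style invalidation would go here */
	}
}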
-
Chuck Lever authored
Instead of leaving orphaned MRs to be released when the transport is destroyed, release them immediately. The MR free list can now be replenished if it becomes exhausted.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-