- Apr 26, 2021
-
-
Chuck Lever authored
After a reconnect, the reply handler is opening the cwnd (and thus enabling more RPC Calls to be sent) /before/ rpcrdma_post_recvs() can post enough Receive WRs to receive their replies. This causes an RNR and the new connection is lost immediately. The race is most clearly exposed when KASAN and disconnect injection are enabled. This slows down rpcrdma_rep_create() enough to allow the send side to post a bunch of RPC Calls before the Receive completion handler can invoke ib_post_recv(). Fixes: 2ae50ad6 ("xprtrdma: Close window between waking RPC senders and posting Receives") Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Trond Myklebust <trond.myklebust@hammerspace.com>
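The ordering constraint here can be made concrete with a tiny userspace model (not part of the patch; names and counts are illustrative): Receives must be replenished before the send window reopens, otherwise a reply can arrive with no Receive WR posted, which on a real fabric is an RNR.

```c
/*
 * Toy model of the race described above: if the send window opens
 * before Receive WRs are replenished, a reply can arrive with no
 * Receive buffer posted, which on real hardware is an RNR.
 */
#include <assert.h>
#include <stdio.h>

struct conn {
	int posted_recvs;   /* Receive WRs currently posted */
	int send_window;    /* RPC Calls the peer may answer */
};

/* Safe order after a reconnect: replenish Receives, then open cwnd. */
static void reconnect(struct conn *c, int credits)
{
	c->posted_recvs = credits;   /* rpcrdma_post_recvs() analogue */
	c->send_window  = credits;   /* only now enable more Calls    */
}

static void reply_arrives(struct conn *c)
{
	assert(c->posted_recvs > 0); /* failing here models the RNR   */
	c->posted_recvs--;
}

int main(void)
{
	struct conn c = { 0, 0 };

	reconnect(&c, 16);
	reply_arrives(&c);
	printf("Receives still posted: %d\n", c.posted_recvs);
	return 0;
}
```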
-
Chuck Lever authored
Defensive clean up: Protect the rb_all_reps list during rep creation. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Trond Myklebust <trond.myklebust@hammerspace.com>
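As a rough illustration of the pattern (a simplified userspace model, not the xprtrdma code; pthread locking stands in for the kernel's buffer lock): insertion onto the shared all-reps list happens under the lock, so a concurrent traversal never observes a half-linked entry.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct rep {
	struct rep *all_next;              /* rb_all_reps linkage analogue */
};

struct buffer {
	pthread_mutex_t rb_lock;
	struct rep *all_reps;
};

static struct rep *rep_create(struct buffer *buf)
{
	struct rep *rep = calloc(1, sizeof(*rep));

	if (!rep)
		return NULL;
	pthread_mutex_lock(&buf->rb_lock); /* the added protection */
	rep->all_next = buf->all_reps;
	buf->all_reps = rep;
	pthread_mutex_unlock(&buf->rb_lock);
	return rep;
}

int main(void)
{
	struct buffer buf = { PTHREAD_MUTEX_INITIALIZER, NULL };
	int n = 0;

	for (int i = 0; i < 3; i++)
		rep_create(&buf);
	for (struct rep *r = buf.all_reps; r; r = r->all_next)
		n++;
	printf("reps on all list: %d\n", n);
	return 0;
}
```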
-
Chuck Lever authored
Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Trond Myklebust <trond.myklebust@hammerspace.com>
-
Chuck Lever authored
Currently rpcrdma_reps_destroy() assumes that, at transport tear-down, the content of the rb_free_reps list is the same as the content of the rb_all_reps list. Although that is usually true, using the rb_all_reps list should be more reliable because of the way it's managed. And, rpcrdma_reps_unmap() uses rb_all_reps; these two functions should both traverse the "all" list. Ensure that all rpcrdma_reps are always destroyed whether they are on the rep free list or not. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Trond Myklebust <trond.myklebust@hammerspace.com>
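A minimal model of the idea (illustrative names only, not the kernel code): every rep is linked on an "all" list at creation, while the free list holds only the reps that are currently idle, so tear-down that walks the "all" list cannot leak a rep that never made it back onto the free list.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct rep {
	struct rep *all_next;   /* always linked: rb_all_reps analogue     */
	bool on_free_list;      /* false while its Receive is still posted */
};

/* Tear-down walks the "all" list, so every rep is destroyed whether or
 * not it ever returned to the free list. */
static int reps_destroy_all(struct rep *all_reps)
{
	int destroyed = 0;

	while (all_reps) {
		struct rep *next = all_reps->all_next;

		free(all_reps);
		all_reps = next;
		destroyed++;
	}
	return destroyed;
}

int main(void)
{
	struct rep *all = NULL;

	for (int i = 0; i < 4; i++) {
		struct rep *r = calloc(1, sizeof(*r));

		if (!r)
			return 1;
		r->on_free_list = (i % 2 == 0);  /* some reps are not free */
		r->all_next = all;
		all = r;
	}
	printf("destroyed %d reps\n", reps_destroy_all(all));
	return 0;
}
```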
-
Chuck Lever authored
Defer destruction of an rpcrdma_rep until transport tear-down to preserve the rb_all_reps list while Receives flush. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Reviewed-by:
Tom Talpey <tom@talpey.com> Signed-off-by:
Trond Myklebust <trond.myklebust@hammerspace.com>
-
Chuck Lever authored
Currently the Receive completion handler refreshes the Receive Queue whenever a successful Receive completion occurs. On disconnect, xprtrdma drains the Receive Queue. The first few Receive completions after a disconnect are typically successful, until the first flushed Receive. This means the Receive completion handler continues to post more Receive WRs after the drain sentinel has been posted. The late-posted Receives flush after the drain sentinel has completed, leading to a crash later in rpcrdma_xprt_disconnect(). To prevent this crash, xprtrdma has to ensure that the Receive handler stops posting Receives before ib_drain_rq() posts its drain sentinel. Suggested-by:
Tom Talpey <tom@talpey.com> Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Trond Myklebust <trond.myklebust@hammerspace.com>
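A toy sketch of the shutdown ordering described above (userspace model, not the kernel code): a posting-allowed flag is cleared before the drain, so the completion handler stops refilling the Receive Queue once the drain sentinel may be in flight.

```c
#include <stdbool.h>
#include <stdio.h>

struct endpoint {
	bool rep_posting_allowed;   /* cleared before draining the RQ */
	int  posted_recvs;
};

/* Receive completion handler: only refill while posting is allowed. */
static void recv_done(struct endpoint *ep)
{
	ep->posted_recvs--;
	if (ep->rep_posting_allowed)
		ep->posted_recvs++;        /* ib_post_recv() analogue */
}

static void disconnect(struct endpoint *ep)
{
	ep->rep_posting_allowed = false;   /* step 1: stop new posts */
	/* step 2: the ib_drain_rq() analogue may now post its drain
	 * sentinel without racing against late Receive posts. */
}

int main(void)
{
	struct endpoint ep = { true, 4 };

	disconnect(&ep);
	recv_done(&ep);                    /* flushes, does not repost */
	printf("posted after drain: %d\n", ep.posted_recvs);
	return 0;
}
```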
-
Chuck Lever authored
Commit e340c2d6 ("xprtrdma: Reduce the doorbell rate (Receive)") increased the number of Receive WRs that are posted by the client, but did not increase the size of the Receive Queue allocated during transport set-up. This is usually not an issue because RPCRDMA_BACKWARD_WRS is defined as (32) when SUNRPC_BACKCHANNEL is defined. In cases where it isn't, there is a real risk of Receive Queue wrapping. Fixes: e340c2d6 ("xprtrdma: Reduce the doorbell rate (Receive)") Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Reviewed-by:
Tom Talpey <tom@talpey.com> Signed-off-by:
Trond Myklebust <trond.myklebust@hammerspace.com>
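A back-of-the-envelope illustration of the sizing concern (the constants below are illustrative, not the kernel's): the Receive Queue must be at least as deep as the largest number of Receive WRs that can be posted at once, or it wraps.

```c
/* Sizing check with made-up numbers: the Receive Queue has to cover
 * every Receive WR the client can have outstanding at one time. */
#include <stdio.h>

int main(void)
{
	int credits         = 128;  /* max in-flight RPC Calls              */
	int post_batch      = 8;    /* extra WRs posted ahead per batch     */
	int backchannel_wrs = 32;   /* backward-direction WRs when enabled  */

	int needed   = credits + post_batch + backchannel_wrs;
	int rq_depth = 160;         /* a hypothetical allocation            */

	printf("need %d, have %d -> %s\n", needed, rq_depth,
	       needed <= rq_depth ? "ok" : "Receive Queue can wrap");
	return 0;
}
```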
-
- Apr 14, 2021
-
-
Chuck Lever authored
I've hit some crashes that occur in the xprt_rdma_inject_disconnect path. It appears that, for some providers, rdma_disconnect() can take so long that the transport can disconnect and release its hardware resources while rdma_disconnect() is still running, resulting in a UAF in the provider. The transport's fault injection method may depend on the stability of transport data structures. That means it needs to be invoked only from contexts that hold the transport write lock. Fixes: 4a068258 ("SUNRPC: Transport fault injection") Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Trond Myklebust <trond.myklebust@hammerspace.com>
-
- Mar 11, 2021
-
-
Chuck Lever authored
I tested commit 43042b90 ("svcrdma: Reduce Receive doorbell rate") with mlx4 (IB) and software iWARP and didn't find any issues. However, I recently got my hardware iWARP setup back on line (FastLinQ) and it's crashing hard on this commit (confirmed via bisect). The failure mode is complex:
- After a connection is established, the first Receive completes normally.
- But the second and third Receives have garbage in their Receive buffers. The server responds with ERR_VERS as a result.
- When the client tears down the connection to retry, a couple of posted Receives flush twice, and that corrupts the recv_ctxt free list.
- __svc_rdma_free then faults or loops infinitely while destroying the xprt's recv_ctxts.
Since 43042b90 ("svcrdma: Reduce Receive doorbell rate") does not fix a bug but is a scalability enhancement, it's safe and appropriate to revert it while working on a replacement. Fixes: 43042b90 ("svcrdma: Reduce Receive doorbell rate") Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
- Mar 06, 2021
-
-
Timo Rothenpieler authored
This brings it in line with the regular TCP backchannel, which also has all those timeouts disabled. It prevents the backchannel from timing out, which was causing some async operations, such as server-side copy, to get stuck indefinitely on the client side. Signed-off-by:
Timo Rothenpieler <timo@rothenpieler.org> Fixes: 5d252f90 ("svcrdma: Add class for RDMA backwards direction transport") Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
- Feb 15, 2021
-
-
Chuck Lever authored
RDMA core mutex locking was restructured by commit d114c6fe ("RDMA/cma: Add missing locking to rdma_accept()") [Aug 2020]. When lock debugging is enabled, the RPC/RDMA server trips over the new lockdep assertion in rdma_accept() because it doesn't call rdma_accept() from its CM event handler. As a temporary fix, have svc_rdma_accept() take the handler_mutex explicitly. In the meantime, let's consider how to restructure the RPC/RDMA transport to invoke rdma_accept() from the proper context. Calls to svc_rdma_accept() are serialized with calls to svc_rdma_free() by the generic RPC server layer. Suggested-by:
Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/linux-rdma/20210209154014.GO4247@nvidia.com/ Fixes: d114c6fe ("RDMA/cma: Add missing locking to rdma_accept()") Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
- Feb 05, 2021
-
-
Chuck Lever authored
Since commit 9ed5af26 ("SUNRPC: Clean up the handling of page padding in rpc_prepare_reply_pages()") [Dec 2020] the NFS client passes payload data to the transport with the padding in xdr->pages instead of in the send buffer's tail kvec. There's no need for the extra logic to advance the base of the tail kvec because the upper layer no longer places XDR padding there. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
The NetApp Linux team discovered that with NFS/RDMA servers that do not support RFC 8797, the Linux client is forming NFSv4.x WRITE requests incorrectly. In this case, the Linux NFS client disables implicit chunk round-up for odd-length Read and Write chunks. The goal was to support old servers that needed that padding to be sent explicitly by clients. In that case the Linux NFS client included the tail kvec in the Read chunk, since the tail contains any needed padding. That meant a separate memory registration was needed for the tail kvec, adding to the cost of forming such requests. To avoid that cost for a mere 3 bytes of zeroes that are always ignored by receivers, we try to use implicit round-up when possible. For NFSv4.x, the tail kvec also sometimes contains a trailing GETATTR operation. The Linux NFS client unintentionally includes that GETATTR operation in the Read chunk as well as inline. The fix is simply to /never/ include the tail kvec when forming a data payload Read chunk. The padding is thus now always present. Note that since commit 9ed5af26 ("SUNRPC: Clean up the handling of page padding in rpc_prepare_reply_pages()") [Dec 2020] the NFS client passes payload data to the transport with the padding in xdr->pages instead of in the send buffer's tail kvec. So now the Linux NFS client appends XDR padding to all odd-sized Read chunks. This shouldn't be a problem because:
- RFC 8166-compliant servers are supposed to work with or without that XDR padding in Read chunks.
- Since the padding is now in the same memory region as the data payload, a separate memory registration is not needed. In addition, the link layer extends data in RDMA Read responses to 4-byte boundaries anyway. Thus there is no longer any savings from leaving the padding out.
Because older kernels include the payload's XDR padding in the tail kvec, a fix there will be more complicated. Thus backporting this patch is not recommended. Reported-by: Olga Kornievskaia <Olga.Kornievskaia@netapp.com> Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Reviewed-by:
Tom Talpey <tom@talpey.com> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
During the final stages of publication of RFC 8167, reviewers requested that we use the term "reverse direction" rather than "backwards direction". Update comments to reflect this preference. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Reviewed-by:
Tom Talpey <tom@talpey.com> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Clean up so that offset_in_page() is invoked less often in the most common case, which is mapping xdr->pages. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Reviewed-by:
Tom Talpey <tom@talpey.com> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
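A simplified model of this cleanup (not the transport code itself): only the first page of xdr->pages can start mid-page, so the starting offset is computed once and every later iteration begins at offset zero.

```c
/* Only the first page can start mid-page: compute that offset once
 * instead of calling offset_in_page() on every loop iteration.
 * (Illustrative userspace model, not kernel code.) */
#include <stdio.h>

#define PAGE_SIZE 4096u

static unsigned int map_pages(unsigned int page_base, unsigned int len)
{
	unsigned int nsegs = 0;

	while (len) {
		unsigned int seg = PAGE_SIZE - page_base;

		if (seg > len)
			seg = len;
		len -= seg;
		page_base = 0;     /* every later page starts at offset 0 */
		nsegs++;
	}
	return nsegs;
}

int main(void)
{
	printf("segments: %u\n", map_pages(100, 9000));
	return 0;
}
```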
-
Chuck Lever authored
Clean up. Remove a conditional branch from the SGL set-up loop in frwr_map(): Instead of using either sg_set_page() or sg_set_buf(), initialize the mr_page field properly when rpcrdma_convert_kvec() converts the kvec to an SGL entry. frwr_map() can then invoke sg_set_page() unconditionally. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Reviewed-by:
Tom Talpey <tom@talpey.com> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
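A kernel-flavored sketch of the resulting shape (field and function names are illustrative; this is not the actual frwr_map()/rpcrdma_convert_kvec() code and only compiles against kernel headers): once kvec conversion records the backing page, the mapping loop needs only sg_set_page().

```c
#include <linux/mm.h>
#include <linux/scatterlist.h>

/* Illustrative segment descriptor: mr_page is now populated even for
 * kvec-backed segments, so the SGL loop has a single code path. */
struct seg {
	struct page *mr_page;
	unsigned int mr_len;
	unsigned int mr_offset;   /* offset within mr_page */
};

static void convert_kvec_seg(struct seg *seg, void *addr, unsigned int len)
{
	seg->mr_page   = virt_to_page(addr);
	seg->mr_offset = offset_in_page(addr);
	seg->mr_len    = len;
}

static void map_seg(struct scatterlist *sg, const struct seg *seg)
{
	/* one path for both page- and kvec-backed segments:
	 * no sg_set_page()-vs-sg_set_buf() branch */
	sg_set_page(sg, seg->mr_page, seg->mr_len, seg->mr_offset);
}
```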
-
Chuck Lever authored
Support for FMR was removed by commit ba69cd12 ("xprtrdma: Remove support for FMR memory registration") [Dec 2018]. That means the buffer-splitting behavior of rpcrdma_convert_kvec(), added by commit 821c791a ("xprtrdma: Segment head and tail XDR buffers on page boundaries") [Mar 2016], is no longer necessary. FRWR memory registration handles this case with aplomb. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
-
- Jan 25, 2021
-
-
Chuck Lever authored
The Receive completion handler doesn't look at the contents of the Receive buffer. The DMA sync isn't terribly expensive but it's one less thing that needs to be done by the Receive completion handler, which is single-threaded (per svc_xprt). This helps scalability. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Reviewed-by:
Christoph Hellwig <hch@lst.de>
-
Chuck Lever authored
This is similar to commit e340c2d6 ("xprtrdma: Reduce the doorbell rate (Receive)") which added Receive batching to the client. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
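Receive batching can be modeled in a few lines of userspace C (illustrative threshold and names, not the svcrdma code): instead of one post per completion, completions accumulate a count and a single chained post, i.e. one doorbell, covers the whole batch.

```c
/* Toy model of Receive batching: accumulate owed Receives and issue a
 * single chained post -- one doorbell -- per batch of completions. */
#include <stdio.h>

#define BATCH 8

struct rq {
	int wanted;      /* Receives owed to the hardware     */
	int doorbells;   /* how many chained posts were made  */
};

static void recv_completed(struct rq *rq)
{
	if (++rq->wanted < BATCH)
		return;                 /* defer: no doorbell yet           */
	rq->doorbells++;                /* post all 'wanted' WRs as a chain */
	rq->wanted = 0;
}

int main(void)
{
	struct rq rq = { 0, 0 };

	for (int i = 0; i < 64; i++)
		recv_completed(&rq);
	printf("64 completions -> %d doorbells\n", rq.doorbells);
	return 0;
}
```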
-
Chuck Lever authored
Clean up. We are not permitted to remove old proc files. Instead, convert these variables to stubs that are only ever allowed to display a value of zero. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
Now that we have an efficient mechanism to update these two stats, let's start maintaining them again. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
Avoid the overhead of a memory bus lock cycle for counting a value that is hardly ever used. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
Receives are frequent events. Avoid the overhead of a memory bus lock cycle for counting a value that is hardly ever used. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
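The trade-off behind these two patches can be illustrated with a small userspace model (not the svcrdma code): a shared atomic counter pays a locked bus cycle on every increment, whereas per-CPU slots are incremented cheaply and only summed on the rare read.

```c
/* Shared atomic counter vs. per-CPU counters, as a userspace model:
 * the per-CPU scheme pays its cost only when the value is read. */
#include <stdatomic.h>
#include <stdio.h>

#define NR_CPUS 4

static _Atomic long shared_count;      /* every increment locks the bus */
static long percpu_count[NR_CPUS];     /* each CPU touches its own slot */

static void count_event(int cpu)
{
	atomic_fetch_add(&shared_count, 1);  /* the old, expensive way */
	percpu_count[cpu]++;                 /* the cheap per-CPU way  */
}

static long percpu_read(void)
{
	long sum = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += percpu_count[cpu];    /* pay the cost only on read */
	return sum;
}

int main(void)
{
	for (int i = 0; i < 1000; i++)
		count_event(i % NR_CPUS);
	printf("atomic=%ld percpu=%ld\n",
	       atomic_load(&shared_count), percpu_read());
	return 0;
}
```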
-
Chuck Lever authored
Setting up the proc variables is about to get more complicated. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
- Dec 14, 2020
-
-
Chuck Lever authored
Olga K. observed that rpcrdma_marshal_req() allocates sparse pages only when it has determined that a Reply chunk is necessary. There are plenty of cases where no Reply chunk is needed, but the XDRBUF_SPARSE_PAGES flag is set. The result would be a crash in rpcrdma_inline_fixup() when it tries to copy parts of the received Reply into a missing page. To avoid crashing, handle sparse page allocation up front. Until XATTR support was added, this issue did not appear often because the only SPARSE_PAGES consumer always expected a reply large enough to always require a Reply chunk. Reported-by:
Olga Kornievskaia <kolga@netapp.com> Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Trond Myklebust <trond.myklebust@hammerspace.com>
-
- Dec 02, 2020
-
-
Trond Myklebust authored
According to RFC 5666, the correct netid for an IPv6-addressed RDMA transport is "rdma6", which we've supported as a mount option since Linux-4.7. The problem is that an attempt to load the module "xprtrdma6" will fail, since there is no module alias of that name. Fixes: 181342c5 ("xprtrdma: Add rdma6 option to support NFS/RDMA IPv6") Signed-off-by:
Trond Myklebust <trond.myklebust@hammerspace.com>
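The shape of such a fix is a module alias declaration, sketched below under the assumption that the standard MODULE_ALIAS macro is used (the actual patch may place or spell this differently):

```c
#include <linux/module.h>

/* A request to load "xprtrdma6" (the "rdma6" netid) now resolves to
 * the existing RPC/RDMA client transport module. */
MODULE_ALIAS("xprtrdma6");
```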
-
- Nov 30, 2020
-
-
Chuck Lever authored
An efficient way to handle multiple Read chunks is to post them all together and then take a single completion. This is also how the code is already structured: when the Read completion fires, all portions of the incoming RPC message are available to be assembled. The difficult problem is setting up the Read sink buffers so that the server pulls the client's data into place, making subsequent pull-up unnecessary. There are several cases:
* No Read chunks. No-op.
* One data item Read chunk. This is the fast case, where the inline part of the RPC-over-RDMA message becomes the head and tail, and the data item chunk is placed in buf->pages.
* A Position-zero Read chunk. Treated like TCP: the Read chunk is pulled into contiguous pages.
+ A Position-zero Read chunk with data item chunks. Treated like TCP: all of the Read chunks are pulled into contiguous pages.
+ Multiple data item chunks. Treated like TCP: the inline part is copied and the data item chunks are pulled into contiguous pages.
The "*" cases are already supported. This patch adds support for the "+" cases. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
As a pre-requisite for handling multiple Read chunks in each Read list, convert svc_rdma_recv_read_chunk() to use the new parsed Read chunk list. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
I'm about to change the purpose of ri_chunklen: Instead of tracking the number of bytes in one Read chunk, it will track the total number of bytes in the Read list. Rename it for clarity. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
We already have trace_svcrdma_decode_rseg(), which records each ingress Read segment. Instead of reporting those again when they are about to be posted as RDMA Reads, let's fire one tracepoint before posting each type of chunk. So we'll get:
nfsd-1998 [002] 321.666615: svcrdma_decode_rseg: cq.id=4 cid=42 segno=0 position=0 192@0x013ca9ebfae14000:0xb0010b05
nfsd-1998 [002] 321.666615: svcrdma_decode_rseg: cq.id=4 cid=42 segno=1 position=0 7688@0x013ca9ebf914e000:0xb0010a05
nfsd-1998 [002] 321.666615: svcrdma_decode_rseg: cq.id=4 cid=42 segno=2 position=0 28@0x013ca9ebfae15000:0xb0010905
nfsd-1998 [002] 321.666622: svcrdma_decode_rqst: cq.id=4 cid=42 xid=0x013ca9eb vers=1 credits=128 proc=RDMA_NOMSG hdrlen=100
nfsd-1998 [002] 321.666642: svcrdma_post_read_chunk: cq.id=3 cid=112 sqecount=3
kworker/2:1H-221 [002] 321.673949: svcrdma_wc_read: cq.id=3 cid=112 status=SUCCESS (0/0x0)
Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
Clean up: These pointers are no longer used. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
Refactor svc_rdma_send_reply_chunk() so that it Sends only the parts of rq_res that do not contain a result payload. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
Refactor: svc_rdma_map_reply_msg() is restructured to DMA map only the parts of rq_res that do not contain a result payload. This change has been tested to confirm that it does not cause a regression in the no Write chunk and single Write chunk cases. Multiple Write chunks have not been tested. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
When counting the number of SGEs needed to construct a Send request, do not count result payloads. Likewise, when copying the Reply message into the pull-up buffer, do not copy result payloads into the Send buffer. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
Refactor: Instead of re-parsing the ingress RPC Call transport header when constructing the egress RPC Reply transport header, use the new parsed Write list and Reply chunk, which are version- agnostic and already XDR decoded. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
Refactor: Instead of re-parsing the ingress RPC Call transport header when constructing RDMA Writes, use the new parsed chunk lists for the Write list and Reply chunk, which are version-agnostic and already XDR-decoded. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
Refactor: Don't duplicate header decoding smarts here. Instead, use the new parsed chunk lists. Note that the XID sanity test is also removed. The XID is already looked up by the cb handler, and is rejected if it's not recognized. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
Refactor: Don't duplicate header decoding smarts here. Instead, use the new parsed chunk lists. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
This simple data structure binds the location of each data payload inside of an RPC message to the chunk that will be used to push it to or pull it from the client. There are several benefits to this small additional overhead:
* It enables support for more than one chunk in incoming Read and Write lists.
* It translates the version-specific on-the-wire format into a generic in-memory structure, enabling support for multiple versions of the RPC/RDMA transport protocol.
* It enables the server to re-organize a chunk list if it needs to adjust where Read chunk data lands in server memory without altering the contents of the XDR-encoded Receive buffer.
Construction of these lists is done while sanity checking each incoming RPC/RDMA header. Subsequent patches will make use of the generated data structures. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
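A conceptual sketch of such a parsed chunk list (type and field names are illustrative, not necessarily the kernel's): each chunk records where its payload sits in the RPC message plus the XDR-decoded segments needed to move it.

```c
/* Conceptual sketch only: a flat, version-independent form of the Read
 * list / Write list that later code can consume without re-decoding
 * the transport header. Names here are illustrative. */
#include <linux/types.h>
#include <linux/list.h>

struct parsed_segment {
	u32 handle;                /* R_key for the RDMA transfer        */
	u32 length;                /* bytes addressed by this segment    */
	u64 offset;                /* remote offset of the segment       */
};

struct parsed_chunk {
	struct list_head list;     /* linkage on the chunk list          */
	u32 position;              /* XDR offset of the payload (Reads)  */
	u32 total_length;          /* sum of the segment lengths         */
	unsigned int segcount;     /* entries in segments[]              */
	struct parsed_segment segments[];
};

struct parsed_chunk_list {
	unsigned int chunk_count;
	struct list_head chunks;   /* struct parsed_chunk entries        */
};
```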
-
Chuck Lever authored
Refactor: Match the control flow of svc_rdma_encode_write_list(). Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-