- Apr 26, 2021
-
-
Chuck Lever authored
After a reconnect, the reply handler is opening the cwnd (and thus enabling more RPC Calls to be sent) /before/ rpcrdma_post_recvs() can post enough Receive WRs to receive their replies. This causes an RNR and the new connection is lost immediately. The race is most clearly exposed when KASAN and disconnect injection are enabled. This slows down rpcrdma_rep_create() enough to allow the send side to post a bunch of RPC Calls before the Receive completion handler can invoke ib_post_recv(). Fixes: 2ae50ad6 ("xprtrdma: Close window between waking RPC senders and posting Receives") Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Trond Myklebust <trond.myklebust@hammerspace.com>
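The ordering constraint here can be made concrete with a tiny userspace model (not part of the patch; names and counts are illustrative): Receives must be replenished before the send window reopens, otherwise a reply can arrive with no Receive WR posted, which on a real fabric is an RNR.

```c
/*
 * Toy model of the race described above: if the send window opens
 * before Receive WRs are replenished, a reply can arrive with no
 * Receive buffer posted, which on real hardware is an RNR.
 */
#include <assert.h>
#include <stdio.h>

struct conn {
	int posted_recvs;   /* Receive WRs currently posted */
	int send_window;    /* RPC Calls the peer may answer */
};

/* Safe order after a reconnect: replenish Receives, then open cwnd. */
static void reconnect(struct conn *c, int credits)
{
	c->posted_recvs = credits;   /* rpcrdma_post_recvs() analogue */
	c->send_window  = credits;   /* only now enable more Calls    */
}

static void reply_arrives(struct conn *c)
{
	assert(c->posted_recvs > 0); /* failing here models the RNR   */
	c->posted_recvs--;
}

int main(void)
{
	struct conn c = { 0, 0 };

	reconnect(&c, 16);
	reply_arrives(&c);
	printf("Receives still posted: %d\n", c.posted_recvs);
	return 0;
}
```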
-
Chuck Lever authored
Defensive clean up: Protect the rb_all_reps list during rep creation. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Trond Myklebust <trond.myklebust@hammerspace.com>
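As a rough illustration of the pattern (a simplified userspace model, not the xprtrdma code; pthread locking stands in for the kernel's buffer lock): insertion onto the shared all-reps list happens under the lock, so a concurrent traversal never observes a half-linked entry.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct rep {
	struct rep *all_next;              /* rb_all_reps linkage analogue */
};

struct buffer {
	pthread_mutex_t rb_lock;
	struct rep *all_reps;
};

static struct rep *rep_create(struct buffer *buf)
{
	struct rep *rep = calloc(1, sizeof(*rep));

	if (!rep)
		return NULL;
	pthread_mutex_lock(&buf->rb_lock); /* the added protection */
	rep->all_next = buf->all_reps;
	buf->all_reps = rep;
	pthread_mutex_unlock(&buf->rb_lock);
	return rep;
}

int main(void)
{
	struct buffer buf = { PTHREAD_MUTEX_INITIALIZER, NULL };
	int n = 0;

	for (int i = 0; i < 3; i++)
		rep_create(&buf);
	for (struct rep *r = buf.all_reps; r; r = r->all_next)
		n++;
	printf("reps on all list: %d\n", n);
	return 0;
}
```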
-
Chuck Lever authored
Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Trond Myklebust <trond.myklebust@hammerspace.com>
-
Chuck Lever authored
Currently rpcrdma_reps_destroy() assumes that, at transport tear-down, the content of the rb_free_reps list is the same as the content of the rb_all_reps list. Although that is usually true, using the rb_all_reps list should be more reliable because of the way it's managed. And, rpcrdma_reps_unmap() uses rb_all_reps; these two functions should both traverse the "all" list. Ensure that all rpcrdma_reps are always destroyed whether they are on the rep free list or not. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Trond Myklebust <trond.myklebust@hammerspace.com>
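A minimal model of the idea (illustrative names only, not the kernel code): every rep is linked on an "all" list at creation, while the free list holds only the reps that are currently idle, so tear-down that walks the "all" list cannot leak a rep that never made it back onto the free list.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct rep {
	struct rep *all_next;   /* always linked: rb_all_reps analogue     */
	bool on_free_list;      /* false while its Receive is still posted */
};

/* Tear-down walks the "all" list, so every rep is destroyed whether or
 * not it ever returned to the free list. */
static int reps_destroy_all(struct rep *all_reps)
{
	int destroyed = 0;

	while (all_reps) {
		struct rep *next = all_reps->all_next;

		free(all_reps);
		all_reps = next;
		destroyed++;
	}
	return destroyed;
}

int main(void)
{
	struct rep *all = NULL;

	for (int i = 0; i < 4; i++) {
		struct rep *r = calloc(1, sizeof(*r));

		if (!r)
			return 1;
		r->on_free_list = (i % 2 == 0);  /* some reps are not free */
		r->all_next = all;
		all = r;
	}
	printf("destroyed %d reps\n", reps_destroy_all(all));
	return 0;
}
```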
-
Chuck Lever authored
Defer destruction of an rpcrdma_rep until transport tear-down to preserve the rb_all_reps list while Receives flush. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Reviewed-by:
Tom Talpey <tom@talpey.com> Signed-off-by:
Trond Myklebust <trond.myklebust@hammerspace.com>
-
Chuck Lever authored
Currently the Receive completion handler refreshes the Receive Queue whenever a successful Receive completion occurs. On disconnect, xprtrdma drains the Receive Queue. The first few Receive completions after a disconnect are typically successful, until the first flushed Receive. This means the Receive completion handler continues to post more Receive WRs after the drain sentinel has been posted. The late-posted Receives flush after the drain sentinel has completed, leading to a crash later in rpcrdma_xprt_disconnect(). To prevent this crash, xprtrdma has to ensure that the Receive handler stops posting Receives before ib_drain_rq() posts its drain sentinel. Suggested-by:
Tom Talpey <tom@talpey.com> Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Trond Myklebust <trond.myklebust@hammerspace.com>
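A toy sketch of the shutdown ordering described above (userspace model, not the kernel code): a posting-allowed flag is cleared before the drain, so the completion handler stops refilling the Receive Queue once the drain sentinel may be in flight.

```c
#include <stdbool.h>
#include <stdio.h>

struct endpoint {
	bool rep_posting_allowed;   /* cleared before draining the RQ */
	int  posted_recvs;
};

/* Receive completion handler: only refill while posting is allowed. */
static void recv_done(struct endpoint *ep)
{
	ep->posted_recvs--;
	if (ep->rep_posting_allowed)
		ep->posted_recvs++;        /* ib_post_recv() analogue */
}

static void disconnect(struct endpoint *ep)
{
	ep->rep_posting_allowed = false;   /* step 1: stop new posts */
	/* step 2: the ib_drain_rq() analogue may now post its drain
	 * sentinel without racing against late Receive posts. */
}

int main(void)
{
	struct endpoint ep = { true, 4 };

	disconnect(&ep);
	recv_done(&ep);                    /* flushes, does not repost */
	printf("posted after drain: %d\n", ep.posted_recvs);
	return 0;
}
```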
-
Chuck Lever authored
Commit e340c2d6 ("xprtrdma: Reduce the doorbell rate (Receive)") increased the number of Receive WRs that are posted by the client, but did not increase the size of the Receive Queue allocated during transport set-up. This is usually not an issue because RPCRDMA_BACKWARD_WRS is defined as (32) when SUNRPC_BACKCHANNEL is defined. In cases where it isn't, there is a real risk of Receive Queue wrapping. Fixes: e340c2d6 ("xprtrdma: Reduce the doorbell rate (Receive)") Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Reviewed-by:
Tom Talpey <tom@talpey.com> Signed-off-by:
Trond Myklebust <trond.myklebust@hammerspace.com>
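A back-of-the-envelope illustration of the sizing concern (the constants below are illustrative, not the kernel's): the Receive Queue must be at least as deep as the largest number of Receive WRs that can be posted at once, or it wraps.

```c
/* Sizing check with made-up numbers: the Receive Queue has to cover
 * every Receive WR the client can have outstanding at one time. */
#include <stdio.h>

int main(void)
{
	int credits         = 128;  /* max in-flight RPC Calls              */
	int post_batch      = 8;    /* extra WRs posted ahead per batch     */
	int backchannel_wrs = 32;   /* backward-direction WRs when enabled  */

	int needed   = credits + post_batch + backchannel_wrs;
	int rq_depth = 160;         /* a hypothetical allocation            */

	printf("need %d, have %d -> %s\n", needed, rq_depth,
	       needed <= rq_depth ? "ok" : "Receive Queue can wrap");
	return 0;
}
```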
-
- Apr 14, 2021
-
-
Chuck Lever authored
I've hit some crashes that occur in the xprt_rdma_inject_disconnect path. It appears that, for some providers, rdma_disconnect() can take so long that the transport can disconnect and release its hardware resources while rdma_disconnect() is still running, resulting in a UAF in the provider. The transport's fault injection method may depend on the stability of transport data structures. That means it needs to be invoked only from contexts that hold the transport write lock. Fixes: 4a068258 ("SUNRPC: Transport fault injection") Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Trond Myklebust <trond.myklebust@hammerspace.com>
-
- Mar 11, 2021
-
-
Chuck Lever authored
I tested commit 43042b90 ("svcrdma: Reduce Receive doorbell rate") with mlx4 (IB) and software iWARP and didn't find any issues. However, I recently got my hardware iWARP setup back on line (FastLinQ) and it's crashing hard on this commit (confirmed via bisect). The failure mode is complex:
- After a connection is established, the first Receive completes normally.
- But the second and third Receives have garbage in their Receive buffers. The server responds with ERR_VERS as a result.
- When the client tears down the connection to retry, a couple of posted Receives flush twice, and that corrupts the recv_ctxt free list.
- __svc_rdma_free then faults or loops infinitely while destroying the xprt's recv_ctxts.
Since 43042b90 ("svcrdma: Reduce Receive doorbell rate") does not fix a bug but is a scalability enhancement, it's safe and appropriate to revert it while working on a replacement. Fixes: 43042b90 ("svcrdma: Reduce Receive doorbell rate") Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
- Mar 06, 2021
-
-
Timo Rothenpieler authored
This brings it in line with the regular TCP backchannel, which also has all those timeouts disabled. It prevents the backchannel from timing out, which was causing some async operations, such as server-side copy, to get stuck indefinitely on the client side. Signed-off-by:
Timo Rothenpieler <timo@rothenpieler.org> Fixes: 5d252f90 ("svcrdma: Add class for RDMA backwards direction transport") Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
- Feb 15, 2021
-
-
Chuck Lever authored
RDMA core mutex locking was restructured by commit d114c6fe ("RDMA/cma: Add missing locking to rdma_accept()") [Aug 2020]. When lock debugging is enabled, the RPC/RDMA server trips over the new lockdep assertion in rdma_accept() because it doesn't call rdma_accept() from its CM event handler. As a temporary fix, have svc_rdma_accept() take the handler_mutex explicitly. In the meantime, let's consider how to restructure the RPC/RDMA transport to invoke rdma_accept() from the proper context. Calls to svc_rdma_accept() are serialized with calls to svc_rdma_free() by the generic RPC server layer. Suggested-by:
Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/linux-rdma/20210209154014.GO4247@nvidia.com/ Fixes: d114c6fe ("RDMA/cma: Add missing locking to rdma_accept()") Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
- Feb 05, 2021
-
-
Chuck Lever authored
Since commit 9ed5af26 ("SUNRPC: Clean up the handling of page padding in rpc_prepare_reply_pages()") [Dec 2020] the NFS client passes payload data to the transport with the padding in xdr->pages instead of in the send buffer's tail kvec. There's no need for the extra logic to advance the base of the tail kvec because the upper layer no longer places XDR padding there. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
The NetApp Linux team discovered that with NFS/RDMA servers that do not support RFC 8797, the Linux client is forming NFSv4.x WRITE requests incorrectly. In this case, the Linux NFS client disables implicit chunk round-up for odd-length Read and Write chunks. The goal was to support old servers that needed that padding to be sent explicitly by clients. In that case the Linux NFS client included the tail kvec in the Read chunk, since the tail contains any needed padding. That meant a separate memory registration was needed for the tail kvec, adding to the cost of forming such requests. To avoid that cost for a mere 3 bytes of zeroes that are always ignored by receivers, we try to use implicit round-up when possible. For NFSv4.x, the tail kvec also sometimes contains a trailing GETATTR operation. The Linux NFS client unintentionally includes that GETATTR operation in the Read chunk as well as inline. The fix is simply to /never/ include the tail kvec when forming a data payload Read chunk. The padding is thus now always present. Note that since commit 9ed5af26 ("SUNRPC: Clean up the handling of page padding in rpc_prepare_reply_pages()") [Dec 2020] the NFS client passes payload data to the transport with the padding in xdr->pages instead of in the send buffer's tail kvec. So now the Linux NFS client appends XDR padding to all odd-sized Read chunks. This shouldn't be a problem because:
- RFC 8166-compliant servers are supposed to work with or without that XDR padding in Read chunks.
- Since the padding is now in the same memory region as the data payload, a separate memory registration is not needed. In addition, the link layer extends data in RDMA Read responses to 4-byte boundaries anyway. Thus there is no longer any savings from leaving the padding out.
Because older kernels include the payload's XDR padding in the tail kvec, a fix there will be more complicated. Thus backporting this patch is not recommended. Reported-by: Olga Kornievskaia <Olga.Kornievskaia@netapp.com> Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Reviewed-by:
Tom Talpey <tom@talpey.com> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
During the final stages of publication of RFC 8167, reviewers requested that we use the term "reverse direction" rather than "backwards direction". Update comments to reflect this preference. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Reviewed-by:
Tom Talpey <tom@talpey.com> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Chuck Lever authored
Clean up so that offset_in_page() is invoked less often in the most common case, which is mapping xdr->pages. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Reviewed-by:
Tom Talpey <tom@talpey.com> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
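A simplified model of this cleanup (not the transport code itself): only the first page of xdr->pages can start mid-page, so the starting offset is computed once and every later iteration begins at offset zero.

```c
/* Only the first page can start mid-page: compute that offset once
 * instead of calling offset_in_page() on every loop iteration.
 * (Illustrative userspace model, not kernel code.) */
#include <stdio.h>

#define PAGE_SIZE 4096u

static unsigned int map_pages(unsigned int page_base, unsigned int len)
{
	unsigned int nsegs = 0;

	while (len) {
		unsigned int seg = PAGE_SIZE - page_base;

		if (seg > len)
			seg = len;
		len -= seg;
		page_base = 0;     /* every later page starts at offset 0 */
		nsegs++;
	}
	return nsegs;
}

int main(void)
{
	printf("segments: %u\n", map_pages(100, 9000));
	return 0;
}
```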
-
Chuck Lever authored
Clean up. Remove a conditional branch from the SGL set-up loop in frwr_map(): Instead of using either sg_set_page() or sg_set_buf(), initialize the mr_page field properly when rpcrdma_convert_kvec() converts the kvec to an SGL entry. frwr_map() can then invoke sg_set_page() unconditionally. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Reviewed-by:
Tom Talpey <tom@talpey.com> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
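A kernel-flavored sketch of the resulting shape (field and function names are illustrative; this is not the actual frwr_map()/rpcrdma_convert_kvec() code and only compiles against kernel headers): once kvec conversion records the backing page, the mapping loop needs only sg_set_page().

```c
#include <linux/mm.h>
#include <linux/scatterlist.h>

/* Illustrative segment descriptor: mr_page is now populated even for
 * kvec-backed segments, so the SGL loop has a single code path. */
struct seg {
	struct page *mr_page;
	unsigned int mr_len;
	unsigned int mr_offset;   /* offset within mr_page */
};

static void convert_kvec_seg(struct seg *seg, void *addr, unsigned int len)
{
	seg->mr_page   = virt_to_page(addr);
	seg->mr_offset = offset_in_page(addr);
	seg->mr_len    = len;
}

static void map_seg(struct scatterlist *sg, const struct seg *seg)
{
	/* one path for both page- and kvec-backed segments:
	 * no sg_set_page()-vs-sg_set_buf() branch */
	sg_set_page(sg, seg->mr_page, seg->mr_len, seg->mr_offset);
}
```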
-
Chuck Lever authored
Support for FMR was removed by commit ba69cd12 ("xprtrdma: Remove support for FMR memory registration") [Dec 2018]. That means the buffer-splitting behavior of rpcrdma_convert_kvec(), added by commit 821c791a ("xprtrdma: Segment head and tail XDR buffers on page boundaries") [Mar 2016], is no longer necessary. FRWR memory registration handles this case with aplomb. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Signed-off-by:
Anna Schumaker <Anna.Schumaker@Netapp.com>
-
- Jan 25, 2021
-
-
Chuck Lever authored
The Receive completion handler doesn't look at the contents of the Receive buffer. The DMA sync isn't terribly expensive but it's one less thing that needs to be done by the Receive completion handler, which is single-threaded (per svc_xprt). This helps scalability. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Reviewed-by:
Christoph Hellwig <hch@lst.de>
-
Chuck Lever authored
This is similar to commit e340c2d6 ("xprtrdma: Reduce the doorbell rate (Receive)") which added Receive batching to the client. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
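Receive batching can be modeled in a few lines of userspace C (illustrative threshold and names, not the svcrdma code): instead of one post per completion, completions accumulate a count and a single chained post, i.e. one doorbell, covers the whole batch.

```c
/* Toy model of Receive batching: accumulate owed Receives and issue a
 * single chained post -- one doorbell -- per batch of completions. */
#include <stdio.h>

#define BATCH 8

struct rq {
	int wanted;      /* Receives owed to the hardware     */
	int doorbells;   /* how many chained posts were made  */
};

static void recv_completed(struct rq *rq)
{
	if (++rq->wanted < BATCH)
		return;                 /* defer: no doorbell yet           */
	rq->doorbells++;                /* post all 'wanted' WRs as a chain */
	rq->wanted = 0;
}

int main(void)
{
	struct rq rq = { 0, 0 };

	for (int i = 0; i < 64; i++)
		recv_completed(&rq);
	printf("64 completions -> %d doorbells\n", rq.doorbells);
	return 0;
}
```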
-
Chuck Lever authored
Clean up. We are not permitted to remove old proc files. Instead, convert these variables to stubs that are only ever allowed to display a value of zero. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
Now that we have an efficient mechanism to update these two stats, let's start maintaining them again. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
Avoid the overhead of a memory bus lock cycle for counting a value that is hardly ever used. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
Receives are frequent events. Avoid the overhead of a memory bus lock cycle for counting a value that is hardly ever used. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
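The trade-off behind these two patches can be illustrated with a small userspace model (not the svcrdma code): a shared atomic counter pays a locked bus cycle on every increment, whereas per-CPU slots are incremented cheaply and only summed on the rare read.

```c
/* Shared atomic counter vs. per-CPU counters, as a userspace model:
 * the per-CPU scheme pays its cost only when the value is read. */
#include <stdatomic.h>
#include <stdio.h>

#define NR_CPUS 4

static _Atomic long shared_count;      /* every increment locks the bus */
static long percpu_count[NR_CPUS];     /* each CPU touches its own slot */

static void count_event(int cpu)
{
	atomic_fetch_add(&shared_count, 1);  /* the old, expensive way */
	percpu_count[cpu]++;                 /* the cheap per-CPU way  */
}

static long percpu_read(void)
{
	long sum = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += percpu_count[cpu];    /* pay the cost only on read */
	return sum;
}

int main(void)
{
	for (int i = 0; i < 1000; i++)
		count_event(i % NR_CPUS);
	printf("atomic=%ld percpu=%ld\n",
	       atomic_load(&shared_count), percpu_read());
	return 0;
}
```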
-
Chuck Lever authored
Setting up the proc variables is about to get more complicated. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
- Dec 14, 2020
-
-
Chuck Lever authored
Olga K. observed that rpcrdma_marshal_req() allocates sparse pages only when it has determined that a Reply chunk is necessary. There are plenty of cases where no Reply chunk is needed, but the XDRBUF_SPARSE_PAGES flag is set. The result would be a crash in rpcrdma_inline_fixup() when it tries to copy parts of the received Reply into a missing page. To avoid crashing, handle sparse page allocation up front. Until XATTR support was added, this issue did not appear often because the only SPARSE_PAGES consumer always expected a reply large enough to always require a Reply chunk. Reported-by:
Olga Kornievskaia <kolga@netapp.com> Signed-off-by:
Chuck Lever <chuck.lever@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Trond Myklebust <trond.myklebust@hammerspace.com>
-
- Dec 02, 2020
-
-
Trond Myklebust authored
According to RFC 5666, the correct netid for an IPv6-addressed RDMA transport is "rdma6", which we've supported as a mount option since Linux-4.7. The problem is that an attempt to load the module "xprtrdma6" will fail, since there is no module alias of that name. Fixes: 181342c5 ("xprtrdma: Add rdma6 option to support NFS/RDMA IPv6") Signed-off-by:
Trond Myklebust <trond.myklebust@hammerspace.com>
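The shape of such a fix is a module alias declaration, sketched below under the assumption that the standard MODULE_ALIAS macro is used (the actual patch may place or spell this differently):

```c
#include <linux/module.h>

/* A request to load "xprtrdma6" (the "rdma6" netid) now resolves to
 * the existing RPC/RDMA client transport module. */
MODULE_ALIAS("xprtrdma6");
```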
-
- Nov 30, 2020
-
-
Chuck Lever authored
An efficient way to handle multiple Read chunks is to post them all together and then take a single completion. This is also how the code is already structured: when the Read completion fires, all portions of the incoming RPC message are available to be assembled. The difficult problem is setting up the Read sink buffers so that the server pulls the client's data into place, making subsequent pull-up unnecessary. There are several cases:
* No Read chunks. No-op.
* One data item Read chunk. This is the fast case, where the inline part of the RPC-over-RDMA message becomes the head and tail, and the data item chunk is placed in buf->pages.
* A Position-zero Read chunk. Treated like TCP: the Read chunk is pulled into contiguous pages.
+ A Position-zero Read chunk with data item chunks. Treated like TCP: all of the Read chunks are pulled into contiguous pages.
+ Multiple data item chunks. Treated like TCP: the inline part is copied and the data item chunks are pulled into contiguous pages.
The "*" cases are already supported. This patch adds support for the "+" cases. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
As a pre-requisite for handling multiple Read chunks in each Read list, convert svc_rdma_recv_read_chunk() to use the new parsed Read chunk list. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
I'm about to change the purpose of ri_chunklen: Instead of tracking the number of bytes in one Read chunk, it will track the total number of bytes in the Read list. Rename it for clarity. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
We already have trace_svcrdma_decode_rseg(), which records each ingress Read segment. Instead of reporting those again when they are about to be posted as RDMA Reads, let's fire one tracepoint before posting each type of chunk. So we'll get:
nfsd-1998 [002] 321.666615: svcrdma_decode_rseg: cq.id=4 cid=42 segno=0 position=0 192@0x013ca9ebfae14000:0xb0010b05
nfsd-1998 [002] 321.666615: svcrdma_decode_rseg: cq.id=4 cid=42 segno=1 position=0 7688@0x013ca9ebf914e000:0xb0010a05
nfsd-1998 [002] 321.666615: svcrdma_decode_rseg: cq.id=4 cid=42 segno=2 position=0 28@0x013ca9ebfae15000:0xb0010905
nfsd-1998 [002] 321.666622: svcrdma_decode_rqst: cq.id=4 cid=42 xid=0x013ca9eb vers=1 credits=128 proc=RDMA_NOMSG hdrlen=100
nfsd-1998 [002] 321.666642: svcrdma_post_read_chunk: cq.id=3 cid=112 sqecount=3
kworker/2:1H-221 [002] 321.673949: svcrdma_wc_read: cq.id=3 cid=112 status=SUCCESS (0/0x0)
Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
Clean up: These pointers are no longer used. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
Refactor svc_rdma_send_reply_chunk() so that it Sends only the parts of rq_res that do not contain a result payload. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
Refactor: svc_rdma_map_reply_msg() is restructured to DMA map only the parts of rq_res that do not contain a result payload. This change has been tested to confirm that it does not cause a regression in the no Write chunk and single Write chunk cases. Multiple Write chunks have not been tested. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
When counting the number of SGEs needed to construct a Send request, do not count result payloads. Likewise, when copying the Reply message into the pull-up buffer, do not copy result payloads into the Send buffer. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
Refactor: Instead of re-parsing the ingress RPC Call transport header when constructing the egress RPC Reply transport header, use the new parsed Write list and Reply chunk, which are version- agnostic and already XDR decoded. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
Refactor: Instead of re-parsing the ingress RPC Call transport header when constructing RDMA Writes, use the new parsed chunk lists for the Write list and Reply chunk, which are version-agnostic and already XDR-decoded. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
Refactor: Don't duplicate header decoding smarts here. Instead, use the new parsed chunk lists. Note that the XID sanity test is also removed. The XID is already looked up by the cb handler, and is rejected if it's not recognized. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
Refactor: Don't duplicate header decoding smarts here. Instead, use the new parsed chunk lists. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-
Chuck Lever authored
This simple data structure binds the location of each data payload inside of an RPC message to the chunk that will be used to push it to or pull it from the client. There are several benefits to this small additional overhead:
* It enables support for more than one chunk in incoming Read and Write lists.
* It translates the version-specific on-the-wire format into a generic in-memory structure, enabling support for multiple versions of the RPC/RDMA transport protocol.
* It enables the server to re-organize a chunk list if it needs to adjust where Read chunk data lands in server memory without altering the contents of the XDR-encoded Receive buffer.
Construction of these lists is done while sanity checking each incoming RPC/RDMA header. Subsequent patches will make use of the generated data structures. Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
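A conceptual sketch of such a parsed chunk list (type and field names are illustrative, not necessarily the kernel's): each chunk records where its payload sits in the RPC message plus the XDR-decoded segments needed to move it.

```c
/* Conceptual sketch only: a flat, version-independent form of the Read
 * list / Write list that later code can consume without re-decoding
 * the transport header. Names here are illustrative. */
#include <linux/types.h>
#include <linux/list.h>

struct parsed_segment {
	u32 handle;                /* R_key for the RDMA transfer        */
	u32 length;                /* bytes addressed by this segment    */
	u64 offset;                /* remote offset of the segment       */
};

struct parsed_chunk {
	struct list_head list;     /* linkage on the chunk list          */
	u32 position;              /* XDR offset of the payload (Reads)  */
	u32 total_length;          /* sum of the segment lengths         */
	unsigned int segcount;     /* entries in segments[]              */
	struct parsed_segment segments[];
};

struct parsed_chunk_list {
	unsigned int chunk_count;
	struct list_head chunks;   /* struct parsed_chunk entries        */
};
```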
-
Chuck Lever authored
Refactor: Match the control flow of svc_rdma_encode_write_list(). Signed-off-by:
Chuck Lever <chuck.lever@oracle.com>
-