  1. Apr 14, 2021
    • SUNRPC: Move fault injection call sites · 7638e0bf
      Chuck Lever authored
      
      
      I've hit some crashes that occur in the xprt_rdma_inject_disconnect
      path. It appears that, for some providers, rdma_disconnect() can
      take so long that the transport can disconnect and release its
      hardware resources while rdma_disconnect() is still running,
      resulting in a UAF in the provider.
      
      The transport's fault injection method may depend on the stability
      of transport data structures. That means it needs to be invoked
      only from contexts that hold the transport write lock.
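
      The change is easiest to see as a locking rule. A toy model of
      that rule, where a mutex stands in for the transport write lock
      and all names are hypothetical (this is not the SUNRPC code):

        #include <pthread.h>
        #include <stdbool.h>

        struct fake_xprt {
            pthread_mutex_t write_lock;   /* stands in for the transport write lock */
            bool resources_valid;         /* cleared when the transport tears down */
            void (*inject_disconnect)(struct fake_xprt *xprt);
        };

        static void fake_inject_fault(struct fake_xprt *xprt)
        {
            /* The hook runs only while the lock is held, so the
             * transport cannot release its hardware resources
             * underneath it. */
            pthread_mutex_lock(&xprt->write_lock);
            if (xprt->resources_valid)
                xprt->inject_disconnect(xprt);
            pthread_mutex_unlock(&xprt->write_lock);
        }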
      
      Fixes: 4a068258 ("SUNRPC: Transport fault injection")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      7638e0bf
  2. Mar 11, 2021
    • svcrdma: Revert "svcrdma: Reduce Receive doorbell rate" · bade4be6
      Chuck Lever authored
      
      
      I tested commit 43042b90 ("svcrdma: Reduce Receive doorbell
      rate") with mlx4 (IB) and software iWARP and didn't find any
      issues. However, I recently got my hardware iWARP setup back
      online (FastLinQ) and it's crashing hard on this commit (confirmed
      via bisect).
      
      The failure mode is complex.
       - After a connection is established, the first Receive completes
         normally.
       - But the second and third Receives have garbage in their Receive
         buffers. The server responds with ERR_VERS as a result.
       - When the client tears down the connection to retry, a couple
         of posted Receives flush twice, and that corrupts the recv_ctxt
         free list.
       - __svc_rdma_free then faults or loops infinitely while destroying
         the xprt's recv_ctxts.
      
      Since 43042b90 ("svcrdma: Reduce Receive doorbell rate") does
      not fix a bug but is a scalability enhancement, it's safe and
      appropriate to revert it while working on a replacement.
      
      Fixes: 43042b90 ("svcrdma: Reduce Receive doorbell rate")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      bade4be6
  3. Feb 05, 2021
    • xprtrdma: Clean up rpcrdma_prepare_readch() · 586a0787
      Chuck Lever authored
      
      
      Since commit 9ed5af26 ("SUNRPC: Clean up the handling of page
      padding in rpc_prepare_reply_pages()") [Dec 2020] the NFS client
      passes payload data to the transport with the padding in xdr->pages
      instead of in the send buffer's tail kvec. There's no need for the
      extra logic to advance the base of the tail kvec because the upper
      layer no longer places XDR padding there.
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      586a0787
    • xprtrdma: Pad optimization, revisited · 2324fbed
      Chuck Lever authored
      
      
      The NetApp Linux team discovered that with NFS/RDMA servers that do
      not support RFC 8797, the Linux client is forming NFSv4.x WRITE
      requests incorrectly.
      
      In this case, the Linux NFS client disables implicit chunk round-up
      for odd-length Read and Write chunks. The goal was to support old
      servers that needed that padding to be sent explicitly by clients.
      
      In that case, the Linux NFS client included the tail kvec in the
      Read chunk, since the tail contains any needed padding. That
      meant a separate memory registration was needed for the tail
      kvec, adding to the cost
      of forming such requests. To avoid that cost for a mere 3 bytes of
      zeroes that are always ignored by receivers, we try to use implicit
      roundup when possible.
      
      For NFSv4.x, the tail kvec also sometimes contains a trailing
      GETATTR operation. The Linux NFS client unintentionally includes
      that GETATTR operation in the Read chunk as well as inline.
      
      The fix is simply to /never/ include the tail kvec when forming a
      data payload Read chunk. The padding is thus now always present.
      
      Note that since commit 9ed5af26 ("SUNRPC: Clean up the handling
      of page padding in rpc_prepare_reply_pages()") [Dec 2020] the NFS
      client passes payload data to the transport with the padding in
      xdr->pages instead of in the send buffer's tail kvec. So now the
      Linux NFS client appends XDR padding to all odd-sized Read chunks.
      This shouldn't be a problem because:
      
       - RFC 8166-compliant servers are supposed to work with or without
         that XDR padding in Read chunks.
      
       - Since the padding is now in the same memory region as the data
         payload, a separate memory registration is not needed. In
         addition, the link layer extends data in RDMA Read responses to
         4-byte boundaries anyway. Thus there is now no savings when the
         padding is not included.
      
      Because older kernels include the payload's XDR padding in the
      tail kvec, a fix there will be more complicated. Thus backporting
      this patch is not recommended.
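
      For background, the round-up in question is XDR's rule that every
      data item is padded out to a 4-byte boundary. A minimal userspace
      sketch of that arithmetic (illustration only, not from the patch):

        #include <stdio.h>

        /* XDR pads every variable-length item to a 4-byte boundary. */
        static unsigned int xdr_pad_bytes(unsigned int len)
        {
            return (4 - (len & 3)) & 3;
        }

        int main(void)
        {
            unsigned int len = 33;    /* an odd-length payload */

            printf("payload %u bytes -> %u pad bytes, %u on the wire\n",
                   len, xdr_pad_bytes(len), len + xdr_pad_bytes(len));
            return 0;
        }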
      
      Reported-by: Olga Kornievskaia <Olga.Kornievskaia@netapp.com>
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Tom Talpey <tom@talpey.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      2324fbed
    • rpcrdma: Fix comments about reverse-direction operation · 84dff5eb
      Chuck Lever authored
      
      
      During the final stages of publication of RFC 8167, reviewers
      requested that we use the term "reverse direction" rather than
      "backwards direction". Update comments to reflect this preference.
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Tom Talpey <tom@talpey.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      84dff5eb
    • xprtrdma: Refactor invocations of offset_in_page() · 67b16625
      Chuck Lever authored
      
      
      Clean up so that offset_in_page() is invoked less often in the
      most common case, which is mapping xdr->pages.
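
      The common case is mapping xdr->pages, where only the first page
      can start at a non-zero offset; every following page starts at
      offset zero. A simplified model of that observation, using
      hypothetical types rather than the xprtrdma ones:

        /* One segment per page; only the first page carries an offset. */
        struct fake_seg {
            void         *page;
            unsigned int  offset;
            unsigned int  len;
        };

        static unsigned int map_pages(struct fake_seg *segs, void **pages,
                                      unsigned int page_base,
                                      unsigned int remaining,
                                      unsigned int page_size)
        {
            unsigned int n = 0;

            while (remaining) {
                unsigned int chunk = page_size - page_base;

                if (chunk > remaining)
                    chunk = remaining;
                segs[n].page   = pages[n];
                segs[n].offset = page_base;
                segs[n].len    = chunk;
                remaining -= chunk;
                page_base = 0;    /* later pages never need offset_in_page() */
                n++;
            }
            return n;
        }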
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Tom Talpey <tom@talpey.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      67b16625
    • xprtrdma: Simplify rpcrdma_convert_kvec() and frwr_map() · 54e6aec5
      Chuck Lever authored
      
      
      Clean up.
      
      Remove a conditional branch from the SGL set-up loop in frwr_map():
      Instead of using either sg_set_page() or sg_set_buf(), initialize
      the mr_page field properly when rpcrdma_convert_kvec() converts the
      kvec to an SGL entry. frwr_map() can then invoke sg_set_page()
      unconditionally.
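
      A simplified before/after model, with illustrative types and
      names (the kernel code fills in the mr_page field and then calls
      sg_set_page() unconditionally, as described above): once the kvec
      converter records a page pointer and intra-page offset for every
      element, the mapping loop needs no page-versus-buffer branch.

        #include <stdint.h>

        #define FAKE_PAGE_SIZE 4096UL

        struct fake_seg {
            void         *mr_page;    /* now valid for kvec segments too */
            unsigned int  mr_offset;
            unsigned int  mr_len;
        };

        /* Converting a kvec: record the containing "page" and the
         * intra-page offset instead of leaving mr_page NULL. */
        static void fake_convert_kvec(struct fake_seg *seg, void *base,
                                      unsigned int len)
        {
            seg->mr_page   = (void *)((uintptr_t)base & ~(FAKE_PAGE_SIZE - 1));
            seg->mr_offset = (uintptr_t)base & (FAKE_PAGE_SIZE - 1);
            seg->mr_len    = len;
        }

        static unsigned int fake_map(const struct fake_seg *segs, unsigned int n)
        {
            unsigned int total = 0;

            for (unsigned int i = 0; i < n; i++) {
                /* one unconditional path; the kernel loop would call
                 * sg_set_page(sg, mr_page, mr_len, mr_offset) here */
                total += segs[i].mr_len;
            }
            return total;
        }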
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Tom Talpey <tom@talpey.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      54e6aec5
    • xprtrdma: Remove FMR support in rpcrdma_convert_iovs() · 9929f4ad
      Chuck Lever authored
      
      
      Support for FMR was removed by commit ba69cd12 ("xprtrdma:
      Remove support for FMR memory registration") [Dec 2018]. That means
      the buffer-splitting behavior of rpcrdma_convert_kvec(), added by
      commit 821c791a ("xprtrdma: Segment head and tail XDR buffers
      on page boundaries") [Mar 2016], is no longer necessary. FRWR
      memory registration handles this case with aplomb.
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      9929f4ad
  4. Dec 14, 2020
    • xprtrdma: Fix XDRBUF_SPARSE_PAGES support · 15261b91
      Chuck Lever authored
      
      
      Olga K. observed that rpcrdma_marshal_req() allocates sparse pages
      only when it has determined that a Reply chunk is necessary. There
      are plenty of cases where no Reply chunk is needed, but the
      XDRBUF_SPARSE_PAGES flag is set. The result would be a crash in
      rpcrdma_inline_fixup() when it tries to copy parts of the received
      Reply into a missing page.
      
      To avoid crashing, handle sparse page allocation up front.
      
      Until XATTR support was added, this issue did not appear often
      because the only SPARSE_PAGES consumer always expected a reply
      large enough to require a Reply chunk.
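
      A rough sketch of the "up front" approach with hypothetical names
      (not the actual xprtrdma code): when the sparse-pages flag is
      set, populate every missing page slot before marshaling begins,
      instead of only on the Reply chunk path.

        #include <stdlib.h>

        #define FAKE_SPARSE_PAGES  0x1
        #define FAKE_PAGE_SIZE     4096
        #define FAKE_MAX_PAGES     16

        struct fake_rqst {
            unsigned int  flags;
            void         *pages[FAKE_MAX_PAGES];
            unsigned int  page_count;   /* <= FAKE_MAX_PAGES */
        };

        static int fake_alloc_sparse_pages(struct fake_rqst *rqst)
        {
            if (!(rqst->flags & FAKE_SPARSE_PAGES))
                return 0;
            for (unsigned int i = 0; i < rqst->page_count; i++) {
                if (rqst->pages[i])
                    continue;
                rqst->pages[i] = calloc(1, FAKE_PAGE_SIZE);
                if (!rqst->pages[i])
                    return -1;    /* the kernel would return -ENOMEM */
            }
            return 0;
        }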
      
      Reported-by: Olga Kornievskaia <kolga@netapp.com>
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      15261b91
  5. Nov 30, 2020
    • svcrdma: support multiple Read chunks per RPC · d7cc7397
      Chuck Lever authored
      
      
      An efficient way to handle multiple Read chunks is to post them all
      together and then take a single completion. This is also how the
      code is already structured: when the Read completion fires, all
      portions of the incoming RPC message are available to be assembled.
      
      The difficult problem is setting up the Read sink buffers so that
      the server pulls the client's data into place, making subsequent
      pull-up unnecessary. There are several cases:
      
      * No Read chunks. No-op.
      
      * One data item Read chunk. This is the fast case, where the inline
        part of the RPC-over-RDMA message becomes the head and tail, and
        the data item chunk is placed in buf->pages.
      
      * A Position-zero Read chunk. Treated like TCP: the Read chunk is
        pulled into contiguous pages.
      
      + A Position-zero Read chunk with data item chunks. Treated like
        TCP: all of the Read chunks are pulled into contiguous pages.
      
      + Multiple data item chunks. Treated like TCP: the inline part is
        copied and the data item chunks are pulled into contiguous pages.
      
      The "*" cases are already supported. This patch adds support for the
      "+" cases.
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      d7cc7397
    • svcrdma: Use the new parsed chunk list when pulling Read chunks · d96962e6
      Chuck Lever authored
      
      
      As a pre-requisite for handling multiple Read chunks in each Read
      list, convert svc_rdma_recv_read_chunk() to use the new parsed Read
      chunk list.
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      d96962e6
    • svcrdma: Rename info::ri_chunklen · bafe9c27
      Chuck Lever authored
      
      
      I'm about to change the purpose of ri_chunklen: Instead of tracking
      the number of bytes in one Read chunk, it will track the total
      number of bytes in the Read list. Rename it for clarity.
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      bafe9c27
    • svcrdma: Clean up chunk tracepoints · b704be09
      Chuck Lever authored
      
      
      We already have trace_svcrdma_decode_rseg(), which records each
      ingress Read segment. Instead of reporting those again when they
      are about to be posted as RDMA Reads, let's fire one tracepoint
      before posting each type of chunk.
      
      So we'll get:
      
              nfsd-1998  [002]   321.666615: svcrdma_decode_rseg:  cq.id=4 cid=42 segno=0 position=0 192@0x013ca9ebfae14000:0xb0010b05
              nfsd-1998  [002]   321.666615: svcrdma_decode_rseg:  cq.id=4 cid=42 segno=1 position=0 7688@0x013ca9ebf914e000:0xb0010a05
              nfsd-1998  [002]   321.666615: svcrdma_decode_rseg:  cq.id=4 cid=42 segno=2 position=0 28@0x013ca9ebfae15000:0xb0010905
              nfsd-1998  [002]   321.666622: svcrdma_decode_rqst:  cq.id=4 cid=42 xid=0x013ca9eb vers=1 credits=128 proc=RDMA_NOMSG hdrlen=100
      
              nfsd-1998  [002]   321.666642: svcrdma_post_read_chunk: cq.id=3 cid=112 sqecount=3
      
      kworker/2:1H-221   [002]   321.673949: svcrdma_wc_read:      cq.id=3 cid=112 status=SUCCESS (0/0x0)
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      b704be09
    • svcrdma: Remove chunk list pointers · 7954c850
      Chuck Lever authored
      
      
      Clean up: These pointers are no longer used.
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      7954c850
    • svcrdma: Support multiple Write chunks in svc_rdma_send_reply_chunk · 41bc163f
      Chuck Lever authored
      
      
      Refactor svc_rdma_send_reply_chunk() so that it Sends only the parts
      of rq_res that do not contain a result payload.
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      41bc163f
    • svcrdma: Support multiple Write chunks in svc_rdma_map_reply_msg() · 2371bcc0
      Chuck Lever authored
      
      
      Refactor: svc_rdma_map_reply_msg() is restructured to DMA map only
      the parts of rq_res that do not contain a result payload.
      
      This change has been tested to confirm that it does not cause a
      regression in the no Write chunk and single Write chunk cases.
      Multiple Write chunks have not been tested.
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      2371bcc0
    • svcrdma: Support multiple write chunks when pulling up · 9d0b09d5
      Chuck Lever authored
      
      
      When counting the number of SGEs needed to construct a Send
      request, do not count result payloads. Likewise, when copying the
      Reply message into the pull-up buffer, do not copy result
      payloads into the Send buffer.
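
      As a small model of the copy rule, assume a hypothetical list of
      (offset, length) ranges marking where result payloads sit in
      rq_res; bytes inside those ranges are skipped because they are
      conveyed by RDMA Write rather than Sent inline (illustration
      only, not the svcrdma code):

        #include <stddef.h>

        struct fake_range {
            size_t offset;
            size_t length;
        };

        static size_t fake_pullup(char *dst, const char *src, size_t len,
                                  const struct fake_range *payloads,
                                  size_t nranges)
        {
            size_t copied = 0;
            size_t pos = 0;

            while (pos < len) {
                const struct fake_range *r = NULL;

                for (size_t i = 0; i < nranges; i++) {
                    if (pos >= payloads[i].offset &&
                        pos < payloads[i].offset + payloads[i].length) {
                        r = &payloads[i];
                        break;
                    }
                }
                if (r)
                    pos = r->offset + r->length;  /* skip the result payload */
                else
                    dst[copied++] = src[pos++];   /* this byte is Sent inline */
            }
            return copied;  /* bytes that end up in the pull-up buffer */
        }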
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      9d0b09d5
    • svcrdma: Use parsed chunk lists to encode Reply transport headers · 6911f3e1
      Chuck Lever authored
      
      
      Refactor: Instead of re-parsing the ingress RPC Call transport
      header when constructing the egress RPC Reply transport header, use
      the new parsed Write list and Reply chunk, which are version-
      agnostic and already XDR decoded.
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      6911f3e1
    • svcrdma: Use parsed chunk lists to construct RDMA Writes · 7a1cbfa1
      Chuck Lever authored
      
      
      Refactor: Instead of re-parsing the ingress RPC Call transport
      header when constructing RDMA Writes, use the new parsed chunk lists
      for the Write list and Reply chunk, which are version-agnostic and
      already XDR-decoded.
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      7a1cbfa1
    • svcrdma: Use parsed chunk lists to detect reverse direction replies · 58b2e0fe
      Chuck Lever authored
      
      
      Refactor: Don't duplicate header decoding smarts here. Instead, use
      the new parsed chunk lists.
      
      Note that the XID sanity test is also removed. The XID is already
      looked up by the cb handler, and is rejected if it's not recognized.
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      58b2e0fe
    • svcrdma: Use parsed chunk lists to derive the inv_rkey · eb3de6a4
      Chuck Lever authored
      
      
      Refactor: Don't duplicate header decoding smarts here. Instead, use
      the new parsed chunk lists.
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      eb3de6a4
    • svcrdma: Add a "parsed chunk list" data structure · 78147ca8
      Chuck Lever authored
      
      
      This simple data structure binds the location of each data payload
      inside of an RPC message to the chunk that will be used to push it
      to or pull it from the client.
      
      There are several benefits to this small additional overhead:
      
       * It enables support for more than one chunk in incoming Read and
         Write lists.
      
       * It translates the version-specific on-the-wire format into a
         generic in-memory structure, enabling support for multiple
         versions of the RPC/RDMA transport protocol.
      
       * It enables the server to re-organize a chunk list if it needs to
         adjust where Read chunk data lands in server memory without
         altering the contents of the XDR-encoded Receive buffer.
      
      Construction of these lists is done while sanity checking each
      incoming RPC/RDMA header. Subsequent patches will make use of the
      generated data structures.
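
      As a rough illustration of such a structure (field and type names
      here are made up for the sketch, not the actual svcrdma
      definitions), each decoded chunk binds a payload's position and
      length in the RPC message to the RDMA segments that move it:

        #include <stdint.h>

        struct example_segment {
            uint32_t handle;      /* R_key to read from or write to */
            uint32_t length;
            uint64_t offset;      /* remote (client) address */
        };

        struct example_chunk {
            uint32_t position;    /* where the payload sits in the RPC message */
            uint32_t total_length;
            uint32_t segcount;
            struct example_segment segments[8];
        };

        struct example_chunk_list {
            unsigned int count;
            struct example_chunk chunks[4];
        };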
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      78147ca8
    • svcrdma: Clean up svc_rdma_encode_reply_chunk() · ded380f1
      Chuck Lever authored
      
      
      Refactor: Match the control flow of svc_rdma_encode_write_list().
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      ded380f1