  1. Nov 29, 2016
  2. Nov 14, 2016
    • sunrpc: svc_age_temp_xprts_now should not call setsockopt non-tcp transports · ea08e392
      Scott Mayhew authored
      
      
      This fixes the following panic that can occur with NFSoRDMA.
      
      general protection fault: 0000 [#1] SMP
      Modules linked in: rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi
      scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp
      scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm
      mlx5_ib ib_core intel_powerclamp coretemp kvm_intel kvm sg ioatdma
      ipmi_devintf ipmi_ssif dcdbas iTCO_wdt iTCO_vendor_support pcspkr
      irqbypass sb_edac shpchp dca crc32_pclmul ghash_clmulni_intel edac_core
      lpc_ich aesni_intel lrw gf128mul glue_helper ablk_helper mei_me mei
      ipmi_si cryptd wmi ipmi_msghandler acpi_pad acpi_power_meter nfsd
      auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod
      crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper
      syscopyarea sysfillrect sysimgblt ahci fb_sys_fops ttm libahci mlx5_core
      tg3 crct10dif_pclmul drm crct10dif_common
      ptp i2c_core libata crc32c_intel pps_core fjes dm_mirror dm_region_hash
      dm_log dm_mod
      CPU: 1 PID: 120 Comm: kworker/1:1 Not tainted 3.10.0-514.el7.x86_64 #1
      Hardware name: Dell Inc. PowerEdge R320/0KM5PX, BIOS 2.4.2 01/29/2015
      Workqueue: events check_lifetime
      task: ffff88031f506dd0 ti: ffff88031f584000 task.ti: ffff88031f584000
      RIP: 0010:[<ffffffff8168d847>]  [<ffffffff8168d847>]
      _raw_spin_lock_bh+0x17/0x50
      RSP: 0018:ffff88031f587ba8  EFLAGS: 00010206
      RAX: 0000000000020000 RBX: 20041fac02080072 RCX: ffff88031f587fd8
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 20041fac02080072
      RBP: ffff88031f587bb0 R08: 0000000000000008 R09: ffffffff8155be77
      R10: ffff880322a59b00 R11: ffffea000bf39f00 R12: 20041fac02080072
      R13: 000000000000000d R14: ffff8800c4fbd800 R15: 0000000000000001
      FS:  0000000000000000(0000) GS:ffff880322a40000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f3c52d4547e CR3: 00000000019ba000 CR4: 00000000001407e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Stack:
      20041fac02080002 ffff88031f587bd0 ffffffff81557830 20041fac02080002
      ffff88031f587c78 ffff88031f587c40 ffffffff8155ae08 000000010157df32
      0000000800000001 ffff88031f587c20 ffffffff81096acb ffffffff81aa37d0
      Call Trace:
      [<ffffffff81557830>] lock_sock_nested+0x20/0x50
      [<ffffffff8155ae08>] sock_setsockopt+0x78/0x940
      [<ffffffff81096acb>] ? lock_timer_base.isra.33+0x2b/0x50
      [<ffffffff8155397d>] kernel_setsockopt+0x4d/0x50
      [<ffffffffa0386284>] svc_age_temp_xprts_now+0x174/0x1e0 [sunrpc]
      [<ffffffffa03b681d>] nfsd_inetaddr_event+0x9d/0xd0 [nfsd]
      [<ffffffff81691ebc>] notifier_call_chain+0x4c/0x70
      [<ffffffff810b687d>] __blocking_notifier_call_chain+0x4d/0x70
      [<ffffffff810b68b6>] blocking_notifier_call_chain+0x16/0x20
      [<ffffffff815e8538>] __inet_del_ifa+0x168/0x2d0
      [<ffffffff815e8cef>] check_lifetime+0x25f/0x270
      [<ffffffff810a7f3b>] process_one_work+0x17b/0x470
      [<ffffffff810a8d76>] worker_thread+0x126/0x410
      [<ffffffff810a8c50>] ? rescuer_thread+0x460/0x460
      [<ffffffff810b052f>] kthread+0xcf/0xe0
      [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
      [<ffffffff81696418>] ret_from_fork+0x58/0x90
      [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
      Code: ca 75 f1 5d c3 0f 1f 80 00 00 00 00 eb d9 66 0f 1f 44 00 00 0f 1f
      44 00 00 55 48 89 e5 53 48 89 fb e8 7e 04 a0 ff b8 00 00 02 00 <f0> 0f
      c1 03 89 c2 c1 ea 10 66 39 c2 75 03 5b 5d c3 83 e2 fe 0f
      RIP  [<ffffffff8168d847>] _raw_spin_lock_bh+0x17/0x50
      RSP <ffff88031f587ba8>
      
      Signed-off-by: Scott Mayhew <smayhew@redhat.com>
      Fixes: c3d4879e ("sunrpc: Add a function to close temporary transports immediately")
      Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
  3. Nov 10, 2016
  4. Nov 07, 2016
  5. Nov 01, 2016
  6. Oct 28, 2016
    • sunrpc: fix some missing rq_rbuffer assignments · 18e601d6
      Jeff Layton authored
      
      
      We've been seeing some crashes in testing that look like this:
      
      BUG: unable to handle kernel NULL pointer dereference at           (null)
      IP: [<ffffffff8135ce99>] memcpy_orig+0x29/0x110
      PGD 212ca2067 PUD 212ca3067 PMD 0
      Oops: 0002 [#1] SMP
      Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache ppdev parport_pc i2c_piix4 sg parport i2c_core virtio_balloon pcspkr acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod ata_generic pata_acpi virtio_scsi 8139too ata_piix libata 8139cp mii virtio_pci floppy virtio_ring serio_raw virtio
      CPU: 1 PID: 1540 Comm: nfsd Not tainted 4.9.0-rc1 #39
      Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
      task: ffff88020d7ed200 task.stack: ffff880211838000
      RIP: 0010:[<ffffffff8135ce99>]  [<ffffffff8135ce99>] memcpy_orig+0x29/0x110
      RSP: 0018:ffff88021183bdd0  EFLAGS: 00010206
      RAX: 0000000000000000 RBX: ffff88020d7fa000 RCX: 000000f400000000
      RDX: 0000000000000014 RSI: ffff880212927020 RDI: 0000000000000000
      RBP: ffff88021183be30 R08: 01000000ef896996 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff880211704ca8
      R13: ffff88021473f000 R14: 00000000ef896996 R15: ffff880211704800
      FS:  0000000000000000(0000) GS:ffff88021fc80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000000 CR3: 0000000212ca1000 CR4: 00000000000006e0
      Stack:
       ffffffffa01ea087 ffffffff63400001 ffff880215145e00 ffff880211bacd00
       ffff88021473f2b8 0000000000000004 00000000d0679d67 ffff880211bacd00
       ffff88020d7fa000 ffff88021473f000 0000000000000000 ffff88020d7faa30
      Call Trace:
       [<ffffffffa01ea087>] ? svc_tcp_recvfrom+0x5a7/0x790 [sunrpc]
       [<ffffffffa01f84d8>] svc_recv+0xad8/0xbd0 [sunrpc]
       [<ffffffffa0262d5e>] nfsd+0xde/0x160 [nfsd]
       [<ffffffffa0262c80>] ? nfsd_destroy+0x60/0x60 [nfsd]
       [<ffffffff810a9418>] kthread+0xd8/0xf0
       [<ffffffff816dbdbf>] ret_from_fork+0x1f/0x40
       [<ffffffff810a9340>] ? kthread_park+0x60/0x60
      Code: 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe 7c 35 48 83 ea 20 48 83 ea 20 4c 8b 06 4c 8b 4e 08 4c 8b 56 10 4c 8b 5e 18 48 8d 76 20 <4c> 89 07 4c 89 4f 08 4c 89 57 10 4c 89 5f 18 48 8d 7f 20 73 d4
      RIP  [<ffffffff8135ce99>] memcpy_orig+0x29/0x110
       RSP <ffff88021183bdd0>
      CR2: 0000000000000000
      
      Both Bruce and Eryu ran a bisect here and found that the problematic
      patch was 68778945 (SUNRPC: Separate buffer pointers for RPC Call and
      Reply messages).
      
      That patch changed rpc_xdr_encode to use a new rq_rbuffer pointer to
      set up the receive buffer, but didn't change all of the necessary
      codepaths to set it properly. In particular the backchannel setup was
      missing.
      
      We need to set rq_rbuffer whenever rq_buffer is set. Ensure that it is.
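
A minimal userspace sketch of the invariant described above, assuming the Reply buffer sits directly after the Call buffer inside one allocation. `struct rqst_model`, `rqst_set_buffers()` and `rqst_free_buffers()` are hypothetical stand-ins for illustration and are not the kernel's definitions.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the buffer-related fields of struct rpc_rqst. */
struct rqst_model {
    size_t callsize;   /* bytes reserved for the RPC Call */
    size_t rcvsize;    /* bytes reserved for the RPC Reply */
    void *rq_buffer;   /* start of the Call buffer */
    void *rq_rbuffer;  /* start of the Reply buffer */
};

/*
 * Single place that assigns both pointers, so no setup path
 * (a backchannel-style one included) can set one without the other.
 * Returns 0 on success, -1 if the allocation fails.
 */
static int rqst_set_buffers(struct rqst_model *req)
{
    char *buf = malloc(req->callsize + req->rcvsize);

    if (!buf)
        return -1;
    req->rq_buffer = buf;
    req->rq_rbuffer = buf + req->callsize;
    return 0;
}

/* Release the allocation and clear both pointers together. */
static void rqst_free_buffers(struct rqst_model *req)
{
    free(req->rq_buffer);
    req->rq_buffer = NULL;
    req->rq_rbuffer = NULL;
}
```

Routing every assignment through one helper is only one way to keep the two pointers in step; the point of the sketch is that a receive path dereferencing `rq_rbuffer` never sees NULL once `rq_buffer` is set.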
      
      Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Chuck Lever <chuck.lever@oracle.com>
      Reported-by: Eryu Guan <guaneryu@gmail.com>
      Tested-by: Eryu Guan <guaneryu@gmail.com>
      Fixes: 68778945 "SUNRPC: Separate buffer pointers..."
      Reported-by: J. Bruce Fields <bfields@fieldses.org>
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
  7. Oct 26, 2016
    • sunrpc: don't pass on-stack memory to sg_set_buf · 2876a344
      J. Bruce Fields authored
      
      
      As of ac4e97ab "scatterlist: sg_set_buf() argument must be in linear
      mapping", sg_set_buf hits a BUG when make_checksum_v2->xdr_process_buf,
      among other callers, passes it memory on the stack.
      
      We only need a scatterlist to pass this to the crypto code, and it seems
      like overkill to require kmalloc'd memory just to encrypt a few bytes,
      but for now this seems the best fix.
      
      Many of these callers are in the NFS write paths, so we allocate with
      GFP_NOFS.  It might be possible to do without allocations here entirely,
      but that would probably be a bigger project.
      
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
  8. Oct 08, 2016
    • cred: simpler, 1D supplementary groups · 81243eac
      Alexey Dobriyan authored
      Current supplementary groups code can massively overallocate memory and
      is implemented in a way so that access to individual gid is done via 2D
      array.
      
      If number of gids is <= 32, memory allocation is more or less tolerable
      (140/148 bytes).  But if it is not, code allocates full page (!)
      regardless and, what's even more fun, doesn't reuse small 32-entry
      array.
      
      2D array means dependent shifts, loads and LEAs without possibility to
      optimize them (gid is never known at compile time).
      
      All of the above is unnecessary.  Switch to the usual
      trailing-zero-len-array scheme.  Memory is allocated with
      kmalloc/vmalloc() and only as much as needed.  Accesses become simpler
      (LEA 8(gi,idx,4) or even without displacement).
      
      Maximum number of gids is 65536 which translates to 256KB+8 bytes.  I
      think kernel can handle such allocation.
      
      On my usual desktop system with whole 9 (nine) aux groups, struct
      group_info shrinks from 148 bytes to 44 bytes, yay!
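
The size figures above can be checked with a small userspace sketch. It assumes a two-`int` header followed by 4-byte gids and uses a C99 flexible array member in place of the zero-length array the message mentions; `struct group_info_model`, `group_info_size()` and `groups_alloc_model()` are hypothetical names, not the kernel's.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace model of a 1D supplementary group list. */
struct group_info_model {
    int usage;       /* reference count (an atomic_t in the kernel) */
    int ngroups;     /* number of entries in gid[] */
    uint32_t gid[];  /* flexible array member, sized at allocation time */
};

/* Bytes needed for a list holding ngroups gids: header plus 4 bytes each. */
static size_t group_info_size(size_t ngroups)
{
    return sizeof(struct group_info_model) + ngroups * sizeof(uint32_t);
}

/* Allocate exactly as much memory as the list needs. */
static struct group_info_model *groups_alloc_model(int ngroups)
{
    struct group_info_model *gi = malloc(group_info_size((size_t)ngroups));

    if (!gi)
        return NULL;
    gi->usage = 1;
    gi->ngroups = ngroups;
    return gi;
}
```

With 4-byte `int`, the header is 8 bytes, so 9 groups need 8 + 9 * 4 = 44 bytes and 65536 groups need 256KB + 8 bytes, matching the numbers quoted in the message. Element access is a plain `gi->gid[i]`.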
      
      Nice side effects:
      
       - "gi->gid[i]" is shorter than "GROUP_AT(gi, i)", less typing,
      
       - fix little mess in net/ipv4/ping.c
         should have been using GROUP_AT macro but this point becomes moot,
      
       - aux group allocation is persistent and should be accounted as such.
      
      Link: http://lkml.kernel.org/r/20160817201927.GA2096@p183.telecom.by
      
      
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Vasily Kulikov <segoon@openwall.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. Sep 30, 2016
  10. Sep 28, 2016
  11. Sep 27, 2016
    • sunrpc: queue work on system_power_efficient_wq · 77b00bc0
      Ke Wang authored
      
      
      sunrpc uses a workqueue to clean its cache regularly. There is no real
      dependency on executing the work on the cpu that queued it.
      
      On an idle system, especially a heterogeneous system like big.LITTLE,
      it is observed that the idle big cpu was woken up many times just to
      service this work, which goes against the principle of power saving. It
      would be better if we can schedule it on a cpu which the scheduler
      believes to be the most appropriate one.
      
      After applying this patch, system_wq will be replaced by
      system_power_efficient_wq for sunrpc. This functionality is enabled when
      CONFIG_WQ_POWER_EFFICIENT is selected.
      
      Signed-off-by: Ke Wang <ke.wang@spreadtrum.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  12. Sep 23, 2016
    • IB/core: add support to create a unsafe global rkey to ib_create_pd · ed082d36
      Christoph Hellwig authored
      
      
      Instead of exposing ib_get_dma_mr to ULPs and letting them use it more or
      less unchecked, this moves the capability of creating a global rkey into
      the RDMA core, where it can be easily audited.  It also prints a warning
      every time this feature is used.
      
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Reviewed-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
    • svcrdma: support Remote Invalidation · 25d55296
      Chuck Lever authored
      
      
      Support Remote Invalidation. A private message is exchanged with
      the client upon RDMA transport connect that indicates whether
      Send With Invalidation may be used by the server to send RPC
      replies. The invalidate_rkey is arbitrarily chosen from among
      rkeys present in the RPC-over-RDMA header's chunk lists.
      
      Send With Invalidate improves performance only when clients can
      recognize, while processing an RPC reply, that an rkey has already
      been invalidated. That has been submitted as a separate change.
      
      In the future, the RPC-over-RDMA protocol might support Remote
      Invalidation properly. The protocol needs to enable signaling
      between peers to indicate when Remote Invalidation can be used
      for each individual RPC.
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Server-side support for rpcrdma_connect_private · cc9d8340
      Chuck Lever authored
      
      
      Prepare to receive an RDMA-CM private message when handling a new
      connection attempt, and send a similar message as part of connection
      acceptance.
      
      Both sides can communicate their various implementation limits.
      Implementations that don't support this sideband protocol ignore it.
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Skip put_page() when send_reply() fails · 9995237b
      Chuck Lever authored
      
      
      Message from syslogd@klimt at Aug 18 17:00:37 ...
       kernel:page:ffffea0020639b00 count:0 mapcount:0 mapping:          (null) index:0x0
      Aug 18 17:00:37 klimt kernel: flags: 0x2fffff80000000()
      Aug 18 17:00:37 klimt kernel: page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
      
      Aug 18 17:00:37 klimt kernel: kernel BUG at /home/cel/src/linux/linux-2.6/include/linux/mm.h:445!
      Aug 18 17:00:37 klimt kernel: RIP: 0010:[<ffffffffa05c21c1>] svc_rdma_sendto+0x641/0x820 [rpcrdma]
      
      send_reply() assigns its page argument as the first page of ctxt. On
      error, send_reply() already invokes svc_rdma_put_context(ctxt, 1);
      which does a put_page() on that very page. No need to do that again
      as svc_rdma_sendto exits.
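
The double release described above can be modelled in a few lines of userspace C. This is a simplified sketch: `struct page_model`, `put_page_model()`, `send_reply_model()` and `sendto_model()` are hypothetical names, and the underflow flag stands in for the `VM_BUG_ON_PAGE` seen in the log.

```c
#include <stdbool.h>

/* Hypothetical model of a reference-counted page. */
struct page_model {
    int refcount;
    bool underflow;  /* set where the kernel would hit VM_BUG_ON_PAGE */
};

static void put_page_model(struct page_model *p)
{
    if (p->refcount == 0) {
        p->underflow = true;
        return;
    }
    p->refcount--;
}

/* Models send_reply() failing: its error path already drops the page. */
static int send_reply_model(struct page_model *p)
{
    put_page_model(p);
    return -1;
}

/*
 * Models the exit path of the caller. With skip_put_on_error false the
 * page is released a second time; with true the caller leaves it alone.
 */
static void sendto_model(struct page_model *p, bool skip_put_on_error)
{
    int ret = send_reply_model(p);

    if (ret < 0 && !skip_put_on_error)
        put_page_model(p);
}
```

Starting from a reference count of 1, the first variant underflows and the second ends cleanly at 0, which is the behaviour the commit title asks for.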
      
      Fixes: 3e1eeb98 ("svcrdma: Close connection when a send error occurs")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Tail iovec leaves an orphaned DMA mapping · cace564f
      Chuck Lever authored
      
      
      The ctxt's count field is overloaded to mean the number of pages in
      the ctxt->page array and the number of SGEs in the ctxt->sge array.
      Typically these two numbers are the same.
      
      However, when an inline RPC reply is constructed from an xdr_buf
      with a tail iovec, the head and tail often occupy the same page,
      but each are DMA mapped independently. In that case, ->count equals
      the number of pages, but it does not equal the number of SGEs.
      There's one more SGE, for the tail iovec. Hence there is one more
      DMA mapping than there are pages in the ctxt->page array.
      
      This isn't a real problem until the server's iommu is enabled. Then
      each RPC reply that has content in that iovec orphans a DMA mapping
      that consists of real resources.
      
      krb5i and krb5p always populate that tail iovec. After a couple
      million sent krb5i/p RPC replies, the NFS server starts behaving
      erratically. Reboot is needed to clear the problem.
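
A small sketch of the accounting problem, assuming one SGE per page plus one extra SGE whenever a tail iovec shares the head's page. The `mapped_sges` field, `build_reply_model()` and `unmap_model()` are hypothetical names chosen for illustration.

```c
#include <stdbool.h>

/* Hypothetical model of the per-reply context. */
struct ctxt_model {
    int count;        /* number of pages in the page array */
    int mapped_sges;  /* number of DMA-mapped SGEs */
};

/* One SGE per page, plus one more for a tail iovec on the head's page. */
static struct ctxt_model build_reply_model(int pages, bool has_tail)
{
    struct ctxt_model c = {
        .count = pages,
        .mapped_sges = pages + (has_tail ? 1 : 0),
    };
    return c;
}

/* Unmap the first n SGEs; returns how many mappings stay live (orphaned). */
static int unmap_model(const struct ctxt_model *c, int n)
{
    int live = c->mapped_sges;

    for (int i = 0; i < n && live > 0; i++)
        live--;
    return live;
}
```

Unmapping by `count` leaves one mapping behind for every reply with a tail, while unmapping by a separately tracked SGE count leaves none. Tracking the two numbers independently is one plausible remedy; the sketch does not claim to mirror the actual patch.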
      
      Fixes: 9d11b51c ("svcrdma: Fix send_reply() scatter/gather set-up")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • xprtrdma: use complete() instead complete_all() · 5690a22d
      Daniel Wagner authored
      
      
      There is only one waiter for the completion, therefore there
      is no need to use complete_all(). Let's make that clear by
      using complete() instead of complete_all().
      
      The usage pattern of the completion is:
      
      waiter context                          waker context
      
      frwr_op_unmap_sync()
        reinit_completion()
        ib_post_send()
        wait_for_completion()
      
      					frwr_wc_localinv_wake()
      					  complete()
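
The difference between the two calls can be sketched with a single-threaded counter model. This is an assumption-laden simplification: it treats `done` as a counter that `complete()` increments and `complete_all()` saturates, and it replaces the blocking wait with a non-blocking check. All `*_model` names are hypothetical.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical single-threaded model of a completion. */
struct completion_model {
    unsigned int done;
};

static void reinit_completion_model(struct completion_model *x)
{
    x->done = 0;
}

/* Release exactly one waiter. */
static void complete_model(struct completion_model *x)
{
    if (x->done != UINT_MAX)
        x->done++;
}

/* Release every current and future waiter until reinitialised. */
static void complete_all_model(struct completion_model *x)
{
    x->done = UINT_MAX;
}

/* Non-blocking stand-in for a wait: true if a waiter would proceed. */
static bool try_wait_model(struct completion_model *x)
{
    if (!x->done)
        return false;
    if (x->done != UINT_MAX)
        x->done--;  /* a single complete() is consumed by one waiter */
    return true;
}
```

With one waiter per wake-up, as in the pattern above, `complete_model()` is consumed by that waiter and leaves the completion ready for the next `reinit`/`wait` cycle, whereas `complete_all_model()` stays signalled until reinitialised.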
      
      Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Cc: Trond Myklebust <trond.myklebust@primarydata.com>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: linux-nfs@vger.kernel.org
      Cc: netdev@vger.kernel.org
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  13. Sep 22, 2016
  14. Sep 19, 2016
    • sunrpc: fix write space race causing stalls · d48f9ce7
      David Vrabel authored
      
      
      Write space becoming available may race with putting the task to sleep
      in xprt_wait_for_buffer_space().  The existing mechanism to avoid the
      race does not work.
      
      This (edited) partial trace illustrates the problem:
      
         [1] rpc_task_run_action: task:43546@5 ... action=call_transmit
         [2] xs_write_space <-xs_tcp_write_space
         [3] xprt_write_space <-xs_write_space
         [4] rpc_task_sleep: task:43546@5 ...
         [5] xs_write_space <-xs_tcp_write_space
      
      [1] Task 43546 runs but is out of write space.
      
      [2] Space becomes available, xs_write_space() clears the
          SOCKWQ_ASYNC_NOSPACE bit.
      
      [3] xprt_write_space() attempts to wake xprt->snd_task (== 43546), but
          this has not yet been queued and the wake up is lost.
      
      [4] xs_nospace() is called which calls xprt_wait_for_buffer_space()
          which queues task 43546.
      
      [5] The call to sk->sk_write_space() at the end of xs_nospace() (which
          is supposed to handle the above race) does not call
          xprt_write_space() as the SOCKWQ_ASYNC_NOSPACE bit is clear and
          thus the task is not woken.
      
      Fix the race by resetting the SOCKWQ_ASYNC_NOSPACE bit in xs_nospace()
      so the second call to sk->sk_write_space() calls xprt_write_space().
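
The five-step trace can be replayed in a single-threaded model. This sketch assumes write space is still available at step [5] and collapses the socket and transport layers into one struct; `struct xprt_model`, `write_space_model()`, `nospace_model()` and `replay_trace()` are hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical model of the state involved in the race. */
struct xprt_model {
    bool nospace_bit;  /* stands in for SOCKWQ_ASYNC_NOSPACE */
    bool task_queued;  /* task is sleeping on the pending queue */
    bool task_woken;   /* task received its wake-up */
};

/* Write-space callback: only acts if the no-space bit was set. */
static void write_space_model(struct xprt_model *x)
{
    if (!x->nospace_bit)
        return;
    x->nospace_bit = false;
    if (x->task_queued) {
        x->task_queued = false;
        x->task_woken = true;
    }
}

/* Queue the task, optionally re-arm the bit, then re-check write space. */
static void nospace_model(struct xprt_model *x, bool reset_bit)
{
    x->task_queued = true;
    if (reset_bit)
        x->nospace_bit = true;
    write_space_model(x);
}

/* Replays steps [2]..[5]; returns whether the task ends up woken. */
static bool replay_trace(bool reset_bit)
{
    struct xprt_model x = { .nospace_bit = true };

    write_space_model(&x);         /* [2], [3]: bit cleared, wake-up lost */
    nospace_model(&x, reset_bit);  /* [4], [5] */
    return x.task_woken;
}
```

`replay_trace(false)` leaves the task asleep, reproducing the stall, and `replay_trace(true)` wakes it, which is the effect the fix describes.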
      
      Suggested-by: Trond Myklebust <trondmy@primarydata.com>
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
      cc: stable@vger.kernel.org # 4.4
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Eliminate rpcrdma_receive_worker() · 496b77a5
      Chuck Lever authored
      
      
      Clean up: the extra layer of indirection doesn't add value.
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Rename rpcrdma_receive_wc() · 1519e969
      Chuck Lever authored
      
      
      Clean up: When converting xprtrdma to use the new CQ API, I missed a
      spot. The naming convention elsewhere is:
      
        {svc_rdma,rpcrdma}_wc_{operation}
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrmda: Report address of frmr, not mw · eeb30613
      Chuck Lever authored
      
      
      Tie frwr debugging messages together by always reporting the address
      of the frwr.
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Support larger inline thresholds · 44829d02
      Chuck Lever authored
      
      
      The Version One default inline threshold is still 1KB. But allow
      testing with thresholds up to 64KB.
      
      This maximum is somewhat arbitrary. There's no fundamental
      architectural limit I'm aware of, but it's good to keep the size of
      Receive buffers reasonable. Now that Send can use a s/g list, a
      Send buffer is only as large as each RPC requires. Receive buffers
      are always the size of the inline threshold, however.
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Use gathered Send for large inline messages · 655fec69
      Chuck Lever authored
      
      
      An RPC Call message that is sent inline but that has a data payload
      (i.e., one or more items in rq_snd_buf's page list) must be "pulled
      up":
      
      - call_allocate has to reserve enough RPC Call buffer space to
      accommodate the data payload
      
      - call_transmit has to memcopy the rq_snd_buf's page list and tail
      into its head iovec before it is sent
      
      As the inline threshold is increased beyond its current 1KB default,
      however, this means data payloads of more than a few KB are copied
      by the host CPU. For example, if the inline threshold is increased
      just to 4KB, then NFS WRITE requests up to 4KB would involve a
      memcpy of the NFS WRITE's payload data into the RPC Call buffer.
      This is an undesirable amount of participation by the host CPU.
      
      The inline threshold may be much larger than 4KB in the future,
      after negotiation with a peer server.
      
      Instead of copying the components of rq_snd_buf into its head iovec,
      construct a gather list of these components, and send them all in
      place. The same approach is already used in the Linux server's
      RPC-over-RDMA reply path.
      
      This mechanism also eliminates the need for rpcrdma_tail_pullup,
      which is used to manage the XDR pad and trailing inline content when
      a Read list is present.
      
      This requires that the pages in rq_snd_buf's page list be DMA-mapped
      during marshaling, and unmapped when a data-bearing RPC is
      completed. This is slightly less efficient for very small I/O
      payloads, but significantly more efficient as data payload size and
      inline threshold increase past a kilobyte.
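
The contrast between pull-up and a gather list can be sketched in userspace with POSIX `struct iovec`. This is an illustration only: `struct xdr_buf_model` reduces the page list to a single segment, and `pullup_model()` and `gather_model()` are hypothetical names.

```c
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical model of a send buffer: head, one page segment, tail. */
struct xdr_buf_model {
    struct iovec head;
    struct iovec page;
    struct iovec tail;
};

/*
 * Pull-up: copy every component into one contiguous buffer.
 * Returns the number of bytes the CPU copied, or 0 if dst is too small.
 */
static size_t pullup_model(const struct xdr_buf_model *b, char *dst,
                           size_t dst_len)
{
    size_t total = b->head.iov_len + b->page.iov_len + b->tail.iov_len;

    if (total > dst_len)
        return 0;
    memcpy(dst, b->head.iov_base, b->head.iov_len);
    memcpy(dst + b->head.iov_len, b->page.iov_base, b->page.iov_len);
    memcpy(dst + b->head.iov_len + b->page.iov_len,
           b->tail.iov_base, b->tail.iov_len);
    return total;
}

/*
 * Gather: describe the components in place, skipping empty ones.
 * Returns the number of entries written to sge[]; nothing is copied.
 */
static int gather_model(const struct xdr_buf_model *b, struct iovec sge[3])
{
    int n = 0;

    if (b->head.iov_len)
        sge[n++] = b->head;
    if (b->page.iov_len)
        sge[n++] = b->page;
    if (b->tail.iov_len)
        sge[n++] = b->tail;
    return n;
}
```

In the gather case the entries still point at the original memory, so the cost moves from copying payload bytes to describing (and, on real hardware, DMA mapping) each component.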
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Basic support for Remote Invalidation · c8b920bb
      Chuck Lever authored
      
      
      Have frwr's ro_unmap_sync recognize an invalidated rkey that appears
      as part of a Receive completion. Local invalidation can be skipped
      for that rkey.
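
A rough sketch of the skip logic, assuming the Receive completion reports at most one remotely invalidated rkey. `struct mr_model` and `unmap_sync_model()` are hypothetical names, and posting a local invalidation is reduced to incrementing a counter.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a registered memory region. */
struct mr_model {
    uint32_t rkey;
    bool valid;  /* still registered and in need of invalidation */
};

/*
 * Invalidate every region of an RPC. A region whose rkey matches the
 * one reported as remotely invalidated needs no local invalidation.
 * Returns how many local invalidations would have to be posted.
 */
static size_t unmap_sync_model(struct mr_model *mrs, size_t n,
                               bool has_inv, uint32_t inv_rkey)
{
    size_t local = 0;

    for (size_t i = 0; i < n; i++) {
        if (!mrs[i].valid)
            continue;
        if (!(has_inv && mrs[i].rkey == inv_rkey))
            local++;  /* a local invalidate would be posted here */
        mrs[i].valid = false;
    }
    return local;
}
```

For an RPC with two regions, a reply that arrives with one rkey already invalidated saves one of the two local invalidations.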
      
      Use an out-of-band signaling mechanism to indicate to the server
      that the client is prepared to receive RDMA Send With Invalidate.
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Client-side support for rpcrdma_connect_private · 87cfb9a0
      Chuck Lever authored
      
      
      Send an RDMA-CM private message on connect, and look for one during
      a connection-established event.
      
      Both sides can communicate their various implementation limits.
      Implementations that don't support this sideband protocol ignore it.
      
      Once the client knows the server's inline threshold maxima, it can
      adjust the use of Reply chunks, and eliminate most use of Position
      Zero Read chunks. Moderately-sized I/O can be done using a pure
      inline RDMA Send instead of RDMA operations that require memory
      registration.
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Move recv_wr to struct rpcrdma_rep · 6ea8e711
      Chuck Lever authored
      
      
      Clean up: The fields in the recv_wr do not vary. There is no need to
      initialize them before each ib_post_recv(). This removes a large-ish
      data structure from the stack.
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Move send_wr to struct rpcrdma_req · 90aab602
      Chuck Lever authored
      
      
      Clean up: Most of the fields in each send_wr do not vary. There is
      no need to initialize them before each ib_post_send(). This removes
      a large-ish data structure from the stack.
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Simplify rpcrdma_ep_post_recv() · b157380a
      Chuck Lever authored
      
      
      Clean up.
      
      Since commit fc664485 ("xprtrdma: Split the completion queue"),
      rpcrdma_ep_post_recv() no longer uses the "ep" argument.
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Eliminate "ia" argument in rpcrdma_{alloc, free}_regbuf · 13650c23
      Chuck Lever authored
      
      
      Clean up. The "ia" argument is no longer used.
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Delay DMA mapping Send and Receive buffers · 54cbd6b0
      Chuck Lever authored
      
      
      Currently, each regbuf is allocated and DMA mapped at the same time.
      This is done during transport creation.
      
      When a device driver is unloaded, every DMA-mapped buffer in use by
      a transport has to be unmapped, and then remapped to the new
      device if the driver is loaded again. Remapping will have to be done
      _after_ the connect worker has set up the new device.
      
      But there's an ordering problem:
      
      call_allocate, which invokes xprt_rdma_allocate which calls
      rpcrdma_alloc_regbuf to allocate Send buffers, happens _before_
      the connect worker can run to set up the new device.
      
      Instead, at transport creation, allocate each buffer, but leave it
      unmapped. Once the RPC carries these buffers into ->send_request, by
      which time a transport connection should have been established,
      check to see that the RPC's buffers have been DMA mapped. If not,
      map them there.
      
      When device driver unplug support is added, it will simply unmap all
      the transport's regbufs, but it doesn't have to deallocate the
      underlying memory.
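
The allocate-early, map-late scheme can be sketched as follows. This is a userspace approximation: DMA mapping is reduced to a flag and a call counter, and `struct regbuf_model`, `regbuf_alloc_model()`, `regbuf_map_model()`, `regbuf_unmap_model()`, `send_request_model()` and `regbuf_free_model()` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of a registered buffer. */
struct regbuf_model {
    void *data;
    size_t len;
    bool mapped;  /* true once the buffer is DMA mapped */
};

static int map_calls;  /* counts simulated DMA map operations */

/* Transport creation: allocate memory but leave it unmapped. */
static struct regbuf_model *regbuf_alloc_model(size_t len)
{
    struct regbuf_model *rb = malloc(sizeof(*rb));

    if (!rb)
        return NULL;
    rb->data = malloc(len);
    if (!rb->data) {
        free(rb);
        return NULL;
    }
    rb->len = len;
    rb->mapped = false;
    return rb;
}

static void regbuf_map_model(struct regbuf_model *rb)
{
    rb->mapped = true;
    map_calls++;
}

/* Device removal: drop the mapping, keep the memory. */
static void regbuf_unmap_model(struct regbuf_model *rb)
{
    rb->mapped = false;
}

/* Send path: map on first use, when a device is known to exist. */
static void send_request_model(struct regbuf_model *rb)
{
    if (!rb->mapped)
        regbuf_map_model(rb);
    /* the buffer would be posted to the device here */
}

static void regbuf_free_model(struct regbuf_model *rb)
{
    free(rb->data);
    free(rb);
}
```

Repeated sends map the buffer once; after an unmap the next send maps it again against whatever device is current, and the underlying allocation is untouched throughout.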
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Replace DMA_BIDIRECTIONAL · 99ef4db3
      Chuck Lever authored
      
      
      The use of DMA_BIDIRECTIONAL is discouraged by DMA-API.txt.
      Fortunately, xprtrdma now knows which direction I/O is going as
      soon as it allocates each regbuf.
      
      The RPC Call and Reply buffers are no longer the same regbuf. They
      can each be labeled correctly now. The RPC Reply buffer is never
      part of either a Send or Receive WR, but it can be part of a Reply
      chunk, which is mapped and registered via ->ro_map. So it is not
      DMA mapped when it is allocated (DMA_NONE), to avoid a
      double-mapping.
      
      Since Receive buffers are no longer DMA_BIDIRECTIONAL and their
      contents are never modified by the host CPU, DMA-API-HOWTO.txt
      suggests that a DMA sync before posting each buffer should be
      unnecessary. (See my_card_interrupt_handler).
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Use smaller buffers for RPC-over-RDMA headers · 08cf2efd
      Chuck Lever authored
      
      
      Commit 94931746 ("xprtrdma: Limit number of RDMA segments in
      RPC-over-RDMA headers") capped the number of chunks that may appear
      in RPC-over-RDMA headers. The maximum header size can be estimated
      and fixed to avoid allocating buffer space that is never used.
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Initialize separate RPC call and reply buffers · 9c40c49f
      Chuck Lever authored
      
      
      RPC-over-RDMA needs to separate its RPC call and reply buffers.
      
       o When an RPC Call is sent, rq_snd_buf is DMA mapped for an RDMA
         Send operation using DMA_TO_DEVICE
      
       o If the client expects a large RPC reply, it DMA maps rq_rcv_buf
         as part of a Reply chunk using DMA_FROM_DEVICE
      
      The two mappings are for data movement in opposite directions.
      
      DMA-API.txt suggests that if these mappings share a DMA cacheline,
      bad things can happen. This could occur in the final bytes of
      rq_snd_buf and the first bytes of rq_rcv_buf if the two buffers
      happen to share a DMA cacheline.
      
      On x86_64 the cacheline size is typically 64 bytes, and RPC call
      messages are usually much smaller than the send buffer, so this
      hasn't been a noticeable problem. But the DMA cacheline size can be
      larger on other platforms.
      
      Also, often rq_rcv_buf starts most of the way into a page, thus
      an additional RDMA segment is needed to map and register the end of
      that buffer. Try to avoid that scenario to reduce the cost of
      registering and invalidating Reply chunks.
      
      Instead of carrying a single regbuf that covers both rq_snd_buf and
      rq_rcv_buf, each struct rpcrdma_req now carries one regbuf for
      rq_snd_buf and one regbuf for rq_rcv_buf.
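
Whether two buffers share a cacheline is simple address arithmetic, sketched below. The line size is a parameter because it varies by platform; `shares_cacheline()` and `align_up()` are hypothetical helpers, and the addresses used with them are made-up numbers rather than real allocations.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Round addr up to the next multiple of align (align must be non-zero). */
static uintptr_t align_up(uintptr_t addr, size_t align)
{
    return (addr + align - 1) / align * align;
}

/*
 * True if the last byte of buffer A and the first byte of buffer B fall
 * in the same cacheline of the given size. Assumes a_len is non-zero.
 */
static bool shares_cacheline(uintptr_t a_start, size_t a_len,
                             uintptr_t b_start, size_t line)
{
    uintptr_t a_last = a_start + a_len - 1;

    return a_last / line == b_start / line;
}
```

For example, with a 64-byte line, a 100-byte send buffer at 0x1000 followed by a receive buffer at the next 4-byte boundary shares a line with it, while starting the receive buffer at `align_up(0x1000 + 100, 64)` does not. Separate, independently aligned buffers avoid the overlap without padding a shared region.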
      
      Some incidental changes worth noting:
      
      - To clear out some spaghetti, refactor xprt_rdma_allocate.
      - The value stored in rg_size is the same as the value stored in
        the iov.length field, so eliminate rg_size
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • SUNRPC: Add a transport-specific private field in rpc_rqst · 5a6d1db4
      Chuck Lever authored
      
      
      Currently there's a hidden and indirect mechanism for finding the
      rpcrdma_req that goes with an rpc_rqst. It depends on getting from
      the rq_buffer pointer in struct rpc_rqst to the struct
      rpcrdma_regbuf that controls that buffer, and then to the struct
      rpcrdma_req it goes with.
      
      This was done back in the day to avoid the need to add a per-rqst
      pointer or to alter the buf_free API when support for RPC-over-RDMA
      was introduced.
      
      I'm about to change the way regbufs work to support larger inline
      thresholds. Now is a good time to replace this indirect mechanism
      with something that is more straightforward. I guess this should be
      considered a clean up.
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • SUNRPC: Separate buffer pointers for RPC Call and Reply messages · 68778945
      Chuck Lever authored
      
      
      For xprtrdma, the RPC Call and Reply buffers are involved in real
      I/O operations.
      
      To start with, the DMA direction of the I/O for a Call is opposite
      that of a Reply.
      
      In the current arrangement, the Reply buffer address is on a
      four-byte alignment just past the call buffer. It would be friendlier
      on some platforms if that were at a DMA cache alignment instead.
      
      Because the current arrangement allocates a single memory region
      which contains both buffers, the RPC Reply buffer often contains a
      page boundary in it when the Call buffer is large enough (which is
      frequent).
      
      It would be a little nicer for setting up DMA operations (and
      possible registration of the Reply buffer) if the two buffers were
      separated, well-aligned, and contained as few page boundaries as
      possible.
      
      Now, I could just pad out the single memory region used for the pair
      of buffers. But frequently that would mean a lot of unused space to
      ensure the Reply buffer did not have a page boundary.
      
      Add a separate pointer to rpc_rqst that points right to the RPC
      Reply buffer. This makes no difference to xprtsock, but it will help
      xprtrdma in subsequent patches.
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>