  1. Feb 06, 2018
    • afs: Support the AFS dynamic root · 4d673da1
      David Howells authored
      
      
      Support the AFS dynamic root which is a pseudo-volume that doesn't connect
      to any server resource, but rather is just a root directory that
      dynamically creates mountpoint directories where the name of such a
      directory is the name of the cell.
      
      Such a mount can be created thus:
      
      	mount -t afs none /afs -o dyn
      
      Dynamic root superblocks aren't shared except by bind mounts and
      propagation.  Cell root volumes can then be mounted by referring to them by
      name, e.g.:
      
      	ls /afs/grand.central.org/
      	ls /afs/.grand.central.org/
      
      The kernel will upcall to consult the DNS if the address wasn't supplied
      directly.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      4d673da1
    • afs: Rearrange afs_select_fileserver() a little · 16280a15
      David Howells authored
      
      
      Rearrange afs_select_fileserver() a little to put the use_server chunk
      before the next_server chunk so that with the removal of a couple of gotos
      the main path through the function is all one sequence.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      16280a15
    • afs: Remove unused code · 63dc4e4a
      David Howells authored
      
      
      Remove some old unused code.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      63dc4e4a
    • afs: Fix server list handling · 45df8462
      David Howells authored
      
      
      Fix server list handling in the following ways:
      
       (1) In afs_alloc_volume(), remove duplicate server list build code.  This
           was already done by afs_alloc_server_list() which afs_alloc_volume()
           previously called.  This just results in twice as many VL RPCs.
      
       (2) In afs_deliver_vl_get_entry_by_name_u(), use the number of server
           records indicated by ->nServers in the UVLDB record returned by the
           VL.GetEntryByNameU RPC call rather than scanning all NMAXNSERVERS
           slots.  Unused slots may contain garbage.
      
       (3) In afs_alloc_server_list(), don't stop converting a UVLDB record into
           a server list just because we can't look up one of the servers.  Just
           skip that server and go on to the next.  If we can't look up any of
           the servers then we'll fail at the end.
      
      Without this patch, an attempt to view the umich.edu root cell using
      something like "ls /afs/umich.edu" on a dynamic root (future patch) mount
      or an autocell mount will result in ENOMEDIUM.  The failure is due to kafs
      not stopping after nServers' worth of records have been read, but then
      trying to access a server with a garbage UUID and getting an error, which
      aborts the server list build.
      
      Fixes: d2ddc776 ("afs: Overhaul volume and server record caching and fileserver rotation")
      Reported-by: Jonathan Billings <jsbillings@jsbillings.org>
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: stable@vger.kernel.org
      45df8462
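Points (2) and (3) of the server list fix above can be sketched in plain C. This is a user-space illustration only, not the kernel code: `struct vldb_record`, `lookup_server()`, `build_server_list()` and the `MAX_SLOTS` value are hypothetical stand-ins invented for the example.

```c
#include <stddef.h>

#define MAX_SLOTS 13	/* illustrative capacity, standing in for NMAXNSERVERS */

/* Hypothetical, simplified stand-in for a UVLDB record. */
struct vldb_record {
	unsigned int nr_servers;	/* stands in for ->nServers */
	int server_ids[MAX_SLOTS];	/* slots past nr_servers may hold garbage */
};

/* Hypothetical lookup: returns 0 on success, -1 if the server is unknown. */
static int lookup_server(int id)
{
	return (id > 0) ? 0 : -1;
}

/*
 * Copy the usable servers into out[].  Only the first nr_servers slots are
 * examined, a failed lookup skips that one server, and the build fails only
 * if no server at all could be looked up.
 */
static int build_server_list(const struct vldb_record *rec, int *out)
{
	unsigned int limit = rec->nr_servers;
	unsigned int i;
	int kept = 0;

	if (limit > MAX_SLOTS)
		limit = MAX_SLOTS;

	for (i = 0; i < limit; i++) {
		if (lookup_server(rec->server_ids[i]) < 0)
			continue;	/* skip this server and go on to the next */
		out[kept++] = rec->server_ids[i];
	}

	return kept > 0 ? kept : -1;
}
```

Bounding the loop by the record's own count is what keeps the garbage in the unused slots from ever reaching the lookup.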
    • afs: Need to clear responded flag in addr cursor · 8305e579
      David Howells authored
      
      
      In afs_select_fileserver(), we need to clear the ->responded flag in the
      address list when reusing it.  We should also clear it in
      afs_select_current_fileserver().
      
      To this end, just memset() the object before initialising it.
      
      Fixes: d2ddc776 ("afs: Overhaul volume and server record caching and fileserver rotation")
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: stable@vger.kernel.org
      8305e579
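The "memset() the object before initialising it" approach from the commit above can be illustrated as follows. The structure is a cut-down, hypothetical cursor; only the ->responded and ->begun names come from these commit messages, and `cursor_init()` is an invented helper.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical, cut-down address cursor for illustration only. */
struct addr_cursor {
	const void *alist;	/* address list being iterated */
	unsigned int index;	/* current position in the list */
	bool begun;
	bool responded;
};

/*
 * Zero the whole cursor before filling it in, so that no flag set during a
 * previous use (such as responded) can leak into the next use.
 */
static void cursor_init(struct addr_cursor *ac, const void *alist)
{
	memset(ac, 0, sizeof(*ac));
	ac->alist = alist;
}
```

Clearing the whole object means that a field added later is reset too, without anyone having to remember to clear it by hand.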
    • afs: Fix missing cursor clearance · fe4d774c
      David Howells authored
      
      
      afs_select_fileserver() ends the address cursor it is using in the case in
      which we get some sort of network error and run out of addresses to iterate
      through, before it jumps to try the next server.  This also needs to be
      done when the server aborts with some sort of error that means we should
      try the next server.
      
      Fix this by:
      
       (1) Move the iterate_address afs_end_cursor() call to the next_server
           case.
      
       (2) End the cursor in the failed case.
      
       (3) Make afs_end_cursor() clear the ->begun flag and ->addr pointer in the
           address cursor.
      
       (4) Make afs_end_cursor() able to be called on an already cleared cursor.
      
      Without this, something like the following oops may occur:
      
      	AFS: Assertion failed
      	18446612134397189888 == 0 is false
      	0xffff88007c279f00 == 0x0 is false
      	------------[ cut here ]------------
      	kernel BUG at fs/afs/rotate.c:360!
      	RIP: 0010:afs_select_fileserver+0x79b/0xa30 [kafs]
      	Call Trace:
      	 afs_statfs+0xcc/0x180 [kafs]
      	 ? p9_client_statfs+0x9e/0x110 [9pnet]
      	 ? _cond_resched+0x19/0x40
      	 statfs_by_dentry+0x6d/0x90
      	 vfs_statfs+0x1b/0xc0
      	 user_statfs+0x4b/0x80
      	 SYSC_statfs+0x15/0x30
      	 SyS_statfs+0xe/0x10
      	 entry_SYSCALL_64_fastpath+0x20/0x83
      
      Fixes: d2ddc776 ("afs: Overhaul volume and server record caching and fileserver rotation")
      Reported-by: Marc Dionne <marc.dionne@auristor.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: stable@vger.kernel.org
      fe4d774c
    • afs: Add missing afs_put_cell() · e4415015
      David Howells authored
      
      
      afs_alloc_volume() needs to release the cell ref it obtained in the case of
      an error.  Fix this by adding an afs_put_cell() call into the error path.
      
      This can be triggered when a lookup for a cell in a dynamic root or an
      autocell mount returns an error whilst trying to look up the server (such
      as ENOMEDIUM).  This results in an assertion failure oops when the module
      is unloaded due to outstanding refs on a cell record.
      
      Fixes: d2ddc776 ("afs: Overhaul volume and server record caching and fileserver rotation")
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: stable@vger.kernel.org
      e4415015
  2. Jan 29, 2018
    • afs: convert to new i_version API · a01179e6
      Jeff Layton authored
      
      
      For AFS, it's generally treated as an opaque value, so we use the
      *_raw variants of the API here.
      
      Note that AFS has quite a different definition for this counter. AFS
      only increments it on changes to the data in regular files
      and contents of the directories. Inode metadata changes do not result
      in a version increment.
      
      We'll need to reconcile that somehow if we ever want to present this to
      userspace via statx.
      
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      a01179e6
  3. Jan 02, 2018
  4. Dec 01, 2017
    • afs: Properly reset afs_vnode (inode) fields · f8de483e
      David Howells authored
      
      
      When an AFS inode is allocated by afs_alloc_inode(), the allocated
      afs_vnode struct isn't necessarily reset from the last time it was used as
      an inode because the slab constructor is only invoked once when the memory
      is obtained from the page allocator.
      
      This means that information can leak from one inode to the next because
      we're not calling kmem_cache_zalloc().  Some of the information isn't
      reset, in particular the permit cache pointer.
      
      Bring the clearances up to date.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-by: Marc Dionne <marc.dionne@auristor.com>
      f8de483e
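The constructor-runs-once behaviour described in the commit above can be modelled in user space. This is a toy single-slot pool, not the kernel slab allocator: `struct vnode_obj`, `ctor()`, `vnode_alloc()` and `vnode_free()` are hypothetical names, and the split between constructor-owned and per-use fields is an assumption made for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical object with one constructor-owned field and two per-use fields. */
struct vnode_obj {
	int lock_initialised;	/* set once by the constructor, survives reuse */
	void *permit_cache;	/* per-use state: must be reset on every allocation */
	unsigned long flags;	/* per-use state */
};

/* Toy pool holding a single slot that is recycled between users. */
static struct vnode_obj slot;
static bool slot_constructed;

/* Runs only once, when the memory is first obtained. */
static void ctor(struct vnode_obj *v)
{
	v->lock_initialised = 1;
}

static struct vnode_obj *vnode_alloc(void)
{
	if (!slot_constructed) {
		ctor(&slot);
		slot_constructed = true;
	}

	/* The memory is not zeroed on reuse, so clear the per-use fields here. */
	slot.permit_cache = NULL;
	slot.flags = 0;
	return &slot;
}

/* Freeing leaves the contents untouched, as a slab cache would. */
static void vnode_free(struct vnode_obj *v)
{
	(void)v;
}
```

Without the two resets in `vnode_alloc()`, the second user of the slot would inherit the first user's `permit_cache` pointer, which is the kind of leak the commit describes.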
    • afs: Fix permit refcounting · 1bcab125
      David Howells authored
      
      
      Fix four refcount bugs in afs_cache_permit():
      
       (1) When checking the result of the kzalloc(), we can't just return, but
           must put 'permits'.
      
       (2) We shouldn't put permits immediately after hashing a new permit as we
           need to keep the pointer stable so that we can check to see if
           vnode->permit_cache has changed before we decide whether to assign to
           it.
      
       (3) 'permits' is being put twice.
      
       (4) We need to put either the replacement or the thing replaced after the
           assignment to vnode->permit_cache.
      
      Without this, lots of the following are seen:
      
        Kernel BUG at ffffffffa039857b [verbose debug info unavailable]
        ------------[ cut here ]------------
        Kernel BUG at ffffffffa039858a [verbose debug info unavailable]
        ------------[ cut here ]------------
      
      The addresses are in the .text..refcount section of the kafs.ko module.
      Following the relocation records for the __ex_table section shows one to be
      due to the decrement in afs_put_permits() and the other to be key_get() in
      afs_cache_permit().
      
      Occasionally, the following is seen:
      
        refcount_t overflow at afs_cache_permit+0x57d/0x5c0 [kafs] in cc1[562], uid/euid: 0/0
        WARNING: CPU: 0 PID: 562 at kernel/panic.c:657 refcount_error_report+0x9c/0xac
        ...
      
      Reported-by: Marc Dionne <marc.dionne@auristor.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-by: Marc Dionne <marc.dionne@auristor.com>
      1bcab125
  5. Nov 27, 2017
    • Rename superblock flags (MS_xyz -> SB_xyz) · 1751e8a6
      Linus Torvalds authored
      This is a pure automated search-and-replace of the internal kernel
      superblock flags.
      
      The s_flags are now called SB_*, with the names and the values for the
      moment mirroring the MS_* flags that they're equivalent to.
      
      Note how the MS_xyz flags are the ones passed to the mount system call,
      while the SB_xyz flags are what we then use in sb->s_flags.
      
      The script to do this was:
      
          # places to look in; re security/*: it generally should *not* be
          # touched (that stuff parses mount(2) arguments directly), but
          # there are two places where we really deal with superblock flags.
          FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
                  include/linux/fs.h include/uapi/linux/bfs_fs.h \
                  security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
          # the list of MS_... constants
          SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
                DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
                POSIXACL UNBIND...
      1751e8a6
  6. Nov 24, 2017
    • afs: remove redundant assignment of dvnode to itself · 43dd388b
      Colin Ian King authored
      
      
      The assignment of dvnode to itself is redundant and can be removed.
      Cleans up warning detected by cppcheck:
      
      fs/afs/dir.c:975: (warning) Redundant assignment of 'dvnode' to itself.
      
      Fixes: d2ddc776 ("afs: Overhaul volume and server record caching and fileserver rotation")
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      43dd388b
    • afs: cell: Remove unnecessary code in afs_lookup_cell · 68327951
      Gustavo A. R. Silva authored
      Due to recent changes this piece of code is no longer needed.
      
      Addresses-Coverity-ID: 1462033
      Link: https://lkml.kernel.org/r/4923.1510957307@warthog.procyon.org.uk
      
      
      Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      68327951
    • afs: Fix signal handling in some file ops · 4433b691
      David Howells authored
      
      
      afs_mkdir(), afs_create(), afs_link() and afs_symlink() all need to drop
      the target dentry if a signal causes the operation to be killed immediately
      before we try to contact the server.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      4433b691
    • afs: Fix some dentry handling in dir ops and missing key_puts · bc1527dc
      David Howells authored
      
      
      Fix some of the dentry handling in AFS directory ops:
      
       (1) Do d_drop() on the new_dentry before assigning a new inode to it in
           afs_vnode_new_inode().  It's fine to do this before calling afs_iget()
           because the operation has taken place on the server.
      
       (2) Replace d_instantiate()/d_rehash() with d_add().
      
       (3) Don't d_drop() the new_dentry in afs_rename() on error.
      
      Also fix afs_link() and afs_rename() to call key_put() on all error paths
      where the key is taken.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      bc1527dc
    • afs: Make afs_write_begin() avoid writing to a page that's being stored · 5a039c32
      David Howells authored
      
      
      Make afs_write_begin() wait for a page that's marked PG_writeback because:
      
       (1) We need to avoid interference with the data being stored so that the
           data on the server ends up in a defined state.
      
       (2) page->private is used to track the window of dirty data within a page,
           but it's also used by the storage code to track what's being written,
           being cleared by the completion notification.  Ownership can't be
           relinquished by the storage code until completion because if a store
           fails, the data must be remarked dirty.
      
      Tracing shows something like the following (edited):
      
       x86_64-linux-gn-15940 [1] afs_page_dirty: vn=ffff8800bef33800 9c75 begin 0-125
          kworker/u8:3-114   [2] afs_page_dirty: vn=ffff8800bef33800 9c75 store+ 0-125
       x86_64-linux-gn-15940 [1] afs_page_dirty: vn=ffff8800bef33800 9c75 begin 0-2052
          kworker/u8:3-114   [2] afs_page_dirty: vn=ffff8800bef33800 9c75 clear 0-2052
          kworker/u8:3-114   [2] afs_page_dirty: vn=ffff8800bef33800 9c75 store 0-0
          kworker/u8:3-114   [2] afs_page_dirty: vn=ffff8800bef33800 9c75 WARN 0-0
      
      The clear (completion) corresponding to the store+ (store continuation from
      a previous page) happens between the second begin (afs_write_begin) and the
      store corresponding to that.  This results in the second store not seeing
      any data to write back, leading to the following warning:
      
      WARNING: CPU: 2 PID: 114 at ../fs/afs/write.c:403 afs_write_back_from_locked_page+0x19d/0x76c [kafs]
      Modules linked in: kafs(E)
      CPU: 2 PID: 114 Comm: kworker/u8:3 Tainted: G            E   4.14.0-fscache+ #242
      Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
      Workqueue: writeback wb_workfn (flush-afs-2)
      task: ffff8800cad72600 task.stack: ffff8800cad44000
      RIP: 0010:afs_write_back_from_locked_page+0x19d/0x76c [kafs]
      RSP: 0018:ffff8800cad47aa0 EFLAGS: 00010246
      RAX: 0000000000000001 RBX: ffff8800bef33a20 RCX: 0000000000000000
      RDX: 000000000000000f RSI: ffffffff81c5d0e0 RDI: ffff8800cad72e78
      RBP: ffff8800d31ea1e8 R08: ffff8800c1358000 R09: ffff8800ca00e400
      R10: ffff8800cad47a38 R11: ffff8800c5d9e400 R12: 0000000000000000
      R13: ffffea0002d9df00 R14: ffffffffa0023c1c R15: 0000000000007fdf
      FS:  0000000000000000(0000) GS:ffff8800ca700000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f85ac6c4000 CR3: 0000000001c10001 CR4: 00000000001606e0
      Call Trace:
       ? clear_page_dirty_for_io+0x23a/0x267
       afs_writepages_region+0x1be/0x286 [kafs]
       afs_writepages+0x60/0x127 [kafs]
       do_writepages+0x36/0x70
       __writeback_single_inode+0x12f/0x635
       writeback_sb_inodes+0x2cc/0x452
       __writeback_inodes_wb+0x68/0x9f
       wb_writeback+0x208/0x470
       ? wb_workfn+0x22b/0x565
       wb_workfn+0x22b/0x565
       ? worker_thread+0x230/0x2ac
       process_one_work+0x2cc/0x517
       ? worker_thread+0x230/0x2ac
       worker_thread+0x1d4/0x2ac
       ? rescuer_thread+0x29b/0x29b
       kthread+0x15d/0x165
       ? kthread_create_on_node+0x3f/0x3f
       ? call_usermodehelper_exec_async+0x118/0x11f
       ret_from_fork+0x24/0x30
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      5a039c32
  7. Nov 17, 2017
    • afs: Fix file locking · 0fafdc9f
      David Howells authored
      
      
      Fix the AFS file locking whereby the use of the big kernel lock (which
      could be slept with) was replaced by a spinlock (which couldn't).  The
      problem is that the AFS code was doing stuff inside the critical section
      that might call schedule(), so this is a broken transformation.
      
      Fix this by the following means:
      
       (1) Use a state machine with a proper state that can only be changed under
           the spinlock rather than using a collection of bit flags.
      
       (2) Cache the key used for the lock and the lock type in the afs_vnode
           struct so that the manager work function doesn't have to refer to a
           file_lock struct that's been dequeued.  This makes signal handling
           safer.
      
       (3) Move the unlock from afs_do_unlk() to afs_fl_release_private() which
           means that unlock is achieved in other circumstances too.

       (4) Unlock the file on the server before taking the next conflicting lock.
      
      Also change:
      
       (1) Check the permits on a file before actually trying the lock.
      
       (2) fsync the file before effecting an explicit unlock operation.  We
           don't fsync if the lock is erased otherwise as we might not be in a
           context where we can actually do that.
      
      Further fixes:
      
       (1) Fixed-fileserver address rotation is made to work.  It's only used by
           the locking functions, so couldn't be tested before.
      
      Fixes: 72f98e72 ("locks: turn lock_flocks into a spinlock")
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: jlayton@redhat.com
      0fafdc9f
  8. Nov 16, 2017
  9. Nov 13, 2017
    • afs: Protect call->state changes against signals · 98bf40cd
      David Howells authored
      
      
      Protect call->state changes against the call being prematurely terminated
      due to a signal.
      
      What can happen is that a signal causes afs_wait_for_call_to_complete() to
      abort an afs_call because it's not yet complete whilst afs_deliver_to_call()
      is delivering data to that call.
      
      If the data delivery causes the state to change, this may overwrite the state
      of the afs_call, making it not-yet-complete again - but no further
      notifications will be forthcoming from AF_RXRPC as the rxrpc call has been
      aborted and completed, so kAFS will just hang in various places waiting for
      that call or on page bits that need clearing by that call.
      
      A tracepoint to monitor call state changes is also provided.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      98bf40cd
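The race described in the commit above, where data delivery overwrites the state of a call that a signal has already aborted, can be illustrated with guarded state transitions. This sketch uses C11 atomics and a compare-and-swap purely for illustration; the commit message does not say which mechanism the patch uses, and `struct call`, `call_set_state()`, `call_abort()` and the state names are all hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, reduced set of call states. */
enum call_state { CALL_REQUESTING, CALL_AWAIT_REPLY, CALL_COMPLETE };

struct call {
	_Atomic int state;
};

/*
 * Move the call from one state to another only if it is still in the state
 * the caller expects.  A call that has been aborted in the meantime is left
 * alone, so delivery cannot make it "not yet complete" again.
 */
static bool call_set_state(struct call *c, enum call_state from,
			   enum call_state to)
{
	int expected = from;

	return atomic_compare_exchange_strong(&c->state, &expected, (int)to);
}

/* Abort path, for example on a signal: the call becomes complete for good. */
static void call_abort(struct call *c)
{
	atomic_store(&c->state, (int)CALL_COMPLETE);
}
```

The point is that every transition names the state it expects to leave, so a late transition from the delivery path fails harmlessly instead of resurrecting the call.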
    • afs: Trace page dirty/clean · 13524ab3
      David Howells authored
      
      
      Add a trace event that logs the dirtying and cleaning of pages attached to
      AFS inodes.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      13524ab3
    • afs: Implement shared-writeable mmap · 1cf7a151
      David Howells authored
      
      
      Implement shared-writeable mmap for AFS.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      1cf7a151
    • afs: Get rid of the afs_writeback record · 4343d008
      David Howells authored
      
      
      Get rid of the afs_writeback record that kAFS is using to match keys with
      writes made by that key.
      
      Instead, keep a list of keys that have a file open for writing and/or
      sync'ing and iterate through those.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      4343d008
    • afs: Introduce a file-private data record · 215804a9
      David Howells authored
      
      
      Introduce a file-private data record for kAFS and put the key into it
      rather than storing the key in file->private_data.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      215804a9
    • afs: Use a dynamic port if 7001 is in use · 83732ec5
      Marc Dionne authored
      
      
      It is not required that the afs client operate on port 7001.
      The port could be in use because another kernel or userspace
      client has already bound to it.
      
      If the port is in use, just fall back to using a dynamic port.
      
      Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      83732ec5
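The fallback in the commit above has a direct user-space analogue with ordinary POSIX sockets. The sketch below is not the kernel change (which works through AF_RXRPC); it only shows the same idea on a UDP socket, and `open_client_socket()` is a hypothetical helper.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Open a UDP socket bound to preferred_port if possible.  If that port is
 * already taken, bind to port 0 instead so that the system picks a free
 * dynamic port.  Returns the socket descriptor, or -1 on failure.
 */
static int open_client_socket(unsigned short preferred_port)
{
	struct sockaddr_in addr;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(preferred_port);

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
		return fd;

	if (errno == EADDRINUSE) {
		addr.sin_port = htons(0);	/* let the system choose a port */
		if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
			return fd;
	}

	close(fd);
	return -1;
}
```

Calling `open_client_socket(7001)` twice in one process exercises both paths: the second call finds the port taken and ends up on a dynamic port.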
    • afs: Fix directory read/modify race · dab17c1a
      David Howells authored
      
      
      Because parsing of the directory wasn't being done under any sort of lock,
      the pages holding the directory content can get invalidated whilst the
      parsing is ongoing.
      
      Further, the directory page check function gets called outside of the page
      lock, so if the page gets cleared or updated, this may return reports of
      bad magic numbers in the directory page.
      
      Also, the directory may change size whilst checking and parsing are
      ongoing, so more care needs to be taken here.
      
      Fix this by:
      
       (1) Perform the page check from the page filling function before we set
           PageUptodate and drop the page lock.
      
       (2) Check for the file having shrunk and the page having been abandoned
           before checking the page contents.
      
       (3) Lock the page whilst parsing it for the directory iterator.
      
      Whilst we're at it, add a tracepoint to report check failure.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      dab17c1a
    • afs: Trace the sending of pages · 2c099014
      David Howells authored
      
      
      Add a pair of tracepoints to log the sending of pages for an FS.StoreData
      or FS.StoreData64 operation.
      
      Tracepoint afs_send_pages notes each set of pages added to the operation.
      There may be several of these per operation as we get at most 8
      contiguous pages in one go because the bvec we're using is on the stack.
      
      Tracepoint afs_sent_pages notes the end of adding data from a whole run of
      pages to the operation and the completion of the request phase.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      2c099014
    • afs: Trace the initiation and completion of client calls · 025db80c
      David Howells authored
      
      
      Add tracepoints to trace the initiation and completion of client calls
      within the kafs filesystem.
      
      The afs_make_vl_call tracepoint watches calls to the volume location
      database server.
      
      The afs_make_fs_call tracepoint watches calls to the file server.
      
      The afs_call_done tracepoint watches for call completion.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      025db80c
    • afs: Fix total-length calculation for multiple-page send · 1199db60
      David Howells authored
      
      
      Fix the total-length calculation in afs_make_call() when the operation
      being dispatched has data from a series of pages attached.
      
      Despite the patched code looking like it should reduce mathematically
      to the current code, it doesn't because the 32-bit unsigned arithmetic
      being used to calculate the page-offset-difference doesn't correctly extend
      to a 64-bit value when the result is effectively negative.
      
      Without this, some FS.StoreData operations that span multiple pages fail,
      reporting too little or too much data.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      1199db60
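The arithmetic pitfall in the commit above can be reproduced in a few lines of standard C. The two functions below are illustrative reconstructions, not the kernel code: the names, the parameter layout and the `PAGE_SZ` value are assumptions. They compute the number of bytes covered by a run of pages that starts at `first_offset` in the first page and ends at `last_to` in the last page.

```c
#include <stdint.h>

#define PAGE_SZ 4096u	/* illustrative page size */

/*
 * Combined form: the 32-bit difference is computed first and wraps around
 * when last_to < first_offset.  It is then zero-extended, so a huge positive
 * number is added instead of a small negative one.
 */
static int64_t total_len_wrapping(uint32_t first_offset, uint32_t last_to,
				  uint64_t nr_pages)
{
	int64_t total = (int64_t)(nr_pages - 1) * PAGE_SZ;

	total += last_to - first_offset;	/* uint32_t result, may wrap */
	return total;
}

/*
 * Stepwise form: every term that is added is non-negative, so nothing wraps.
 * Assumes first_offset <= PAGE_SZ and nr_pages >= 1.
 */
static int64_t total_len_stepwise(uint32_t first_offset, uint32_t last_to,
				  uint64_t nr_pages)
{
	int64_t total;

	if (nr_pages == 1)
		return (int64_t)last_to - first_offset;

	total = PAGE_SZ - first_offset;		/* tail of the first page */
	total += last_to;			/* head of the last page */
	total += (int64_t)(nr_pages - 2) * PAGE_SZ;	/* whole pages between */
	return total;
}
```

For two pages with `first_offset = 4000` and `last_to = 100`, the stepwise form gives 196 bytes, while the combined form yields a value above 4 GiB; the two agree whenever `last_to >= first_offset`.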
    • afs: Only progress call state at end of Tx phase from rxrpc callback · 5f0fc8ba
      David Howells authored
      
      
      Only progress the AFS call state at the end of Tx phase from the callback
      passed to rxrpc_kernel_send_data() rather than setting it before the last
      data send call.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      5f0fc8ba
    • afs: Make use of the YFS service upgrade to fully support IPv6 · bf99a53c
      David Howells authored
      
      
      YFS VL servers offer an upgraded Volume Location service that can return
      IPv6 addresses to fileservers and volume servers in addition to IPv4
      addresses using the YFSVL.GetEndpoints operation which we should use if
      it's available.
      
      To this end:
      
       (1) Make rxrpc_kernel_recv_data() return the call's current service ID so
           that the caller can detect service upgrade and see what the service
           was upgraded to.
      
       (2) When we see a VL server address we haven't seen before, send a
           VL.GetCapabilities operation to it with the service upgrade bit set.
      
           If we get an upgrade to the YFS VL service, change the service ID in
           the address list for that address to use the upgraded service and set
           a flag to note that this appears to be a YFS-compatible server.
      
       (3) If, when a server's addresses are being looked up, we note that we
           previously detected a YFS-compatible server, then send the
           YFSVL.GetEndpoints operation rather than VL.GetAddrsU.
      
       (4) Build a fileserver address list from the reply of YFSVL.GetEndpoints,
           including both IPv4 and IPv6 addresses.  Volume server addresses are
           discarded.
      
       (5) The address list is sorted by address and port now, instead of just
           address.  This allows multiple servers on the same host sitting on
           different ports.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      bf99a53c
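Point (5) of the commit above, sorting by address and then by port, amounts to a two-key comparison. A minimal sketch using the standard `qsort()` follows; `struct endpoint` and its 16-byte address layout are assumptions made for the example and are not the kernel's address list format.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical endpoint record: raw address bytes plus a port. */
struct endpoint {
	uint8_t addr[16];	/* address bytes in network order */
	uint16_t port;		/* port in host order */
};

/* Order by address first; fall back to the port only when addresses match. */
static int endpoint_cmp(const void *a, const void *b)
{
	const struct endpoint *x = a;
	const struct endpoint *y = b;
	int diff = memcmp(x->addr, y->addr, sizeof(x->addr));

	if (diff != 0)
		return diff;
	return (x->port > y->port) - (x->port < y->port);
}

static void sort_endpoints(struct endpoint *list, size_t n)
{
	qsort(list, n, sizeof(*list), endpoint_cmp);
}
```

With the port as a secondary key, two servers on the same host but on different ports remain distinct entries with a stable relative order.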
    • afs: Overhaul volume and server record caching and fileserver rotation · d2ddc776
      David Howells authored
      
      
      The current code assumes that volumes and servers are per-cell and are
      never shared, but this is not enforced, and, indeed, public cells do exist
      that are aliases of each other.  Further, an organisation can, say, set up
      a public cell and a private cell with overlapping, but not identical, sets
      of servers.  The difference is purely in the database attached to the VL
      servers.
      
      The current code will malfunction if it sees a server in two cells as it
      assumes global address -> server record mappings and that each server is in
      just one cell.
      
      Further, each server may have multiple addresses - and may have addresses
      of different families (IPv4 and IPv6, say).
      
      To this end, the following structural changes are made:
      
       (1) Server record management is overhauled:
      
           (a) Server records are made independent of cell.  The namespace keeps
           	 track of them, volume records have lists of them and each vnode
           	 has a server on which its callback interest currently resides.
      
           (b) The cell record no longer keeps a list of servers known to be in
           	 that cell.
      
           (c) The server records are now kept in a flat list because there's no
           	 single address to sort on.
      
           (d) Server records are now keyed by their UUID within the namespace.
      
           (e) The addresses for a server are obtained with the VL.GetAddrsU
           	 rather than with VL.GetEntryByName, using the server's UUID as a
           	 parameter.
      
           (f) Cached server records are garbage collected after a period of
           	 non-use and are counted out of existence before purging is allowed
           	 to complete.  This protects the work functions against rmmod.
      
           (g) The servers list is now in /proc/fs/afs/servers.
      
       (2) Volume record management is overhauled:
      
           (a) An RCU-replaceable server list is introduced.  This tracks both
               servers and their corresponding callback interests.
      
           (b) The superblock is now keyed on cell record and numeric volume ID.
      
           (c) The volume record is now tied to the superblock which mounts it,
           	 and is activated when mounted and deactivated when unmounted.
           	 This makes it easier to handle the cache cookie without causing a
           	 double-use in fscache.
      
           (d) The volume record is loaded from the VLDB using VL.GetEntryByNameU
           	 to get the server UUID list.
      
           (e) The volume name is updated if it is seen to have changed when the
           	 volume is updated (the update is keyed on the volume ID).
      
       (3) The vlocation record is got rid of and VLDB records are no longer
           cached.  Sufficient information is stored in the volume record, though
           an update to a volume record is now no longer shared between related
           volumes (volumes come in bundles of three: R/W, R/O and backup).
      
      and the following procedural changes are made:
      
       (1) The fileserver cursor introduced previously is now fleshed out and
           used to iterate over fileservers and their addresses.
      
       (2) Volume status is checked during iteration, and the server list is
           replaced if a change is detected.
      
       (3) Server status is checked during iteration, and the address list is
           replaced if a change is detected.
      
       (4) The abort code is saved into the address list cursor and -ECONNABORTED
           returned in afs_make_call() if a remote abort happened rather than
           translating the abort into an error message.  This allows actions to
           be taken depending on the abort code more easily.
      
           (a) If a VMOVED abort is seen then this is handled by rechecking the
           	 volume and restarting the iteration.
      
           (b) If a VBUSY, VRESTARTING or VSALVAGING abort is seen then this is
               handled by sleeping for a short period and retrying and/or trying
               other servers that might serve that volume.  A message is also
               displayed once until the condition has cleared.
      
           (c) If a VOFFLINE abort is seen, then this is handled as VBUSY for the
           	 moment.
      
           (d) If a VNOVOL abort is seen, the volume is rechecked in the VLDB to
           	 see if it has been deleted; if not, the fileserver is probably
           	 indicating that the volume couldn't be attached and needs
           	 salvaging.
      
           (e) If statfs() sees one of these aborts, it does not sleep, but
           	 rather returns an error, so as not to block the umount program.
      
       (5) The fileserver iteration functions in vnode.c are now merged into
           their callers and more heavily macroised around the cursor.  vnode.c
           is removed.
      
       (6) Operations on a particular vnode are serialised on that vnode because
           the server will lock that vnode whilst it operates on it, so a second
           op sent will just have to wait.
      
       (7) Fileservers are probed with FS.GetCapabilities before being used.
           This is where service upgrade will be done.
      
       (8) A callback interest on a fileserver is set up before an FS operation
           is performed and passed through to afs_make_call() so that it can be
           set on the vnode if the operation returns a callback.  The callback
           interest is passed through to afs_iget() also so that it can be set
           there too.
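
The abort handling in point (4) can be pictured as a small dispatch table. What follows is only a userspace sketch, not the kernel code: the enum names, the action names and sketch_handle_abort() are all hypothetical, the real AFS wire values for the abort codes are deliberately not reproduced, and applying the statfs() exception only to the sleeping cases is an interpretation of (4)(e).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical abort classes; these are NOT the AFS wire values. */
enum sketch_abort {
	SK_VMOVED,
	SK_VBUSY,
	SK_VRESTARTING,
	SK_VSALVAGING,
	SK_VOFFLINE,
	SK_VNOVOL,
	SK_OTHER,
};

/* Hypothetical actions the iteration loop could take. */
enum sketch_action {
	SK_RECHECK_VOLUME_AND_RESTART,	/* (a) VMOVED */
	SK_SLEEP_AND_RETRY,		/* (b), (c) busy-type aborts */
	SK_RECHECK_VLDB,		/* (d) VNOVOL */
	SK_RETURN_ERROR,		/* (e) statfs(), or anything else */
};

static enum sketch_action sketch_handle_abort(enum sketch_abort code,
					      bool is_statfs)
{
	switch (code) {
	case SK_VMOVED:
		return SK_RECHECK_VOLUME_AND_RESTART;
	case SK_VBUSY:
	case SK_VRESTARTING:
	case SK_VSALVAGING:
	case SK_VOFFLINE:	/* handled as VBUSY for the moment */
		/* statfs() must not sleep, so as not to block umount */
		return is_statfs ? SK_RETURN_ERROR : SK_SLEEP_AND_RETRY;
	case SK_VNOVOL:
		return SK_RECHECK_VLDB;
	case SK_OTHER:
		break;
	}
	return SK_RETURN_ERROR;
}
```

Keeping the decision in one function that returns an action, rather than an errno, mirrors the stated aim of acting on the abort code directly.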
      
      In general, record updating is done on an as-needed basis when we try to
      access servers, volumes or vnodes rather than offloading it to work items
      and special threads.
      
      Notes:
      
       (1) Pre AFS-3.4 servers are no longer supported, though this can be added
           back if necessary (AFS-3.4 was released in 1998).
      
       (2) VBUSY is retried forever for the moment at intervals of 1s.
      
       (3) /proc/fs/afs/<cell>/servers no longer exists.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
    • afs: Move server rotation code into its own file · 9cc6fc50
      David Howells authored
      
      
      Move server rotation code into its own file.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
    • afs: Add an address list concept · 8b2a464c
      David Howells authored
      
      
      Add an RCU replaceable address list structure to hold a list of server
      addresses.  The list also holds the
      
      To this end:
      
       (1) A cell's VL server address list can be loaded directly via insmod or
           echo to /proc/fs/afs/cells or dynamically from a DNS query for AFSDB
           or SRV records.
      
       (2) Anyone wanting to use a cell's VL server address must wait until the
           cell record comes online and has tried to obtain some addresses.
      
       (3) An FS server's address list, for the moment, has a single entry that
           is the key to the server list.  This will change in the future when a
           server is instead keyed on its UUID and the VL.GetAddrsU operation is
           used.
      
       (4) An 'address cursor' concept is introduced to handle iteration through
           the address list.  This is passed to afs_make_call() as, in the
           future, stuff (such as abort code) that doesn't outlast the call will
           be returned in it.
      
      In the future, we might want to annotate the list with information about
      how each address fares.  We might then want to propagate such annotations
      over address list replacement.
      
      Whilst we're at it, we allow IPv6 addresses to be specified in
      colon-delimited lists by enclosing them in square brackets.
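
To illustrate the bracket syntax, here is a minimal userspace sketch of walking such a list. It is not the kernel's parser: sketch_next_addr() is a hypothetical helper, it only splits the string and does not validate the addresses, and it returns NULL both at the end of the list and on malformed input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the next address from a colon-delimited list into buf, stripping
 * the square brackets from a bracketed (IPv6) entry.
 *
 * Returns a pointer to the remainder of the list, or NULL if the list is
 * exhausted or the entry is malformed (sketch: the two are not told apart).
 */
static const char *sketch_next_addr(const char *p, char *buf, size_t buflen)
{
	const char *end;
	size_t len;

	assert(buf && buflen > 0);

	if (!p || *p == '\0')
		return NULL;

	if (*p == '[') {
		/* Bracketed entry: colons up to the ']' belong to the address */
		p++;
		end = strchr(p, ']');
		if (!end)
			return NULL;
		len = (size_t)(end - p);
		end++;			/* step over the ']' */
		if (*end != '\0' && *end != ':')
			return NULL;
	} else {
		/* Plain entry: runs up to the next colon or end of string */
		end = strchr(p, ':');
		if (!end)
			end = p + strlen(p);
		len = (size_t)(end - p);
	}

	if (len == 0 || len >= buflen)
		return NULL;
	memcpy(buf, p, len);
	buf[len] = '\0';

	if (*end == ':')
		end++;			/* step over the delimiter */
	return end;
}
```

A caller would loop on the returned pointer until it is NULL, handing each extracted string to a real address parser.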
      
      Signed-off-by: David Howells <dhowells@redhat.com>
    • afs: Overhaul cell database management · 989782dc
      David Howells authored
      
      
      Overhaul the way that the in-kernel AFS client keeps track of cells in the
      following manner:
      
       (1) Cells are now held in an rbtree to make walking them quicker and RCU
           managed (though this is probably overkill).
      
       (2) Cells now have a manager work item that:
      
           (A) Looks after fetching and refreshing the VL server list.
      
           (B) Manages cell record lifetime, including initialisation and
           	 destruction.
      
           (C) Manages cell record caching whereby records are kept around for a
           	 certain time after last use and then destroyed.
      
           (D) Manages the FS-Cache index cookie for a cell.  It is not permitted
           	 for a cookie to be in use twice, so we have to be careful to not
           	 allow a new cell record to exist at the same time as an old record
           	 of the same name.
      
       (3) Each AFS network namespace is given a manager work item that manages
           the cells within it, maintaining a single timer to prod cells into
           updating their DNS records.
      
           This uses the reduce_timer() facility to make the timer expire at the
           soonest timed event that needs happening.
      
       (4) When a module is being unloaded, cells and cell managers are now
           counted out using dec_after_work() to make sure the module text is
           pinned until after the data structures have been cleaned up.
      
       (5) Each cell's VL server list is now protected by a seqlock rather than a
           semaphore.
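
The "single timer that expires at the soonest event" idea in point (3) amounts to a timer whose expiry can only be pulled earlier. The following is a userspace model of that behaviour, not the kernel timer API: struct sketch_timer and sketch_timer_reduce() are hypothetical names, and tick wraparound is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-namespace cell timer. */
struct sketch_timer {
	bool     armed;
	uint64_t expires;	/* absolute time, arbitrary ticks */
};

/*
 * Arm the timer, or pull its expiry earlier if it is already armed.
 * An already-armed timer is never pushed later, so it always fires at
 * the soonest event requested so far.
 */
static void sketch_timer_reduce(struct sketch_timer *t, uint64_t expires)
{
	assert(t);

	if (!t->armed || expires < t->expires) {
		t->expires = expires;
		t->armed = true;
	}
}
```

Each cell can then request its own next update time without knowing about the others; the manager only has to rescan the cells when the timer fires.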
      
      Signed-off-by: David Howells <dhowells@redhat.com>
    • afs: Overhaul permit caching · be080a6f
      David Howells authored
      
      
      Overhaul permit caching in AFS by making it per-vnode and sharing permit
      lists where possible.
      
      When most of the fileserver operations are called, they return a status
      structure indicating the (revised) details of the vnode or vnodes involved
      in the operation.  This includes the access mask derived from the ACL
      (named CallerAccess in the protocol definition file).  This is cacheable
      and if the ACL changes, the server will tell us that it is breaking the
      callback promise, at which point we can discard the currently cached
      permits.
      
      With this patch, the afs_permits structure has, at the end, an array of
      { key, CallerAccess } elements, sorted by key pointer.  This is then cached
      in a hash table so that it can be shared between vnodes with the same
      access permits.
      
      Permit lists can only be shared if they contain the exact same set of
      key->CallerAccess mappings.
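
As a rough illustration of why sorting by key pointer helps, the sketch below compares two permit lists for exact equality. It is a userspace model with hypothetical names (struct sketch_permit, sketch_permits_sort(), sketch_permits_equal()); the actual afs_permits layout and the hash table are not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical { key, CallerAccess } element. */
struct sketch_permit {
	const void *key;	/* stand-in for the key pointer */
	uint32_t    access;	/* CallerAccess mask */
};

/* Order elements by key pointer value. */
static int sketch_permit_cmp(const void *a, const void *b)
{
	uintptr_t ka = (uintptr_t)((const struct sketch_permit *)a)->key;
	uintptr_t kb = (uintptr_t)((const struct sketch_permit *)b)->key;

	return (ka > kb) - (ka < kb);
}

static void sketch_permits_sort(struct sketch_permit *p, size_t n)
{
	qsort(p, n, sizeof(*p), sketch_permit_cmp);
}

/*
 * Two sorted lists may be shared only if they hold exactly the same
 * key -> access mappings; with both sorted, one linear pass decides it.
 */
static bool sketch_permits_equal(const struct sketch_permit *a, size_t na,
				 const struct sketch_permit *b, size_t nb)
{
	size_t i;

	if (na != nb)
		return false;
	for (i = 0; i < na; i++)
		if (a[i].key != b[i].key || a[i].access != b[i].access)
			return false;
	return true;
}
```

Because the order is canonical, the same comparison can back a hash-table lookup: lists that hash alike are confirmed with a single pass.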
      
      Note that that table is global rather than being per-net_ns.  If the keys
      in a permit list cross net_ns boundaries, there is no problem sharing the
      cached permits, since the permits are just integer masks.
      
      Since permit lists pin keys, the permit cache also makes it easier for a
      future patch to find all occurrences of a key and remove them by means of
      setting the afs_permits::invalidated flag and then clearing the appropriate
      key pointer.  In such an event, memory barriers will need adding.
      
      Lastly, the permit caching is skipped if the server has sent either a
      vnode-specific or an entire-server callback break since the start of the
      operation.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
    • afs: Overhaul the callback handling · c435ee34
      David Howells authored
      
      
      Overhaul the AFS callback handling by the following means:
      
       (1) Don't give up callback promises on vnodes that we are no longer using,
           rather let them just expire on the server or let the server break
           them.  This is actually more efficient for the server as the callback
           lookup is expensive if there are lots of extant callbacks.
      
       (2) Only give up the callback promises we have from a server when the
           server record is destroyed.  Then we can just give up *all* the
           callback promises on it in one go.
      
       (3) Servers can end up being shared between cells if cells are aliased, so
           don't add all the vnodes being backed by a particular server into a
           big FID-indexed tree on that server as there may be duplicates.
      
           Instead have each volume instance (~= superblock) register an interest
           in a server as it starts to make use of it and use this to allow the
           processor for callbacks from the server to find the superblock and
           thence the inode corresponding to the FID being broken by means of
           ilookup_nowait().
      
       (4) Rather than iterating over the entire callback list when a mass-break
           comes in from the server, maintain a counter of mass-breaks in
           afs_server (cb_seq) and make afs_validate() check it against the copy
           in afs_vnode.
      
           It would be nice not to have to take a read_lock whilst doing this,
           but that's tricky without using RCU.
      
       (5) Save a ref on the fileserver we're using for a call in the afs_call
           struct so that we can access its cb_s_break during call decoding.
      
       (6) Write-lock around callback and status storage in a vnode and read-lock
           around getattr so that we don't see the status mid-update.
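
The mass-break counter in point (4) can be modelled as a sequence number on the server that each vnode snapshots when it receives a callback promise. This is a userspace sketch only: the struct and function names are hypothetical, the field names do not claim to match afs_server or afs_vnode, and the locking described in points (4) and (6) is omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical server record: counts mass callback breaks. */
struct sketch_server {
	unsigned int cb_break;
};

/* Hypothetical vnode record: remembers the count seen at promise time. */
struct sketch_vnode {
	const struct sketch_server *server;
	unsigned int cb_break_seen;
	bool cb_promised;
};

/* The server broke every callback at once: just bump the counter. */
static void sketch_mass_break(struct sketch_server *s)
{
	s->cb_break++;
}

/* An operation returned a callback promise for this vnode. */
static void sketch_record_promise(struct sketch_vnode *v)
{
	assert(v->server);
	v->cb_break_seen = v->server->cb_break;
	v->cb_promised = true;
}

/*
 * Returns true if the cached status may still be trusted.  A mismatch
 * between the snapshot and the server counter means a mass break arrived
 * after the promise, so the caller must refetch the status.
 */
static bool sketch_validate(struct sketch_vnode *v)
{
	if (v->cb_promised && v->cb_break_seen == v->server->cb_break)
		return true;
	v->cb_promised = false;
	return false;
}
```

The mass break itself becomes an O(1) increment; the cost is deferred to whichever vnode is next validated, which matches the consequences listed after this point.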
      
      This has the following consequences:
      
       (1) Data invalidation isn't seen until someone calls afs_validate() on a
           vnode.  Unfortunately, we need to use a key to query the server, but
           getting one from a background thread is tricky without caching loads
           of keys all over the place.
      
       (2) Mass invalidation isn't seen until someone calls afs_validate().
      
       (3) Callback breaking is going to hit the inode_hash_lock quite a bit.
           Could this be replaced with rcu_read_lock(), since inodes are
           destroyed under RCU conditions?
      
      Signed-off-by: David Howells <dhowells@redhat.com>