Skip to content
  1. Jun 07, 2021
  2. May 27, 2021
  3. May 25, 2021
  4. May 01, 2021
    • David Howells's avatar
      afs: Fix speculative status fetches · 22650f14
      David Howells authored
      
      
      The generic/464 xfstest causes kAFS to emit occasional warnings of the
      form:
      
              kAFS: vnode modified {100055:8a} 30->31 YFS.StoreData64 (c=6015)
      
      This indicates that the data version received back from the server did not
      match the expected value (the DV should be incremented monotonically for
      each individual modification op committed to a vnode).
      
      What is happening is that a lookup call is doing a bulk status fetch
      speculatively on a bunch of vnodes in a directory besides getting the
      status of the vnode it's actually interested in.  This is racing with a
      StoreData operation (though it could also occur with, say, a MakeDir op).
      
      On the client, a modification operation locks the vnode, but the bulk
      status fetch only locks the parent directory, so no ordering is imposed
      there (thereby avoiding an avenue to deadlock).
      
      On the server, the StoreData op handler doesn't lock the vnode until it's
      received all the request data, and downgrades the lock after committing the
      data until it has finished sending change notifications to other clients -
      which allows the status fetch to occur before it has finished.
      
      This means that:
      
       - a status fetch can access the target vnode either side of the exclusive
         section of the modification
      
       - the status fetch could start before the modification, yet finish after,
         and vice-versa.
      
       - the status fetch and the modification RPCs can complete in either order.
      
       - the status fetch can return either the before or the after DV from the
         modification.
      
       - the status fetch might regress the locally cached DV.
      
      Some of these are handled by the previous fix[1], but that's not sufficient
      because it checks the DV it received against the DV it cached at the start
      of the op, but the DV might've been updated in the meantime by a locally
      generated modification op.
      
      Fix this by the following means:
      
       (1) Keep track of when we're performing a modification operation on a
           vnode.  This is done by marking vnode parameters with a 'modification'
           note that causes the AFS_VNODE_MODIFYING flag to be set on the vnode
           for the duration.
      
       (2) Alter the speculation race detection to ignore speculative status
           fetches if either the vnode is marked as being modified or the data
           version number is not what we expected.
      
      Note that whilst the "vnode modified" warning does get recovered from as it
      causes the client to refetch the status at the next opportunity, it will
      also invalidate the pagecache, so changes might get lost.
      
      Fixes: a9e5c87c ("afs: Fix speculative status fetch going out of order wrt to modifications")
      Reported-by: default avatarMarc Dionne <marc.dionne@auristor.com>
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Tested-and-reviewed-by: default avatarMarc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
      Link: https://lore.kernel.org/r/160605082531.252452.14708077925602709042.stgit@warthog.procyon.org.uk/ [1]
      Link: https://lore.kernel.org/linux-fsdevel/161961335926.39335.2552653972195467566.stgit@warthog.procyon.org.uk/
      
       # v1
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      22650f14
  5. Apr 23, 2021
  6. Mar 23, 2021
  7. Mar 15, 2021
  8. Mar 08, 2021
  9. Jan 30, 2021
  10. Jan 24, 2021
    • Christian Brauner's avatar
      fs: make helpers idmap mount aware · 549c7297
      Christian Brauner authored
      Extend some inode methods with an additional user namespace argument. A
      filesystem that is aware of idmapped mounts will receive the user
      namespace the mount has been marked with. This can be used for
      additional permission checking and also to enable filesystems to
      translate between uids and gids if they need to. We have implemented all
      relevant helpers in earlier patches.
      
      As requested we simply extend the exisiting inode method instead of
      introducing new ones. This is a little more code churn but it's mostly
      mechanical and doesnt't leave us with additional inode methods.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com
      
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      549c7297
    • Christian Brauner's avatar
      stat: handle idmapped mounts · 0d56a451
      Christian Brauner authored
      The generic_fillattr() helper fills in the basic attributes associated
      with an inode. Enable it to handle idmapped mounts. If the inode is
      accessed through an idmapped mount map it into the mount's user
      namespace before we store the uid and gid. If the initial user namespace
      is passed nothing changes so non-idmapped mounts will see identical
      behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-12-christian.brauner@ubuntu.com
      
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJames Morris <jamorris@linux.microsoft.com>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      0d56a451
    • Christian Brauner's avatar
      acl: handle idmapped mounts · e65ce2a5
      Christian Brauner authored
      The posix acl permission checking helpers determine whether a caller is
      privileged over an inode according to the acls associated with the
      inode. Add helpers that make it possible to handle acls on idmapped
      mounts.
      
      The vfs and the filesystems targeted by this first iteration make use of
      posix_acl_fix_xattr_from_user() and posix_acl_fix_xattr_to_user() to
      translate basic posix access and default permissions such as the
      ACL_USER and ACL_GROUP type according to the initial user namespace (or
      the superblock's user namespace) to and from the caller's current user
      namespace. Adapt these two helpers to handle idmapped mounts whereby we
      either map from or into the mount's user namespace depending on in which
      direction we're translating.
      Similarly, cap_convert_nscap() is used by the vfs to translate user
      namespace and non-user namespace aware filesystem capabilities from the
      superblock's user namespace to the caller's user namespace. Enable it to
      handle idmapped mounts by accounting for the mount's user namespace.
      
      In addition the fileystems targeted in the first iteration of this patch
      series make use of the posix_acl_chmod() and, posix_acl_update_mode()
      helpers. Both helpers perform permission checks on the target inode. Let
      them handle idmapped mounts. These two helpers are called when posix
      acls are set by the respective filesystems to handle this case we extend
      the ->set() method to take an additional user namespace argument to pass
      the mount's user namespace down.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-9-christian.brauner@ubuntu.com
      
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
      e65ce2a5
  11. Jan 04, 2021
    • David Howells's avatar
      afs: Fix directory entry size calculation · 366911cd
      David Howells authored
      
      
      The number of dirent records used by an AFS directory entry should be
      calculated using the assumption that there is a 16-byte name field in the
      first block, rather than a 20-byte name field (which is actually the case).
      This miscalculation is historic and effectively standard, so we have to use
      it.
      
      The calculation we need to use is:
      
      	1 + (((strlen(name) + 1) + 15) >> 5)
      
      where we are adding one to the strlen() result to account for the NUL
      termination.
      
      Fix this by the following means:
      
       (1) Create an inline function to do the calculation for a given name
           length.
      
       (2) Use the function to calculate the number of records used for a dirent
           in afs_dir_iterate_block().
      
           Use this to move the over-end check out of the loop since it only
           needs to be done once.
      
           Further use this to only go through the loop for the 2nd+ records
           composing an entry.  The only test there now is for if the record is
           allocated - and we already checked the first block at the top of the
           outer loop.
      
       (3) Add a max name length check in afs_dir_iterate_block().
      
       (4) Make afs_edit_dir_add() and afs_edit_dir_remove() use the function
           from (1) to calculate the number of blocks rather than doing it
           incorrectly themselves.
      
      Fixes: 63a4681f ("afs: Locally edit directory data for mkdir/create/unlink/...")
      Fixes: ^1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Tested-by: default avatarMarc Dionne <marc.dionne@auristor.com>
      366911cd
    • David Howells's avatar
      afs: Work around strnlen() oops with CONFIG_FORTIFIED_SOURCE=y · 26982a89
      David Howells authored
      
      
      AFS has a structured layout in its directory contents (AFS dirs are
      downloaded as files and parsed locally by the client for lookup/readdir).
      The slots in the directory are defined by union afs_xdr_dirent.  This,
      however, only directly allows a name of a length that will fit into that
      union.  To support a longer name, the next 1-8 contiguous entries are
      annexed to the first one and the name flows across these.
      
      afs_dir_iterate_block() uses strnlen(), limited to the space to the end of
      the page, to find out how long the name is.  This worked fine until
      6a39e62a.  With that commit, the compiler determines the size of the
      array and asserts that the string fits inside that array.  This is a
      problem for AFS because we *expect* it to overflow one or more arrays.
      
      A similar problem also occurs in afs_dir_scan_block() when a directory file
      is being locally edited to avoid the need to redownload it.  There strlen()
      was being used safely because each page has the last byte set to 0 when the
      file is downloaded and validated (in afs_dir_check_page()).
      
      Fix this by changing the afs_xdr_dirent union name field to an
      indeterminate-length array and dropping the overflow field.
      
      (Note that whilst looking at this, I realised that the calculation of the
      number of slots a dirent used is non-standard and not quite right, but I'll
      address that in a separate patch.)
      
      The issue can be triggered by something like:
      
              touch /afs/example.com/thisisaveryveryverylongname
      
      and it generates a report that looks like:
      
              detected buffer overflow in strnlen
              ------------[ cut here ]------------
              kernel BUG at lib/string.c:1149!
              ...
              RIP: 0010:fortify_panic+0xf/0x11
              ...
              Call Trace:
               afs_dir_iterate_block+0x12b/0x35b
               afs_dir_iterate+0x14e/0x1ce
               afs_do_lookup+0x131/0x417
               afs_lookup+0x24f/0x344
               lookup_open.isra.0+0x1bb/0x27d
               open_last_lookups+0x166/0x237
               path_openat+0xe0/0x159
               do_filp_open+0x48/0xa4
               ? kmem_cache_alloc+0xf5/0x16e
               ? __clear_close_on_exec+0x13/0x22
               ? _raw_spin_unlock+0xa/0xb
               do_sys_openat2+0x72/0xde
               do_sys_open+0x3b/0x58
               do_syscall_64+0x2d/0x3a
               entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: 6a39e62a ("lib: string.h: detect intra-object overflow in fortified string functions")
      Reported-by: default avatarMarc Dionne <marc.dionne@auristor.com>
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Tested-by: default avatarMarc Dionne <marc.dionne@auristor.com>
      cc: Daniel Axtens <dja@axtens.net>
      26982a89
  12. Dec 08, 2020
    • David Howells's avatar
      afs: Fix memory leak when mounting with multiple source parameters · 4cb68296
      David Howells authored
      
      
      There's a memory leak in afs_parse_source() whereby multiple source=
      parameters overwrite fc->source in the fs_context struct without freeing
      the previously recorded source.
      
      Fix this by only permitting a single source parameter and rejecting with
      an error all subsequent ones.
      
      This was caught by syzbot with the kernel memory leak detector, showing
      something like the following trace:
      
        unreferenced object 0xffff888114375440 (size 32):
          comm "repro", pid 5168, jiffies 4294923723 (age 569.948s)
          backtrace:
            slab_post_alloc_hook+0x42/0x79
            __kmalloc_track_caller+0x125/0x16a
            kmemdup_nul+0x24/0x3c
            vfs_parse_fs_string+0x5a/0xa1
            generic_parse_monolithic+0x9d/0xc5
            do_new_mount+0x10d/0x15a
            do_mount+0x5f/0x8e
            __do_sys_mount+0xff/0x127
            do_syscall_64+0x2d/0x3a
            entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: 13fcc683 ("afs: Add fs_context support")
      Reported-by: default avatar <syzbot+86dc6632faaca40133ab@syzkaller.appspotmail.com>
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4cb68296
  13. Nov 22, 2020
    • David Howells's avatar
      afs: Fix speculative status fetch going out of order wrt to modifications · a9e5c87c
      David Howells authored
      
      
      When doing a lookup in a directory, the afs filesystem uses a bulk
      status fetch to speculatively retrieve the statuses of up to 48 other
      vnodes found in the same directory and it will then either update extant
      inodes or create new ones - effectively doing 'lookup ahead'.
      
      To avoid the possibility of deadlocking itself, however, the filesystem
      doesn't lock all of those inodes; rather just the directory inode is
      locked (by the VFS).
      
      When the operation completes, afs_inode_init_from_status() or
      afs_apply_status() is called, depending on whether the inode already
      exists, to commit the new status.
      
      A case exists, however, where the speculative status fetch operation may
      straddle a modification operation on one of those vnodes.  What can then
      happen is that the speculative bulk status RPC retrieves the old status,
      and whilst that is happening, the modification happens - which returns
      an updated status, then the modification status is committed, then we
      attempt to commit the speculative status.
      
      This results in something like the following being seen in dmesg:
      
      	kAFS: vnode modified {100058:861} 8->9 YFS.InlineBulkStatus
      
      showing that for vnode 861 on volume 100058, we saw YFS.InlineBulkStatus
      say that the vnode had data version 8 when we'd already recorded version
      9 due to a local modification.  This was causing the cache to be
      invalidated for that vnode when it shouldn't have been.  If it happens
      on a data file, this might lead to local changes being lost.
      
      Fix this by ignoring speculative status updates if the data version
      doesn't match the expected value.
      
      Note that it is possible to get a DV regression if a volume gets
      restored from a backup - but we should get a callback break in such a
      case that should trigger a recheck anyway.  It might be worth checking
      the volume creation time in the volsync info and, if a change is
      observed in that (as would happen on a restore), invalidate all caches
      associated with the volume.
      
      Fixes: 5cf9dd55 ("afs: Prospectively look up extra files when doing a single lookup")
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a9e5c87c
  14. Nov 14, 2020
  15. Nov 03, 2020
    • David Howells's avatar
      afs: Fix incorrect freeing of the ACL passed to the YFS ACL store op · f4c79144
      David Howells authored
      
      
      The cleanup for the yfs_store_opaque_acl2_operation calls the wrong
      function to destroy the ACL content buffer.  It's an afs_acl struct, not
      a yfs_acl struct - and the free function for latter may pass invalid
      pointers to kfree().
      
      Fix this by using the afs_acl_put() function.  The yfs_acl_put()
      function is then no longer used and can be removed.
      
      	general protection fault, probably for non-canonical address 0x7ebde00000000: 0000 [#1] SMP PTI
      	...
      	RIP: 0010:compound_head+0x0/0x11
      	...
      	Call Trace:
      	 virt_to_cache+0x8/0x51
      	 kfree+0x5d/0x79
      	 yfs_free_opaque_acl+0x16/0x29
      	 afs_put_operation+0x60/0x114
      	 __vfs_setxattr+0x67/0x72
      	 __vfs_setxattr_noperm+0x66/0xe9
      	 vfs_setxattr+0x67/0xce
      	 setxattr+0x14e/0x184
      	 __do_sys_fsetxattr+0x66/0x8f
      	 do_syscall_64+0x2d/0x3a
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: e49c7b2f ("afs: Build an abstraction around an "operation" concept")
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f4c79144
    • David Howells's avatar
      afs: Fix warning due to unadvanced marshalling pointer · c80afa1d
      David Howells authored
      
      
      When using the afs.yfs.acl xattr to change an AuriStor ACL, a warning
      can be generated when the request is marshalled because the buffer
      pointer isn't increased after adding the last element, thereby
      triggering the check at the end if the ACL wasn't empty.  This just
      causes something like the following warning, but doesn't stop the call
      from happening successfully:
      
          kAFS: YFS.StoreOpaqueACL2: Request buffer underflow (36<108)
      
      Fix this simply by increasing the count prior to the check.
      
      Fixes: f5e45463 ("afs: Implement YFS ACL setting")
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c80afa1d
  16. Oct 29, 2020
    • David Howells's avatar
      afs: Fix dirty-region encoding on ppc32 with 64K pages · 2d9900f2
      David Howells authored
      
      
      The dirty region bounds stored in page->private on an afs page are 15 bits
      on a 32-bit box and can, at most, represent a range of up to 32K within a
      32K page with a resolution of 1 byte.  This is a problem for powerpc32 with
      64K pages enabled.
      
      Further, transparent huge pages may get up to 2M, which will be a problem
      for the afs filesystem on all 32-bit arches in the future.
      
      Fix this by decreasing the resolution.  For the moment, a 64K page will
      have a resolution determined from PAGE_SIZE.  In the future, the page will
      need to be passed in to the helper functions so that the page size can be
      assessed and the resolution determined dynamically.
      
      Note that this might not be the ideal way to handle this, since it may
      allow some leakage of undirtied zero bytes to the server's copy in the case
      of a 3rd-party conflict.  Fixing that would require a separately allocated
      record and is a more complicated fix.
      
      Fixes: 4343d008 ("afs: Get rid of the afs_writeback record")
      Reported-by: default avatarkernel test robot <lkp@intel.com>
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Reviewed-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      2d9900f2
    • David Howells's avatar
      afs: Fix afs_invalidatepage to adjust the dirty region · f86726a6
      David Howells authored
      
      
      Fix afs_invalidatepage() to adjust the dirty region recorded in
      page->private when truncating a page.  If the dirty region is entirely
      removed, then the private data is cleared and the page dirty state is
      cleared.
      
      Without this, if the page is truncated and then expanded again by truncate,
      zeros from the expanded, but no-longer dirty region may get written back to
      the server if the page gets laundered due to a conflicting 3rd-party write.
      
      It mustn't, however, shorten the dirty region of the page if that page is
      still mmapped and has been marked dirty by afs_page_mkwrite(), so a flag is
      stored in page->private to record this.
      
      Fixes: 4343d008 ("afs: Get rid of the afs_writeback record")
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      f86726a6
    • David Howells's avatar
      afs: Alter dirty range encoding in page->private · 65dd2d60
      David Howells authored
      
      
      Currently, page->private on an afs page is used to store the range of
      dirtied data within the page, where the range includes the lower bound, but
      excludes the upper bound (e.g. 0-1 is a range covering a single byte).
      
      This, however, requires a superfluous bit for the last-byte bound so that
      on a 4KiB page, it can say 0-4096 to indicate the whole page, the idea
      being that having both numbers the same would indicate an empty range.
      This is unnecessary as the PG_private bit is clear if it's an empty range
      (as is PG_dirty).
      
      Alter the way the dirty range is encoded in page->private such that the
      upper bound is reduced by 1 (e.g. 0-0 is then specified the same single
      byte range mentioned above).
      
      Applying this to both bounds frees up two bits, one of which can be used in
      a future commit.
      
      This allows the afs filesystem to be compiled on ppc32 with 64K pages;
      without this, the following warnings are seen:
      
      ../fs/afs/internal.h: In function 'afs_page_dirty_to':
      ../fs/afs/internal.h:881:15: warning: right shift count >= width of type [-Wshift-count-overflow]
        881 |  return (priv >> __AFS_PAGE_PRIV_SHIFT) & __AFS_PAGE_PRIV_MASK;
            |               ^~
      ../fs/afs/internal.h: In function 'afs_page_dirty':
      ../fs/afs/internal.h:886:28: warning: left shift count >= width of type [-Wshift-count-overflow]
        886 |  return ((unsigned long)to << __AFS_PAGE_PRIV_SHIFT) | from;
            |                            ^~
      
      Fixes: 4343d008 ("afs: Get rid of the afs_writeback record")
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      65dd2d60
    • David Howells's avatar
      afs: Wrap page->private manipulations in inline functions · 185f0c70
      David Howells authored
      
      
      The afs filesystem uses page->private to store the dirty range within a
      page such that in the event of a conflicting 3rd-party write to the server,
      we write back just the bits that got changed locally.
      
      However, there are a couple of problems with this:
      
       (1) I need a bit to note if the page might be mapped so that partial
           invalidation doesn't shrink the range.
      
       (2) There aren't necessarily sufficient bits to store the entire range of
           data altered (say it's a 32-bit system with 64KiB pages or transparent
           huge pages are in use).
      
      So wrap the accesses in inline functions so that future commits can change
      how this works.
      
      Also move them out of the tracing header into the in-directory header.
      There's not really any need for them to be in the tracing header.
      
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      185f0c70
    • David Howells's avatar
      afs: Fix where page->private is set during write · f792e3ac
      David Howells authored
      
      
      In afs, page->private is set to indicate the dirty region of a page.  This
      is done in afs_write_begin(), but that can't take account of whether the
      copy into the page actually worked.
      
      Fix this by moving the change of page->private into afs_write_end().
      
      Fixes: 4343d008 ("afs: Get rid of the afs_writeback record")
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      f792e3ac
    • David Howells's avatar
      afs: Fix page leak on afs_write_begin() failure · 21db2cdc
      David Howells authored
      
      
      Fix the leak of the target page in afs_write_begin() when it fails.
      
      Fixes: 15b4650e ("afs: convert to new aops")
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      cc: Nick Piggin <npiggin@gmail.com>
      21db2cdc
    • David Howells's avatar
      afs: Fix to take ref on page when PG_private is set · fa04a40b
      David Howells authored
      
      
      Fix afs to take a ref on a page when it sets PG_private on it and to drop
      the ref when removing the flag.
      
      Note that in afs_write_begin(), a lot of the time, PG_private is already
      set on a page to which we're going to add some data.  In such a case, we
      leave the bit set and mustn't increment the page count.
      
      As suggested by Matthew Wilcox, use attach/detach_page_private() where
      possible.
      
      Fixes: 31143d5d ("AFS: implement basic file write support")
      Reported-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Reviewed-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      fa04a40b
Loading