  Jun 03, 2021
      ext4: fix accessing uninit percpu counter variable with fast_commit · b45f189a
      Ritesh Harjani authored
      
      
      When running generic/527 with fast_commit configuration, the following
      issue is seen on Power.  With fast_commit, during ext4_fc_replay()
      (which can be called from ext4_fill_super()), if inode eviction
      happens then it can access an uninitialized percpu counter variable.
      
      This patch adds a check before accessing the counters in the
      ext4_free_inode() path.
      
      [  321.165371] run fstests generic/527 at 2021-04-29 08:38:43
      [  323.027786] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: block_validity. Quota mode: none.
      [  323.618772] BUG: Unable to handle kernel data access on read at 0x1fbd80000
      [  323.619767] Faulting instruction address: 0xc000000000bae78c
      cpu 0x1: Vector: 300 (Data Access) at [c000000010706ef0]
          pc: c000000000bae78c: percpu_counter_add_batch+0x3c/0x100
          lr: c0000000006d0bb0: ext4_free_inode+0x780/0xb90
          pid   = 5593, comm = mount
      	ext4_free_inode+0x780/0xb90
      	ext4_evict_inode+0xa8c/0xc60
      	evict+0xfc/0x1e0
      	ext4_fc_replay+0xc50/0x20f0
      	do_one_pass+0xfe0/0x1350
      	jbd2_journal_recover+0x184/0x2e0
      	jbd2_journal_load+0x1c0/0x4a0
      	ext4_fill_super+0x2458/0x4200
      	mount_bdev+0x1dc/0x290
      	ext4_mount+0x28/0x40
      	legacy_get_tree+0x4c/0xa0
      	vfs_get_tree+0x4c/0x120
      	path_mount+0xcf8/0xd70
      	do_mount+0x80/0xd0
      	sys_mount+0x3fc/0x490
      	system_call_exception+0x384/0x3d0
      	system_call_common+0xec/0x278
      
      Cc: stable@kernel.org
      Fixes: 8016e29f ("ext4: fast commit recovery path")
      Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
      Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
      Link: https://lore.kernel.org/r/6cceb9a75c54bef8fa9696c1b08c8df5ff6169e2.1619692410.git.riteshh@linux.ibm.com
      
      
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  May 27, 2021
      btrfs: fix deadlock when cloning inline extents and low on available space · 76a6d5cd
      Filipe Manana authored
      
      
      There are a few cases where cloning an inline extent requires copying data
      into a page of the destination inode. For these cases we are allocating
      the required data and metadata space while holding a leaf locked. This can
      result in a deadlock when we are low on available space because allocating
      the space may flush delalloc and two deadlock scenarios can happen:
      
      1) When starting writeback for an inode with a very small dirty range that
         fits in an inline extent, we deadlock during the writeback when trying
         to insert the inline extent, at cow_file_range_inline(), if the extent
         is going to be located in the leaf for which we are already holding a
         read lock;
      
      2) After successfully starting writeback, for non-inline extent cases,
         the async reclaim thread will hang waiting for an ordered extent to
         complete if the ordered extent completion needs to modify the leaf
         for which the clone task is holding a read lock (for adding or
         replacing file extent items). So the cloning task will wait forever
         on the async reclaim thread to make progress, which in turn is
         waiting for the ordered extent completion which in turn is waiting
         to acquire a write lock on the same leaf.
      
      So fix this by making sure we release the path (and therefore the leaf)
      every time we need to copy the inline extent's data into a page of the
      destination inode, as by that time we do not need to have the leaf locked.
      
      Fixes: 05a5a762 ("Btrfs: implement full reflink support for inline extents")
      CC: stable@vger.kernel.org # 5.10+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      btrfs: fix fsync failure and transaction abort after writes to prealloc extents · ea7036de
      Filipe Manana authored
      
      
      When doing a series of partial writes to different ranges of preallocated
      extents with transaction commits and fsyncs in between, we can end up with
      checksum items in a log tree that have overlapping ranges. This causes an
      fsync to fail with -EIO and abort the transaction, turning the filesystem
      read-only, when syncing the log.
      
      For this to happen, we need to have a full fsync of a file following one
      or more fast fsyncs.
      
      The following example reproduces the problem and explains how it happens:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
      
        # Create our test file with 2 preallocated extents. Leave a 1M hole
        # between them to ensure that we get two file extent items that will
        # never be merged into a single one. The extents are contiguous on disk,
        # which will later result in the checksums for their data to be merged
        # into a single checksum item in the csums btree.
        #
        $ xfs_io -f \
                 -c "falloc 0 1M" \
                 -c "falloc 3M 3M" \
                 /mnt/foobar
      
        # Now write to the second extent and leave only 1M of it as unwritten,
        # which corresponds to the file range [4M, 5M[.
        #
        # Then fsync the file to flush delalloc and to clear full sync flag from
        # the inode, so that a future fsync will use the fast code path.
        #
        # After the writeback triggered by the fsync we have 3 file extent items
        # that point to the second extent we previously allocated:
        #
        # 1) One file extent item of type BTRFS_FILE_EXTENT_REG that covers the
        #    file range [3M, 4M[
        #
        # 2) One file extent item of type BTRFS_FILE_EXTENT_PREALLOC that covers
        #    the file range [4M, 5M[
        #
        # 3) One file extent item of type BTRFS_FILE_EXTENT_REG that covers the
        #    file range [5M, 6M[
        #
        # All these file extent items have a generation of 6, which is the ID of
        # the transaction where they were created. The split of the original file
        # extent item is done at btrfs_mark_extent_written() when ordered extents
        # complete for the file ranges [3M, 4M[ and [5M, 6M[.
        #
        $ xfs_io -c "pwrite -S 0xab 3M 1M" \
                 -c "pwrite -S 0xef 5M 1M" \
                 -c "fsync" \
                 /mnt/foobar
      
        # Commit the current transaction. This wipes out the log tree created by
        # the previous fsync.
        sync
      
        # Now write to the unwritten range of the second extent we allocated,
        # corresponding to the file range [4M, 5M[, and fsync the file, which
        # triggers the fast fsync code path.
        #
        # The fast fsync code path sees that there is a new extent map covering
        # the file range [4M, 5M[ and therefore it will log a checksum item
        # covering the range [1M, 2M[ of the second extent we allocated.
        #
        # Also, after the fsync finishes we no longer have the 3 file extent
        # items that pointed to 3 sections of the second extent we allocated.
        # Instead we end up with a single file extent item pointing to the whole
        # extent, with a type of BTRFS_FILE_EXTENT_REG and a generation of 7 (the
        # current transaction ID). This is due to the file extent item merging we
        # do when completing ordered extents into ranges that point to unwritten
        # (preallocated) extents. This merging is done at
        # btrfs_mark_extent_written().
        #
        $ xfs_io -c "pwrite -S 0xcd 4M 1M" \
                 -c "fsync" \
                 /mnt/foobar
      
        # Now do some write to our file outside the range of the second extent
        # that we allocated with fallocate() and truncate the file size from 6M
        # down to 5M.
        #
        # The truncate operation sets the full sync runtime flag on the inode,
        # forcing the next fsync to use the slow code path. It also changes the
        # length of the second file extent item so that it represents the file
        # range [3M, 5M[ and not the range [3M, 6M[ anymore.
        #
        # Finally fsync the file. Since this is a fsync that triggers the slow
        # code path, it will remove all items associated to the inode from the
        # log tree and then it will scan for file extent items in the
        # fs/subvolume tree that have a generation matching the current
        # transaction ID, which is 7. This means it will log 2 file extent
        # items:
        #
        # 1) One for the first extent we allocated, covering the file range
        #    [0, 1M[
        #
        # 2) Another for the first 2M of the second extent we allocated,
        #    covering the file range [3M, 5M[
        #
        # When logging the first file extent item we log a single checksum item
        # that has all the checksums for the entire extent.
        #
        # When logging the second file extent item, we also lookup for the
        # checksums that are associated with the range [0, 2M[ of the second
        # extent we allocated (file range [3M, 5M[), and then we log them with
        # btrfs_csum_file_blocks(). However that results in ending up with a log
        # that has two checksum items with ranges that overlap:
        #
        # 1) One for the range [1M, 2M[ of the second extent we allocated,
        #    corresponding to the file range [4M, 5M[, which we logged in the
        #    previous fsync that used the fast code path;
        #
        # 2) One for the ranges [0, 1M[ and [0, 2M[ of the first and second
        #    extents, respectively, corresponding to the file ranges [0, 1M[
        #    and [3M, 5M[. This one was added during this last fsync that uses
        #    the slow code path and overlaps with the previous one logged by
        #    the previous fast fsync.
        #
        # This happens because when logging the checksums for the second
        # extent, we notice they start at an offset that matches the end of the
        # checksums item that we logged for the first extent, and because both
        # extents are contiguous on disk, btrfs_csum_file_blocks() decides to
        # extend that existing checksums item and append the checksums for the
        # second extent to this item. The end result is we end up with two
        # checksum items in the log tree that have overlapping ranges, as
        # listed before, causing the fsync to fail with -EIO and abort the
        # transaction, turning the filesystem read-only.
        #
        $ xfs_io -c "pwrite -S 0xff 0 1M" \
                 -c "truncate 5M" \
                 -c "fsync" \
                 /mnt/foobar
        fsync: Input/output error
      
      After running the example, dmesg/syslog shows the tree checker complained
      about the checksum items with overlapping ranges and we aborted the
      transaction:
      
        $ dmesg
        (...)
        [756289.557487] BTRFS critical (device sdc): corrupt leaf: root=18446744073709551610 block=30720000 slot=5, csum end range (16777216) goes beyond the start range (15728640) of the next csum item
        [756289.560583] BTRFS info (device sdc): leaf 30720000 gen 7 total ptrs 7 free space 11677 owner 18446744073709551610
        [756289.562435] BTRFS info (device sdc): refs 2 lock_owner 0 current 2303929
        [756289.563654] 	item 0 key (257 1 0) itemoff 16123 itemsize 160
        [756289.564649] 		inode generation 6 size 5242880 mode 100600
        [756289.565636] 	item 1 key (257 12 256) itemoff 16107 itemsize 16
        [756289.566694] 	item 2 key (257 108 0) itemoff 16054 itemsize 53
        [756289.567725] 		extent data disk bytenr 13631488 nr 1048576
        [756289.568697] 		extent data offset 0 nr 1048576 ram 1048576
        [756289.569689] 	item 3 key (257 108 1048576) itemoff 16001 itemsize 53
        [756289.570682] 		extent data disk bytenr 0 nr 0
        [756289.571363] 		extent data offset 0 nr 2097152 ram 2097152
        [756289.572213] 	item 4 key (257 108 3145728) itemoff 15948 itemsize 53
        [756289.573246] 		extent data disk bytenr 14680064 nr 3145728
        [756289.574121] 		extent data offset 0 nr 2097152 ram 3145728
        [756289.574993] 	item 5 key (18446744073709551606 128 13631488) itemoff 12876 itemsize 3072
        [756289.576113] 	item 6 key (18446744073709551606 128 15728640) itemoff 11852 itemsize 1024
        [756289.577286] BTRFS error (device sdc): block=30720000 write time tree block corruption detected
        [756289.578644] ------------[ cut here ]------------
        [756289.579376] WARNING: CPU: 0 PID: 2303929 at fs/btrfs/disk-io.c:465 csum_one_extent_buffer+0xed/0x100 [btrfs]
        [756289.580857] Modules linked in: btrfs dm_zero dm_dust loop dm_snapshot (...)
        [756289.591534] CPU: 0 PID: 2303929 Comm: xfs_io Tainted: G        W         5.12.0-rc8-btrfs-next-87 #1
        [756289.592580] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
        [756289.594161] RIP: 0010:csum_one_extent_buffer+0xed/0x100 [btrfs]
        [756289.595122] Code: 5d c3 e8 76 60 (...)
        [756289.597509] RSP: 0018:ffffb51b416cb898 EFLAGS: 00010282
        [756289.598142] RAX: 0000000000000000 RBX: fffff02b8a365bc0 RCX: 0000000000000000
        [756289.598970] RDX: 0000000000000000 RSI: ffffffffa9112421 RDI: 00000000ffffffff
        [756289.599798] RBP: ffffa06500880000 R08: 0000000000000000 R09: 0000000000000000
        [756289.600619] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
        [756289.601456] R13: ffffa0652b1d8980 R14: ffffa06500880000 R15: 0000000000000000
        [756289.602278] FS:  00007f08b23c9800(0000) GS:ffffa0682be00000(0000) knlGS:0000000000000000
        [756289.603217] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [756289.603892] CR2: 00005652f32d0138 CR3: 000000025d616003 CR4: 0000000000370ef0
        [756289.604725] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [756289.605563] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [756289.606400] Call Trace:
        [756289.606704]  btree_csum_one_bio+0x244/0x2b0 [btrfs]
        [756289.607313]  btrfs_submit_metadata_bio+0xb7/0x100 [btrfs]
        [756289.608040]  submit_one_bio+0x61/0x70 [btrfs]
        [756289.608587]  btree_write_cache_pages+0x587/0x610 [btrfs]
        [756289.609258]  ? free_debug_processing+0x1d5/0x240
        [756289.609812]  ? __module_address+0x28/0xf0
        [756289.610298]  ? lock_acquire+0x1a0/0x3e0
        [756289.610754]  ? lock_acquired+0x19f/0x430
        [756289.611220]  ? lock_acquire+0x1a0/0x3e0
        [756289.611675]  do_writepages+0x43/0xf0
        [756289.612101]  ? __filemap_fdatawrite_range+0xa4/0x100
        [756289.612800]  __filemap_fdatawrite_range+0xc5/0x100
        [756289.613393]  btrfs_write_marked_extents+0x68/0x160 [btrfs]
        [756289.614085]  btrfs_sync_log+0x21c/0xf20 [btrfs]
        [756289.614661]  ? finish_wait+0x90/0x90
        [756289.615096]  ? __mutex_unlock_slowpath+0x45/0x2a0
        [756289.615661]  ? btrfs_log_inode_parent+0x3c9/0xdc0 [btrfs]
        [756289.616338]  ? lock_acquire+0x1a0/0x3e0
        [756289.616801]  ? lock_acquired+0x19f/0x430
        [756289.617284]  ? lock_acquire+0x1a0/0x3e0
        [756289.617750]  ? lock_release+0x214/0x470
        [756289.618221]  ? lock_acquired+0x19f/0x430
        [756289.618704]  ? dput+0x20/0x4a0
        [756289.619079]  ? dput+0x20/0x4a0
        [756289.619452]  ? lockref_put_or_lock+0x9/0x30
        [756289.619969]  ? lock_release+0x214/0x470
        [756289.620445]  ? lock_release+0x214/0x470
        [756289.620924]  ? lock_release+0x214/0x470
        [756289.621415]  btrfs_sync_file+0x46a/0x5b0 [btrfs]
        [756289.621982]  do_fsync+0x38/0x70
        [756289.622395]  __x64_sys_fsync+0x10/0x20
        [756289.622907]  do_syscall_64+0x33/0x80
        [756289.623438]  entry_SYSCALL_64_after_hwframe+0x44/0xae
        [756289.624063] RIP: 0033:0x7f08b27fbb7b
        [756289.624588] Code: 0f 05 48 3d 00 (...)
        [756289.626760] RSP: 002b:00007ffe2583f940 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
        [756289.627639] RAX: ffffffffffffffda RBX: 00005652f32cd0f0 RCX: 00007f08b27fbb7b
        [756289.628464] RDX: 00005652f32cbca0 RSI: 00005652f32cd110 RDI: 0000000000000003
        [756289.629323] RBP: 00005652f32cd110 R08: 0000000000000000 R09: 00007f08b28c4be0
        [756289.630172] R10: fffffffffffff39a R11: 0000000000000293 R12: 0000000000000001
        [756289.631007] R13: 00005652f32cd0f0 R14: 0000000000000001 R15: 00005652f32cc480
        [756289.631819] irq event stamp: 0
        [756289.632188] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        [756289.632911] hardirqs last disabled at (0): [<ffffffffa7e97c29>] copy_process+0x879/0x1cc0
        [756289.633893] softirqs last  enabled at (0): [<ffffffffa7e97c29>] copy_process+0x879/0x1cc0
        [756289.634871] softirqs last disabled at (0): [<0000000000000000>] 0x0
        [756289.635606] ---[ end trace 0a039fdc16ff3fef ]---
        [756289.636179] BTRFS: error (device sdc) in btrfs_sync_log:3136: errno=-5 IO failure
        [756289.637082] BTRFS info (device sdc): forced readonly
      
      Having checksum items covering ranges that overlap is dangerous as in some
      cases it can lead to having extent ranges for which we miss checksums
      after log replay or getting the wrong checksum item. There were some fixes
      in the past for bugs that resulted in this problem, and were explained and
      fixed by the following commits:
      
        27b9a812 ("Btrfs: fix csum tree corruption, duplicate and outdated checksums")
        b84b8390 ("Btrfs: fix file read corruption after extent cloning and fsync")
        40e046ac ("Btrfs: fix missing data checksums after replaying a log tree")
        e289f03e ("btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents")
      
      Fix the issue by making btrfs_csum_file_blocks() take into account the
      start offset of the next checksum item when it decides to extend an
      existing checksum item, so that it never extends the item to cover a
      range that goes beyond the start range of the next checksum item.
      
      When we cannot access the next checksum item without releasing the path,
      simply drop the optimization of extending the previous checksum item and
      fall back to inserting a new checksum item - this happens rarely, and for
      a log tree the optimization is not significant enough to justify the
      extra complexity, as it would only save a few bytes (the size of a
      struct btrfs_item) of leaf space.
      
      This behaviour is only needed when inserting into a log tree because
      for the regular checksums tree we never have a case where we try to
      insert a range of checksums that overlap with a range that was previously
      inserted.
      
      A test case for fstests will follow soon.
      
      Reported-by: Philipp Fent <fent@in.tum.de>
      Link: https://lore.kernel.org/linux-btrfs/93c4600e-5263-5cba-adf0-6f47526e7561@in.tum.de/
      
      
      CC: stable@vger.kernel.org # 5.4+
      Tested-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      btrfs: abort in rename_exchange if we fail to insert the second ref · dc09ef35
      Josef Bacik authored
      
      
      Error injection stress uncovered a problem where we'd leave a dangling
      inode ref if we failed during a rename_exchange.  This happens because
      we insert the inode ref for one side of the rename, and then for the
      other side.  If this second inode ref insert fails we'll leave the first
      one dangling, leaving a corrupt file system behind.  Fix this by
      aborting the transaction if we already did the insert for the first
      inode ref.
      
      CC: stable@vger.kernel.org # 4.9+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      btrfs: check error value from btrfs_update_inode in tree log · f96d4474
      Josef Bacik authored
      
      
      Error injection testing uncovered a case where we ended up with invalid
      link counts on an inode.  This happened because we failed to notice an
      error when updating the inode while replaying the tree log, and
      committed the transaction with an invalid file system.
      
      Fix this by checking the return value of btrfs_update_inode.  This
      resolved the link count errors I was seeing, and we already properly
      handle passing up the error values in these paths.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      btrfs: fixup error handling in fixup_inode_link_counts · 011b28ac
      Josef Bacik authored
      
      
      This function has the following pattern
      
      	while (1) {
      		ret = whatever();
      		if (ret)
      			goto out;
      	}
      	ret = 0
      out:
      	return ret;
      
      However, in several places in this while loop we simply break when
      there's a problem, thus clearing the return value, and in one case we
      return -EIO and leak the memory for the path.
      
      Fix this by re-arranging the loop to deal with ret == 1 coming from
      btrfs_search_slot, and then simply delete the
      
      	ret = 0;
      out:
      
      bit so everybody can break if there is an error, which will allow for
      proper error handling to occur.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      btrfs: mark ordered extent and inode with error if we fail to finish · d61bec08
      Josef Bacik authored
      
      
      While doing error injection testing I saw that sometimes we'd get an
      abort that wouldn't stop the current transaction commit from completing.
      This abort was coming from finish ordered IO, but at this point in the
      transaction commit we should have gotten an error and stopped.
      
      It turns out the abort came from finish ordered io while trying to write
      out the free space cache.  It occurred to me that any failure inside of
      finish_ordered_io isn't actually raised to the person doing the writing,
      so we could have any number of failures in this path and think the
      ordered extent completed successfully and the inode was fine.
      
      Fix this by marking the ordered extent with BTRFS_ORDERED_IOERR, and
      marking the mapping of the inode with mapping_set_error, so any callers
      that simply call fdatawait will also get the error.
      
      With this we're seeing the IO error on the free space inode when we fail
      to do the finish_ordered_io.
      
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      btrfs: return errors from btrfs_del_csums in cleanup_ref_head · 856bd270
      Josef Bacik authored
      
      
      We are unconditionally returning 0 in cleanup_ref_head, despite the fact
      that btrfs_del_csums could fail.  We need to return the error so the
      transaction gets aborted properly; fix this by returning ret from
      btrfs_del_csums in cleanup_ref_head.
      
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      btrfs: fix error handling in btrfs_del_csums · b86652be
      Josef Bacik authored
      
      
      Error injection stress would sometimes fail with checksums on disk that
      did not have a corresponding extent.  This occurred because the pattern
      in btrfs_del_csums was
      
      	while (1) {
      		ret = btrfs_search_slot();
      		if (ret < 0)
      			break;
      	}
      	ret = 0;
      out:
      	btrfs_free_path(path);
      	return ret;
      
      If we got an error from btrfs_search_slot we'd clear the error because
      we were breaking instead of goto out.  Instead of using goto out, simply
      handle the cases where we may leave a random value in ret, and get rid
      of the
      
      	ret = 0;
      out:
      
      pattern and allow break to carry the proper error reporting.  With
      this fix we properly abort the transaction and do not commit thinking we
      successfully deleted the csum.
      
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      btrfs: fix compressed writes that cross stripe boundary · 4c80a97d
      Qu Wenruo authored
      
      
      [BUG]
      When running btrfs/027 with "-o compress" mount option, it always
      crashes with the following call trace:
      
        BTRFS critical (device dm-4): mapping failed logical 298901504 bio len 12288 len 8192
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/volumes.c:6651!
        invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
        CPU: 5 PID: 31089 Comm: kworker/u24:10 Tainted: G           OE     5.13.0-rc2-custom+ #26
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        Workqueue: btrfs-delalloc btrfs_work_helper [btrfs]
        RIP: 0010:btrfs_map_bio.cold+0x58/0x5a [btrfs]
        Call Trace:
         btrfs_submit_compressed_write+0x2d7/0x470 [btrfs]
         submit_compressed_extents+0x3b0/0x470 [btrfs]
         ? mark_held_locks+0x49/0x70
         btrfs_work_helper+0x131/0x3e0 [btrfs]
         process_one_work+0x28f/0x5d0
         worker_thread+0x55/0x3c0
         ? process_one_work+0x5d0/0x5d0
         kthread+0x141/0x160
         ? __kthread_bind_mask+0x60/0x60
         ret_from_fork+0x22/0x30
        ---[ end trace 63113a3a91f34e68 ]---
      
      [CAUSE]
      The critical message before the crash means we have a bio at logical
      bytenr 298901504 with length 12288, but only 8192 bytes fit into one
      stripe; the remaining 4096 bytes go to another stripe.
      
      In btrfs, all bios are properly split to avoid crossing a stripe
      boundary, but commit 764c7c9a ("btrfs: zoned: fix parallel compressed
      writes") changed the behavior for compressed writes.
      
      Previously, if we found that our new page can't fit into the current
      stripe, i.e. the "submit == 1" case, we submitted the current bio
      without adding the current page.
      
	submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio, 0);

	page->mapping = NULL;
	if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) <
	    PAGE_SIZE) {
      
      But after the modification, we will add the page regardless of whether
      it crosses the stripe boundary, leading to the above crash.
      
	submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio, 0);

	if (pg_index == 0 && use_append)
		len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0);
	else
		len = bio_add_page(bio, page, PAGE_SIZE, 0);

	page->mapping = NULL;
	if (submit || len < PAGE_SIZE) {
      
      [FIX]
      It's no longer possible to revert to the original code style as we have
      two different bio_add_*_page() calls now.
      
      The new fix is to skip the bio_add_*_page() call if @submit is true.
      
      Also, to avoid @len being used uninitialized, always initialize it to zero.
      
      If @submit is true, @len will not be checked.
      If @submit is not true, @len will be the return value of
      bio_add_*_page() call.
      Either way, the behavior is still the same as the old code.
      
      Reported-by: Josef Bacik <josef@toxicpanda.com>
      Fixes: 764c7c9a ("btrfs: zoned: fix parallel compressed writes")
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cifs: change format of CIFS_FULL_KEY_DUMP ioctl · 1bb56810
      Aurelien Aptel authored
      
      
      Make CIFS_FULL_KEY_DUMP ioctl able to return variable-length keys.
      
      * userspace needs to pass the struct size along with optional
        session_id and some space at the end to store keys
      * if there is enough space kernel returns keys in the extra space and
        sets the length of each key via xyz_key_length fields
      
      This also fixes the build error for get_user() on ARM.
      
      Sample program:
      
      	#include <stdlib.h>
      	#include <stdio.h>
      	#include <stdint.h>
      	#include <sys/fcntl.h>
      	#include <sys/ioctl.h>
      
      	struct smb3_full_key_debug_info {
      	        uint32_t   in_size;
      	        uint64_t   session_id;
      	        uint16_t   cipher_type;
      	        uint8_t    session_key_length;
      	        uint8_t    server_in_key_length;
      	        uint8_t    server_out_key_length;
      	        uint8_t    data[];
      	        /*
      	         * return this struct with the keys appended at the end:
      	         * uint8_t session_key[session_key_length];
      	         * uint8_t server_in_key[server_in_key_length];
      	         * uint8_t server_out_key[server_out_key_length];
      	         */
      	} __attribute__((packed));
      
      	#define CIFS_IOCTL_MAGIC 0xCF
      	#define CIFS_DUMP_FULL_KEY _IOWR(CIFS_IOCTL_MAGIC, 10, struct smb3_full_key_debug_info)
      
      	void dump(const void *p, size_t len) {
      	        const char *hex = "0123456789ABCDEF";
      	        const uint8_t *b = p;
      	        for (int i = 0; i < len; i++)
      	                printf("%c%c ", hex[(b[i]>>4)&0xf], hex[b[i]&0xf]);
      	        putchar('\n');
      	}
      
      	int main(int argc, char **argv)
      	{
      	        struct smb3_full_key_debug_info *keys;
      	        uint8_t buf[sizeof(*keys)+1024] = {0};
      	        size_t off = 0;
      	        int fd, rc;
      
      	        keys = (struct smb3_full_key_debug_info *)&buf;
      	        keys->in_size = sizeof(buf);
      
      	        fd = open(argv[1], O_RDONLY);
      	        if (fd < 0)
      	                perror("open"), exit(1);
      
      	        rc = ioctl(fd, CIFS_DUMP_FULL_KEY, keys);
      	        if (rc < 0)
      	                perror("ioctl"), exit(1);
      
      	        printf("SessionId      ");
      	        dump(&keys->session_id, 8);
      	        printf("Cipher         %04x\n", keys->cipher_type);
      
      	        printf("SessionKey     ");
      	        dump(keys->data+off, keys->session_key_length);
      	        off += keys->session_key_length;
      
      	        printf("ServerIn Key   ");
      	        dump(keys->data+off, keys->server_in_key_length);
      	        off += keys->server_in_key_length;
      
      	        printf("ServerOut Key  ");
      	        dump(keys->data+off, keys->server_out_key_length);
      
      	        return 0;
      	}
      
      Usage:
      
      	$ gcc -o dumpkeys dumpkeys.c
      
      Against Windows Server 2020 preview (with AES-256-GCM support):
      
      	# mount.cifs //$ip/test /mnt -o "username=administrator,password=foo,vers=3.0,seal"
      	# ./dumpkeys /mnt/somefile
      	SessionId      0D 00 00 00 00 0C 00 00
      	Cipher         0002
      	SessionKey     AB CD CC 0D E4 15 05 0C 6F 3C 92 90 19 F3 0D 25
      	ServerIn Key   73 C6 6A C8 6B 08 CF A2 CB 8E A5 7D 10 D1 5B DC
      	ServerOut Key  6D 7E 2B A1 71 9D D7 2B 94 7B BA C4 F0 A5 A4 F8
      	# umount /mnt
      
      	With 256 bit keys:
      
      	# echo 1 > /sys/module/cifs/parameters/require_gcm_256
      	# mount.cifs //$ip/test /mnt -o "username=administrator,password=foo,vers=3.11,seal"
      	# ./dumpkeys /mnt/somefile
      	SessionId      09 00 00 00 00 0C 00 00
      	Cipher         0004
      	SessionKey     93 F5 82 3B 2F B7 2A 50 0B B9 BA 26 FB 8C 8B 03
      	ServerIn Key   6C 6A 89 B2 CB 7B 78 E8 04 93 37 DA 22 53 47 DF B3 2C 5F 02 26 70 43 DB 8D 33 7B DC 66 D3 75 A9
      	ServerOut Key  04 11 AA D7 52 C7 A8 0F ED E3 93 3A 65 FE 03 AD 3F 63 03 01 2B C0 1B D7 D7 E5 52 19 7F CC 46 B4
      
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
      1bb56810
• cifs: fix string declarations and assignments in tracepoints · eb068818
      Shyam Prasad N authored
      
      
We missed using the variable-length string macros in several
tracepoints. Fix them in this change.

There are probably more useful macros that we can use to print
others like flags etc. But I'll submit separate patches for
those at a future date.
      
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Cc: <stable@vger.kernel.org> # v5.12
Signed-off-by: Steve French <stfrench@microsoft.com>
      eb068818
• cifs: set server->cipher_type to AES-128-CCM for SMB3.0 · 6d2fcfe6
      Aurelien Aptel authored
      
      
      SMB3.0 doesn't have encryption negotiate context but simply uses
      the SMB2_GLOBAL_CAP_ENCRYPTION flag.
      
      When that flag is present in the neg response cifs.ko uses AES-128-CCM
      which is the only cipher available in this context.
      
cipher_type was set to the server cipher only when parsing the
encryption negotiate context (SMB3.1.1).

For SMB3.0 it was left at 0, which means the cipher_type value could
be either 0 or 1 for AES-128-CCM.
      
      Fix this by checking for SMB3.0 and encryption capability and setting
      cipher_type appropriately.
      
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
      6d2fcfe6
• afs: Fix the nlink handling of dir-over-dir rename · f610a5a2
      David Howells authored
      
      
      Fix rename of one directory over another such that the nlink on the deleted
      directory is cleared to 0 rather than being decremented to 1.
      
      This was causing the generic/035 xfstest to fail.
      
      Fixes: e49c7b2f ("afs: Build an abstraction around an "operation" concept")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
      cc: linux-afs@lists.infradead.org
Link: https://lore.kernel.org/r/162194384460.3999479.7605572278074191079.stgit@warthog.procyon.org.uk/ # v1
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f610a5a2
• xfs: bunmapi has unnecessary AG lock ordering issues · 0fe0bbe0
      Dave Chinner authored
Large directory block size operations are assert-failing because
xfs_bunmapi() is not completely removing fragmented directory blocks,
like so:
      
      XFS: Assertion failed: done, file: fs/xfs/libxfs/xfs_dir2.c, line: 677
      ....
      Call Trace:
       xfs_dir2_shrink_inode+0x1a8/0x210
       xfs_dir2_block_to_sf+0x2ae/0x410
       xfs_dir2_block_removename+0x21a/0x280
       xfs_dir_removename+0x195/0x1d0
       xfs_rename+0xb79/0xc50
       ? avc_has_perm+0x8d/0x1a0
       ? avc_has_perm_noaudit+0x9a/0x120
       xfs_vn_rename+0xdb/0x150
       vfs_rename+0x719/0xb50
       ? __lookup_hash+0x6a/0xa0
       do_renameat2+0x413/0x5e0
       __x64_sys_rename+0x45/0x50
       do_syscall_64+0x3a/0x70
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      We are aborting the bunmapi() pass because of this specific chunk of
      code:
      
                      /*
                       * Make sure we don't touch multiple AGF headers out of order
                       * in a single transaction, as that could cause AB-BA deadlocks.
                       */
                      if (!wasdel && !isrt) {
                              agno = XFS_FSB_TO_AGNO(mp, del.br_startblock);
                              if (prev_agno != NULLAGNUMBER && prev_agno > agno)
                                      break;
                              prev_agno = agno;
                      }
      
      This is designed to prevent deadlocks in AGF locking when freeing
      multiple extents by ensuring that we only ever lock in increasing
      AG number order. Unfortunately, this also violates the "bunmapi will
      always succeed" semantic that some high level callers depend on,
      such as xfs_dir2_shrink_inode(), xfs_da_shrink_inode() and
      xfs_inactive_symlink_rmt().
      
      This AG lock ordering was introduced back in 2017 to fix deadlocks
      triggered by generic/299 as reported here:
      
      https://lore.kernel.org/linux-xfs/800468eb-3ded-9166-20a4-047de8018582@gmail.com/
      
      
      
This code is old enough that it predates the deferral of all AG-based
extent freeing from within xfs_bunmapi(). That is, we never actually
lock AGs in xfs_bunmapi() any more - every non-rt based extent free is
added to the defer ops list, as is all BMBT block freeing. And RT
extents are not AG based, so there are no lock ordering issues
associated with them.
      
      Hence this AGF lock ordering code is both broken and dead. Let's
      just remove it so that the large directory block code works reliably
      again.
      
      Tested against xfs/538 and generic/299 which is the original test
      that exposed the deadlocks that this code fixed.
      
      Fixes: 5b094d6d ("xfs: fix multi-AG deadlock in xfs_bunmapi")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      0fe0bbe0
• xfs: btree format inode forks can have zero extents · 991c2c59
      Dave Chinner authored
      
      
      xfs/538 is assert failing with this trace when testing with
      directory block sizes of 64kB:
      
      XFS: Assertion failed: !xfs_need_iread_extents(ifp), file: fs/xfs/libxfs/xfs_bmap.c, line: 608
      ....
      Call Trace:
       xfs_bmap_btree_to_extents+0x2a9/0x470
       ? kmem_cache_alloc+0xe7/0x220
       __xfs_bunmapi+0x4ca/0xdf0
       xfs_bunmapi+0x1a/0x30
       xfs_dir2_shrink_inode+0x71/0x210
       xfs_dir2_block_to_sf+0x2ae/0x410
       xfs_dir2_block_removename+0x21a/0x280
       xfs_dir_removename+0x195/0x1d0
       xfs_remove+0x244/0x460
       xfs_vn_unlink+0x53/0xa0
       ? selinux_inode_unlink+0x13/0x20
       vfs_unlink+0x117/0x220
       do_unlinkat+0x1a2/0x2d0
       __x64_sys_unlink+0x42/0x60
       do_syscall_64+0x3a/0x70
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      This is a check to ensure that the extents have been read into
      memory before we are doing a ifork btree manipulation. This assert
      is bogus in the above case.
      
      We have a fragmented directory block that has more extents in it
      than can fit in extent format, so the inode data fork is in btree
      format. xfs_dir2_shrink_inode() asks to remove all remaining 16
      filesystem blocks from the inode so it can convert to short form,
      and __xfs_bunmapi() removes all the extents. We now have a data fork
      in btree format but have zero extents in the fork. This incorrectly
      trips the xfs_need_iread_extents() assert because it assumes that an
      empty extent btree means the extent tree has not been read into
      memory yet. This is clearly not the case with xfs_bunmapi(), as it
      has an explicit call to xfs_iread_extents() in it to pull the
      extents into memory before it starts unmapping.
      
      Also, the assert directly after this bogus one is:
      
      	ASSERT(ifp->if_format == XFS_DINODE_FMT_BTREE);
      
      Which covers the context in which it is legal to call
      xfs_bmap_btree_to_extents just fine. Hence we should just remove the
      bogus assert as it is clearly wrong and causes a regression.
      
This returns the test behaviour to the pre-existing assert failure in
xfs_dir2_shrink_inode() that indicates xfs_bunmapi() has failed to
remove all the extents in the range it was asked to unmap.
      
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      991c2c59
• io_uring: fix data race to avoid potential NULL-deref · b16ef427
      Marco Elver authored
      
      
      Commit ba5ef6dc ("io_uring: fortify tctx/io_wq cleanup") introduced
      setting tctx->io_wq to NULL a bit earlier. This has caused KCSAN to
      detect a data race between accesses to tctx->io_wq:
      
        write to 0xffff88811d8df330 of 8 bytes by task 3709 on cpu 1:
         io_uring_clean_tctx                  fs/io_uring.c:9042 [inline]
         __io_uring_cancel                    fs/io_uring.c:9136
         io_uring_files_cancel                include/linux/io_uring.h:16 [inline]
         do_exit                              kernel/exit.c:781
         do_group_exit                        kernel/exit.c:923
         get_signal                           kernel/signal.c:2835
         arch_do_signal_or_restart            arch/x86/kernel/signal.c:789
         handle_signal_work                   kernel/entry/common.c:147 [inline]
         exit_to_user_mode_loop               kernel/entry/common.c:171 [inline]
         ...
        read to 0xffff88811d8df330 of 8 bytes by task 6412 on cpu 0:
         io_uring_try_cancel_iowq             fs/io_uring.c:8911 [inline]
         io_uring_try_cancel_requests         fs/io_uring.c:8933
         io_ring_exit_work                    fs/io_uring.c:8736
         process_one_work                     kernel/workqueue.c:2276
         ...
      
      With the config used, KCSAN only reports data races with value changes:
      this implies that in the case here we also know that tctx->io_wq was
      non-NULL. Therefore, depending on interleaving, we may end up with:
      
                    [CPU 0]                 |        [CPU 1]
        io_uring_try_cancel_iowq()          | io_uring_clean_tctx()
          if (!tctx->io_wq) // false        |   ...
          ...                               |   tctx->io_wq = NULL
          io_wq_cancel_cb(tctx->io_wq, ...) |   ...
            -> NULL-deref                   |
      
      Note: It is likely that thus far we've gotten lucky and the compiler
      optimizes the double-read into a single read into a register -- but this
      is never guaranteed, and can easily change with a different config!
      
      Fix the data race by restoring the previous behaviour, where both
      setting io_wq to NULL and put of the wq are _serialized_ after
      concurrent io_uring_try_cancel_iowq() via acquisition of the uring_lock
      and removal of the node in io_uring_del_task_file().
      
      Fixes: ba5ef6dc ("io_uring: fortify tctx/io_wq cleanup")
Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Reported-by: <syzbot+bf2b3d0435b9b728946c@syzkaller.appspotmail.com>
Signed-off-by: Marco Elver <elver@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20210527092547.2656514-1-elver@google.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b16ef427
• nfs: Remove trailing semicolon in macros · a799b68a
      Huilong Deng authored
      
      
      Macros should not use a trailing semicolon.
      
Signed-off-by: Huilong Deng <denghuilong@cdjrlc.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      a799b68a
• NFSv4: Fix v4.0/v4.1 SEEK_DATA return -ENOTSUPP when set NFS_V4_2 config · e67afa7e
      Zhang Xiaoxu authored
      
      
      Since commit bdcc2cd1 ("NFSv4.2: handle NFS-specific llseek errors"),
      nfs42_proc_llseek would return -EOPNOTSUPP rather than -ENOTSUPP when
      SEEK_DATA on NFSv4.0/v4.1.
      
This leads xfstests generic/285 to not run on NFSv4.0/v4.1 when
CONFIG_NFS_V4_2 is set, rather than to run and fail.
      
      Fixes: bdcc2cd1 ("NFSv4.2: handle NFS-specific llseek errors")
Cc: <stable@vger.kernel.org> # 4.2
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      e67afa7e
  9. May 26, 2021
• io-wq: Fix UAF when wakeup wqe in hash waitqueue · 3743c172
      Zqiang authored
      
      
      BUG: KASAN: use-after-free in __wake_up_common+0x637/0x650
      Read of size 8 at addr ffff8880304250d8 by task iou-wrk-28796/28802
      
      Call Trace:
       __dump_stack [inline]
       dump_stack+0x141/0x1d7
       print_address_description.constprop.0.cold+0x5b/0x2c6
       __kasan_report [inline]
       kasan_report.cold+0x7c/0xd8
       __wake_up_common+0x637/0x650
       __wake_up_common_lock+0xd0/0x130
       io_worker_handle_work+0x9dd/0x1790
       io_wqe_worker+0xb2a/0xd40
       ret_from_fork+0x1f/0x30
      
      Allocated by task 28798:
       kzalloc_node [inline]
       io_wq_create+0x3c4/0xdd0
       io_init_wq_offload [inline]
       io_uring_alloc_task_context+0x1bf/0x6b0
       __io_uring_add_task_file+0x29a/0x3c0
       io_uring_add_task_file [inline]
       io_uring_install_fd [inline]
       io_uring_create [inline]
       io_uring_setup+0x209a/0x2bd0
       do_syscall_64+0x3a/0xb0
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Freed by task 28798:
       kfree+0x106/0x2c0
       io_wq_destroy+0x182/0x380
       io_wq_put [inline]
       io_wq_put_and_exit+0x7a/0xa0
       io_uring_clean_tctx [inline]
       __io_uring_cancel+0x428/0x530
       io_uring_files_cancel
       do_exit+0x299/0x2a60
       do_group_exit+0x125/0x310
       get_signal+0x47f/0x2150
       arch_do_signal_or_restart+0x2a8/0x1eb0
       handle_signal_work[inline]
       exit_to_user_mode_loop [inline]
       exit_to_user_mode_prepare+0x171/0x280
       __syscall_exit_to_user_mode_work [inline]
       syscall_exit_to_user_mode+0x19/0x60
       do_syscall_64+0x47/0xb0
       entry_SYSCALL_64_after_hwframe
      
The following scenario can occur when the hash waitqueue is shared by
io-wq1 and io-wq2 (note: wqe is worker):
      
      io-wq1:worker2     | locks bit1
      io-wq2:worker1     | waits bit1
      io-wq1:worker3     | waits bit1
      
      io-wq1:worker2     | completes all wqe bit1 work items
      io-wq1:worker2     | drop bit1, exit
      
      io-wq2:worker1     | locks bit1
io-wq1:worker3     | cannot lock bit1, waits on bit1 and exits
io-wq1             | exits and frees io-wq1
io-wq2:worker1     | drops bit1
io-wq1:worker3     | is woken up, even though its wqe is freed
      
After all iou-wrk threads belonging to io-wq1 have exited, remove the
wqe from the hash waitqueue; this guarantees that there will be no more
wqes belonging to io-wq1 in the hash waitqueue.
      
Reported-by: <syzbot+6cb11ade52aa17095297@syzkaller.appspotmail.com>
Signed-off-by: Zqiang <qiang.zhang@windriver.com>
Link: https://lore.kernel.org/r/20210526050826.30500-1-qiang.zhang@windriver.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3743c172
• NFS: Clean up reset of the mirror accounting variables · 70536bf4
      Trond Myklebust authored
      
      
      Now that nfs_pageio_do_add_request() resets the pg_count, we don't need
      these other inlined resets.
      
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      70536bf4
• NFS: Don't corrupt the value of pg_bytes_written in nfs_do_recoalesce() · 0d0ea309
      Trond Myklebust authored
      
      
      The value of mirror->pg_bytes_written should only be updated after a
      successful attempt to flush out the requests on the list.
      
      Fixes: a7d42ddb ("nfs: add mirroring support to pgio layer")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      0d0ea309
• NFS: Fix an Oopsable condition in __nfs_pageio_add_request() · 56517ab9
      Trond Myklebust authored
      
      
      Ensure that nfs_pageio_error_cleanup() resets the mirror array contents,
      so that the structure reflects the fact that it is now empty.
      Also change the test in nfs_pageio_do_add_request() to be more robust by
      checking whether or not the list is empty rather than relying on the
      value of pg_count.
      
      Fixes: a7d42ddb ("nfs: add mirroring support to pgio layer")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      56517ab9
• io_uring/io-wq: close io-wq full-stop gap · 17a91051
      Pavel Begunkov authored
      
      
There is an old problem with io-wq cancellation where requests that
should be killed are in io-wq but are not discoverable, e.g. in the
@next_hashed or @linked vars of io_worker_handle_work(). It adds some
unreliability to individual request cancellation, but may also
potentially get __io_uring_cancel() stuck. For instance:

1) An __io_uring_cancel() cancellation round has not found any
   requests, but there are some as described.
      2) __io_uring_cancel() goes to sleep
      3) Then workers wake up and try to execute those hidden requests
         that happen to be unbound.
      
As we already cancel all requests of the io-wq there, set IO_WQ_BIT_EXIT
in advance, thus preventing 3) from executing unbound requests. The
workers will initially break out of their loops on receiving a signal,
as they are threads of the dying/exec()'ing user task.
      
      Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/abfcf8c54cb9e8f7bfbad7e9a0cc5433cc70bdc2.1621781238.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
      17a91051
  10. May 25, 2021