  1. Jul 13, 2020
  2. Jul 12, 2020
  3. Jul 10, 2020
    • io_uring: account user memory freed when exit has been queued · 309fc03a
      Jens Axboe authored
      
      
      We currently account the memory after the exit work has been run, but
      that leaves a gap between the process closing its ring and the memory
      being accounted as freed. If the memlocked ulimit is borderline, that
      can introduce spurious setup errors returning -ENOMEM because the free
      work hasn't been run yet.
      
      Account this as freed when we close the ring, so as not to expose a tiny
      gap where setting up a new ring can fail.
      
      Fixes: 85faa7b8 ("io_uring: punt final io_ring_ctx wait-and-free to workqueue")
      Cc: stable@vger.kernel.org # v5.7
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      309fc03a
    • io_uring: fix memleak in io_sqe_files_register() · 667e57da
      Yang Yingliang authored
      
      
      I got a memleak report while fuzz testing:
      
      BUG: memory leak
      unreferenced object 0x607eeac06e78 (size 8):
        comm "test", pid 295, jiffies 4294735835 (age 31.745s)
        hex dump (first 8 bytes):
          00 00 00 00 00 00 00 00                          ........
        backtrace:
          [<00000000932632e6>] percpu_ref_init+0x2a/0x1b0
          [<0000000092ddb796>] __io_uring_register+0x111d/0x22a0
          [<00000000eadd6c77>] __x64_sys_io_uring_register+0x17b/0x480
          [<00000000591b89a6>] do_syscall_64+0x56/0xa0
          [<00000000864a281d>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Call percpu_ref_exit() on error path to avoid
      refcount memleak.
      
      Fixes: 05f3fb3c ("io_uring: avoid ring quiesce for fixed file set unregister and update")
      Cc: stable@vger.kernel.org
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      667e57da
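The error-path rule in the commit above (undo the init before returning an error) can be illustrated with a small userspace sketch. Everything here is hypothetical: `toy_ref`, `toy_ref_init()`, `toy_ref_exit()` and `toy_register()` are stand-ins invented for illustration and are not the kernel's percpu_ref API.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a refcount object: init allocates, exit frees. */
struct toy_ref {
	long *counter;
};

/* Number of counters currently allocated; lets a caller detect leaks. */
static int live_allocations;

static int toy_ref_init(struct toy_ref *ref)
{
	ref->counter = calloc(1, sizeof(*ref->counter));
	if (!ref->counter)
		return -ENOMEM;
	live_allocations++;
	return 0;
}

static void toy_ref_exit(struct toy_ref *ref)
{
	free(ref->counter);
	ref->counter = NULL;
	live_allocations--;
}

/* fail_later != 0 simulates a failure that happens after the init. */
static int toy_register(struct toy_ref *ref, int fail_later)
{
	int ret = toy_ref_init(ref);

	if (ret)
		return ret;

	if (fail_later) {
		/* Without this call the counter allocated above would leak. */
		toy_ref_exit(ref);
		return -ENOMEM;
	}
	return 0;
}
```

After a failed `toy_register()` call, `live_allocations` is back to its previous value, which is the property the fix restores for the real error path.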
  4. Jul 09, 2020
    • btrfs: wire up iter_file_splice_write · d7776591
      Christoph Hellwig authored
      
      
      btrfs implements the ->write_iter op and thus can use the more efficient
      iov_iter based splice implementation.  For now falling back to the less
      efficient default is pretty harmless, but I have a pending series that
      removes the default, which would leave btrfs without splice support
      at all.
      
      Reported-by: Andy Lavr <andy.lavr@gmail.com>
      Tested-by: Andy Lavr <andy.lavr@gmail.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d7776591
    • btrfs: fix double put of block group with nocow · 230ed397
      Josef Bacik authored
      
      
      While debugging a patch that I wrote I was hitting use-after-free panics
      when accessing block groups on unmount.  This turned out to be because
      in the nocow case, if we bail out of doing the nocow for whatever reason,
      we need to call btrfs_dec_nocow_writers() if we called the inc.  This
      puts our block group, but a few error cases do
      
      if (nocow) {
          btrfs_dec_nocow_writers();
          goto error;
      }
      
      unfortunately, error is
      
      error:
      	if (nocow)
      		btrfs_dec_nocow_writers();
      
      so we get a double put on our block group.  Fix this by dropping the
      btrfs_dec_nocow_writers() calls in the error cases, as that is handled
      at the error label now.
      
      Fixes: 762bf098 ("btrfs: improve error handling in run_delalloc_nocow")
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      230ed397
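The single-owner cleanup described in the commit above can be sketched in userspace as follows. The names (`toy_bg`, `toy_inc_nocow_writers()`, `toy_dec_nocow_writers()`, `toy_run_nocow()`) are hypothetical and only mimic the shape of the fix: the error branch jumps to the label without dropping the count itself.

```c
#include <assert.h>

/* Hypothetical block group carrying only a writer count. */
struct toy_bg {
	int nocow_writers;
};

static void toy_inc_nocow_writers(struct toy_bg *bg)
{
	bg->nocow_writers++;
}

static void toy_dec_nocow_writers(struct toy_bg *bg)
{
	assert(bg->nocow_writers > 0);	/* would trip on a double put */
	bg->nocow_writers--;
}

/* fail != 0 simulates bailing out after the inc has been done. */
static int toy_run_nocow(struct toy_bg *bg, int fail)
{
	int nocow = 0;
	int ret = 0;

	toy_inc_nocow_writers(bg);
	nocow = 1;

	if (fail) {
		ret = -1;
		goto error;	/* no dec here: the error label owns it */
	}

	/* Success path: drop the count once the write has been set up. */
	toy_dec_nocow_writers(bg);
	nocow = 0;
	return 0;

error:
	if (nocow)
		toy_dec_nocow_writers(bg);
	return ret;
}
```

With the dec kept only at the label, the count returns to zero on both the success and the failure path, and the assertion in `toy_dec_nocow_writers()` never fires.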
    • cifs: update internal module version number · a8dab63e
      Steve French authored
      
      
              To 2.28
      
      Signed-off-by: Steve French <stfrench@microsoft.com>
      a8dab63e
    • cifs: fix reference leak for tlink · a77592a7
      Ronnie Sahlberg authored
      
      
      Don't leak a reference to tlink during the NOTIFY ioctl
      
      Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
      Reviewed-by: Aurelien Aptel <aaptel@suse.com>
      CC: Stable <stable@vger.kernel.org> # v5.6+
      a77592a7
    • io_uring: fix memleak in __io_sqe_files_update() · f3bd9dae
      Yang Yingliang authored
      
      
      I got a memleak report while fuzz testing:
      
      BUG: memory leak
      unreferenced object 0xffff888113e02300 (size 488):
      comm "syz-executor401", pid 356, jiffies 4294809529 (age 11.954s)
      hex dump (first 32 bytes):
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
      a0 a4 ce 19 81 88 ff ff 60 ce 09 0d 81 88 ff ff ........`.......
      backtrace:
      [<00000000129a84ec>] kmem_cache_zalloc include/linux/slab.h:659 [inline]
      [<00000000129a84ec>] __alloc_file+0x25/0x310 fs/file_table.c:101
      [<000000003050ad84>] alloc_empty_file+0x4f/0x120 fs/file_table.c:151
      [<000000004d0a41a3>] alloc_file+0x5e/0x550 fs/file_table.c:193
      [<000000002cb242f0>] alloc_file_pseudo+0x16a/0x240 fs/file_table.c:233
      [<00000000046a4baa>] anon_inode_getfile fs/anon_inodes.c:91 [inline]
      [<00000000046a4baa>] anon_inode_getfile+0xac/0x1c0 fs/anon_inodes.c:74
      [<0000000035beb745>] __do_sys_perf_event_open+0xd4a/0x2680 kernel/events/core.c:11720
      [<0000000049009dc7>] do_syscall_64+0x56/0xa0 arch/x86/entry/common.c:359
      [<00000000353731ca>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      BUG: memory leak
      unreferenced object 0xffff8881152dd5e0 (size 16):
      comm "syz-executor401", pid 356, jiffies 4294809529 (age 11.954s)
      hex dump (first 16 bytes):
      01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................
      backtrace:
      [<0000000074caa794>] kmem_cache_zalloc include/linux/slab.h:659 [inline]
      [<0000000074caa794>] lsm_file_alloc security/security.c:567 [inline]
      [<0000000074caa794>] security_file_alloc+0x32/0x160 security/security.c:1440
      [<00000000c6745ea3>] __alloc_file+0xba/0x310 fs/file_table.c:106
      [<000000003050ad84>] alloc_empty_file+0x4f/0x120 fs/file_table.c:151
      [<000000004d0a41a3>] alloc_file+0x5e/0x550 fs/file_table.c:193
      [<000000002cb242f0>] alloc_file_pseudo+0x16a/0x240 fs/file_table.c:233
      [<00000000046a4baa>] anon_inode_getfile fs/anon_inodes.c:91 [inline]
      [<00000000046a4baa>] anon_inode_getfile+0xac/0x1c0 fs/anon_inodes.c:74
      [<0000000035beb745>] __do_sys_perf_event_open+0xd4a/0x2680 kernel/events/core.c:11720
      [<0000000049009dc7>] do_syscall_64+0x56/0xa0 arch/x86/entry/common.c:359
      [<00000000353731ca>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      If io_sqe_file_register() fails, we need to put the file obtained by
      fget() to avoid the memleak.
      
      Fixes: c3a31e60 ("io_uring: add support for IORING_REGISTER_FILES_UPDATE")
      Cc: stable@vger.kernel.org
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f3bd9dae
    • io_uring: export cq overflow status to userspace · 6d5f9049
      Xiaoguang Wang authored
      
      
      Applications that do not want to call io_uring_enter() to reap and
      handle cqes may rely entirely on liburing's io_uring_peek_cqe(). If the
      cq ring has overflowed, however, io_uring_peek_cqe() is currently not
      aware of the overflow and won't enter the kernel to flush cqes. The
      test program below reveals this bug:
      
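      #include <assert.h>
      #include <errno.h>
      #include <stdio.h>
      #include <string.h>
      #include <liburing.h>
      
      static void test_cq_overflow(struct io_uring *ring)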
      {
              struct io_uring_cqe *cqe;
              struct io_uring_sqe *sqe;
              int issued = 0;
              int ret = 0;
      
              do {
                      sqe = io_uring_get_sqe(ring);
                      if (!sqe) {
                              fprintf(stderr, "get sqe failed\n");
                              break;
                      }
                      ret = io_uring_submit(ring);
                      if (ret <= 0) {
                              if (ret != -EBUSY)
                                      fprintf(stderr, "sqe submit failed: %d\n", ret);
                              break;
                      }
                      issued++;
              } while (ret > 0);
              assert(ret == -EBUSY);
      
              printf("issued requests: %d\n", issued);
      
              while (issued) {
                      ret = io_uring_peek_cqe(ring, &cqe);
                      if (ret) {
                              if (ret != -EAGAIN) {
                                      fprintf(stderr, "peek completion failed: %s\n",
                                              strerror(-ret));
                                      break;
                              }
                              printf("left requests: %d\n", issued);
                              continue;
                      }
                      io_uring_cqe_seen(ring, cqe);
                      issued--;
                      printf("left requests: %d\n", issued);
              }
      }
      
      int main(int argc, char *argv[])
      {
              int ret;
              struct io_uring ring;
      
              ret = io_uring_queue_init(16, &ring, 0);
              if (ret) {
                      fprintf(stderr, "ring setup failed: %d\n", ret);
                      return 1;
              }
      
              test_cq_overflow(&ring);
              return 0;
      }
      
      To fix this issue, export cq overflow status to userspace by adding a new
      IORING_SQ_CQ_OVERFLOW flag, so that helper functions in liburing, such as
      io_uring_peek_cqe(), can be aware of this cq overflow and flush accordingly.
      
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      6d5f9049
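How a userspace helper might use the exported overflow flag can be sketched with a toy ring model. This is not liburing code: `toy_ring`, `toy_flush_overflow()` and `toy_peek_cqe()` are hypothetical, and `TOY_SQ_CQ_OVERFLOW` is an illustrative bit standing in for IORING_SQ_CQ_OVERFLOW, whose real definition lives in the io_uring UAPI header.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit only; stands in for IORING_SQ_CQ_OVERFLOW. */
#define TOY_SQ_CQ_OVERFLOW (1U << 1)

/* Hypothetical, heavily simplified view of a ring as seen from userspace. */
struct toy_ring {
	unsigned sq_flags;	/* flags word shared by the kernel */
	unsigned cq_head;	/* consumer index */
	unsigned cq_tail;	/* producer index */
	unsigned overflowed;	/* completions parked outside the cq ring */
};

/* Stand-in for entering the kernel so it can flush overflowed cqes. */
static void toy_flush_overflow(struct toy_ring *ring)
{
	ring->cq_tail += ring->overflowed;
	ring->overflowed = 0;
	ring->sq_flags &= ~TOY_SQ_CQ_OVERFLOW;
}

/* Returns true if at least one completion is available to the caller. */
static bool toy_peek_cqe(struct toy_ring *ring)
{
	/* Ring looks empty, but the flag says completions are pending. */
	if (ring->cq_head == ring->cq_tail &&
	    (ring->sq_flags & TOY_SQ_CQ_OVERFLOW))
		toy_flush_overflow(ring);

	return ring->cq_head != ring->cq_tail;
}
```

Without the flag check, the peek on an overflowed but empty-looking ring would keep reporting that nothing is available, which is the hang the test program in the commit message demonstrates.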
  5. Jul 08, 2020
  6. Jul 07, 2020
    • smb3: fix access denied on change notify request to some servers · 4ef9b4f1
      Steve French authored
      
      
      Read permission, not just read attributes permission, is required
      on the directory.
      
      See MS-SMB2 (protocol specification) section 3.3.5.19.
      
      Signed-off-by: Steve French <stfrench@microsoft.com>
      CC: Stable <stable@vger.kernel.org> # v5.6+
      Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
      4ef9b4f1
    • gfs2: Rework read and page fault locking · 20f82999
      Andreas Gruenbacher authored
      
      
      So far, gfs2 has taken the inode glocks inside the ->readpage and
      ->readahead address space operations.  Since commit d4388340 ("fs:
      convert mpage_readpages to mpage_readahead"), gfs2_readahead is passed
      the pages to read ahead locked.  With that, the current holder of the
      inode glock may be trying to lock one of those pages while
      gfs2_readahead is trying to take the inode glock, resulting in a
      deadlock.
      
      Fix that by moving the lock taking to the higher-level ->read_iter file
      and ->fault vm operations.  This also gets rid of an ugly lock inversion
      workaround in gfs2_readpage.
      
      The cache consistency model of filesystems like gfs2 is such that if
      data is found in the page cache, the data is up to date and can be used
      without taking any filesystem locks.  If a page is not cached,
      filesystem locks must be taken before populating the page cache.
      
      To avoid taking the inode glock when the data is already cached,
      gfs2_file_read_iter first tries to read the data with the IOCB_NOIO flag
      set.  If that fails, the inode glock is taken and the operation is
      retried with the IOCB_NOIO flag cleared.
      
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      20f82999
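The "try without I/O first, then lock and retry" pattern from the gfs2 commit above can be sketched as a toy model. The types and functions (`toy_inode`, `toy_generic_read()`, `toy_file_read()`) are hypothetical; the `noio` argument plays the role of the IOCB_NOIO flag and the `lock_taken` counter stands in for acquiring the inode glock.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical inode: tracks cache state and how often the lock was taken. */
struct toy_inode {
	bool cached;		/* is the data already in the page cache? */
	int lock_taken;		/* number of cluster lock acquisitions */
};

/* With noio set, the read refuses to populate the cache and backs off. */
static int toy_generic_read(struct toy_inode *ip, bool noio)
{
	if (!ip->cached) {
		if (noio)
			return -EAGAIN;
		ip->cached = true;	/* simulated I/O fills the cache */
	}
	return 0;
}

static int toy_file_read(struct toy_inode *ip)
{
	/* Fast path: cached data may be used without any filesystem lock. */
	int ret = toy_generic_read(ip, true);

	if (ret != -EAGAIN)
		return ret;

	/* Slow path: take the lock, then retry with I/O allowed. */
	ip->lock_taken++;
	ret = toy_generic_read(ip, false);
	/* The lock would be released here. */
	return ret;
}
```

The first read of an uncached inode takes the lock once; a second read finds the data cached and completes on the fast path without touching the lock.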
    • btrfs: discard: add missing put when grabbing block group from unused list · 04e484c5
      Qu Wenruo authored
      
      
      [BUG]
      The following small test script can trigger ASSERT() at unmount time:
      
        mkfs.btrfs -f $dev
        mount $dev $mnt
        mount -o remount,discard=async $mnt
        umount $mnt
      
      The call trace:
        assertion failed: atomic_read(&block_group->count) == 1, in fs/btrfs/block-group.c:3431
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/ctree.h:3204!
        invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
        CPU: 4 PID: 10389 Comm: umount Tainted: G           O      5.8.0-rc3-custom+ #68
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        Call Trace:
         btrfs_free_block_groups.cold+0x22/0x55 [btrfs]
         close_ctree+0x2cb/0x323 [btrfs]
         btrfs_put_super+0x15/0x17 [btrfs]
         generic_shutdown_super+0x72/0x110
         kill_anon_super+0x18/0x30
         btrfs_kill_super+0x17/0x30 [btrfs]
         deactivate_locked_super+0x3b/0xa0
         deactivate_super+0x40/0x50
         cleanup_mnt+0x135/0x190
         __cleanup_mnt+0x12/0x20
         task_work_run+0x64/0xb0
         __prepare_exit_to_usermode+0x1bc/0x1c0
         __syscall_return_slowpath+0x47/0x230
         do_syscall_64+0x64/0xb0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The code:
                      ASSERT(atomic_read(&block_group->count) == 1);
                      btrfs_put_block_group(block_group);
      
      [CAUSE]
      Obviously some btrfs_get_block_group() call is missing its matching put
      call.
      
      The offending btrfs_get_block_group() happens here:
      
        void btrfs_mark_bg_unused(struct btrfs_block_group *bg)
        {
        	if (list_empty(&bg->bg_list)) {
        		btrfs_get_block_group(bg);
        		list_add_tail(&bg->bg_list, &fs_info->unused_bgs);
        	}
        }
      
      So every call site that removes a block group from the unused_bgs list
      should drop the reference count of that block group.
      
      However, async discard doesn't follow this convention:
      
        void btrfs_discard_punt_unused_bgs_list(struct btrfs_fs_info *fs_info)
        {
        	list_for_each_entry_safe(block_group, next, &fs_info->unused_bgs,
        				 bg_list) {
        		list_del_init(&block_group->bg_list);
        		btrfs_discard_queue_work(&fs_info->discard_ctl, block_group);
        	}
        }
      
      And in btrfs_discard_queue_work(), it doesn't call
      btrfs_put_block_group() either.
      
      [FIX]
      Fix the problem by reducing the reference count when we grab the block
      group from unused_bgs list.
      
      Reported-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Fixes: 6e80d4f8 ("btrfs: handle empty block_group removal for async discard")
      CC: stable@vger.kernel.org # 5.6+
      Tested-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      04e484c5
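The convention from the commit above (the unused list holds a reference, so whoever removes an entry must drop it) can be sketched with a toy model. All names are hypothetical, and the list is reduced to a flag on each block group to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block group: a refcount plus two membership flags. */
struct toy_bg {
	int refs;
	int on_unused_list;
	int queued_for_discard;
};

static void toy_get(struct toy_bg *bg)
{
	bg->refs++;
}

static void toy_put(struct toy_bg *bg)
{
	assert(bg->refs > 0);
	bg->refs--;
}

/* Adding to the unused list takes a reference on behalf of the list. */
static void toy_mark_unused(struct toy_bg *bg)
{
	if (!bg->on_unused_list) {
		toy_get(bg);
		bg->on_unused_list = 1;
	}
}

/* Moves every unused block group over to the discard queue. */
static void toy_punt_unused(struct toy_bg **bgs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		struct toy_bg *bg = bgs[i];

		if (!bg->on_unused_list)
			continue;
		bg->on_unused_list = 0;
		bg->queued_for_discard = 1;
		/* Drop the reference the unused list was holding. */
		toy_put(bg);
	}
}
```

If the `toy_put()` in `toy_punt_unused()` is left out, the count stays at 2 after the punt, which corresponds to the `count == 1` assertion failing at unmount in the report above.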
  7. Jul 04, 2020
    • io_uring: fix regression with always ignoring signals in io_cqring_wait() · b7db41c9
      Jens Axboe authored
      
      
      When switching to TWA_SIGNAL for task_work notifications, we also made
      any signal based condition in io_cqring_wait() return -ERESTARTSYS.
      This breaks applications that rely on using signals to abort someone
      waiting for events.
      
      Check if we have a signal pending because of queued task_work, and
      repeat the signal check once we've run the task_work. This provides a
      reliable way of telling the two apart.
      
      Additionally, only use TWA_SIGNAL if we are using an eventfd. If not,
      we don't have the dependency situation described in the original commit,
      and we can get by with just using TWA_RESUME like we previously did.
      
      Fixes: ce593a6c ("io_uring: use signal based task_work running")
      Cc: stable@vger.kernel.org # v5.7
      Reported-by: Andres Freund <andres@anarazel.de>
      Tested-by: Andres Freund <andres@anarazel.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b7db41c9
  8. Jul 03, 2020
    • Call sysctl_head_finish on error · d4d80e69
      Matthew Wilcox (Oracle) authored
      
      
      This error path returned directly instead of calling sysctl_head_finish().
      
      Fixes: ef9d965b ("sysctl: reject gigantic reads/write to sysctl files")
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      d4d80e69
    • gfs2: The freeze glock should never be frozen · c860f8ff
      Bob Peterson authored
      
      
      Before this patch, some gfs2 code locked the freeze glock with the
      LM_FLAG_NOEXP (do not freeze) flag, and some did not. We never want to
      freeze the freeze glock, so this patch makes it use LM_FLAG_NOEXP
      consistently.
      
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      c860f8ff
    • gfs2: When freezing gfs2, use GL_EXACT and not GL_NOCACHE · 623ba664
      Bob Peterson authored
      
      
      Before this patch, the freeze code in gfs2 specified GL_NOCACHE in
      several places. That's wrong because we always want to know whether
      the file system is frozen.
      
      There was also a problem with freeze/thaw transitioning the glock from
      frozen (EX) to thawed (SH) because gfs2 will normally grant glocks in EX
      to processes that request it in SH mode, unless GL_EXACT is specified.
      Therefore, the freeze/thaw code, which tried to reacquire the glock in
      SH mode would get the glock in EX mode, and miss the transition from EX
      to SH. That made it think the thaw had completed normally, but since the
      glock was still cached in EX, other nodes could not freeze again.
      
      This patch removes the GL_NOCACHE flag to allow the freeze glock to be
      cached. It also adds the GL_EXACT flag so the glock is fully transitioned
      from EX to SH, thereby allowing future freeze operations.
      
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      623ba664
    • gfs2: read-only mounts should grab the sd_freeze_gl glock · b780cc61
      Bob Peterson authored
      
      
      Before this patch, only read-write mounts would grab the freeze
      glock in read-only mode, as part of gfs2_make_fs_rw. So the freeze
      glock was never initialized. That meant requests to freeze, which
      request the glock in EX, were granted without any state transition.
      That meant you could mount a gfs2 file system, which is currently
      frozen on a different cluster node, in read-only mode.
      
      This patch makes read-only mounts lock the freeze glock in SH mode,
      which will block for file systems that are frozen on another node.
      
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      b780cc61
    • gfs2: freeze should work on read-only mounts · 541656d3
      Bob Peterson authored
      
      
      Before this patch, function freeze_go_sync, called when promoting
      the freeze glock, was testing for the SDF_JOURNAL_LIVE superblock flag.
      That's only set for read-write mounts. Read-only mounts don't use a
      journal, so the bit is never set, so the freeze never happened.
      
      This patch removes the check for SDF_JOURNAL_LIVE for freeze requests
      but still checks it when deciding whether to flush a journal.
      
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      541656d3
    • gfs2: eliminate GIF_ORDERED in favor of list_empty · 7542486b
      Bob Peterson authored
      
      
      In several places, we used the GIF_ORDERED inode flag to determine
      if an inode was on the ordered writes list. However, since we always
      held the sd_ordered_lock spin_lock during the manipulation, we can
      just as easily check list_empty(&ip->i_ordered) instead.
      This allows us to keep more than one ordered writes list to make
      journal writing improvements.
      
      This patch eliminates GIF_ORDERED in favor of checking list_empty.
      
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      7542486b
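Replacing a membership flag with a list_empty() check, as in the commit above, can be sketched with a minimal userspace list. The `toy_list` helpers below imitate the shape of the kernel's circular doubly linked list but are written from scratch for illustration; `toy_inode` and `toy_inode_is_ordered()` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal circular doubly linked list node. */
struct toy_list {
	struct toy_list *next, *prev;
};

static void toy_list_init(struct toy_list *l)
{
	l->next = l;
	l->prev = l;
}

/* A node that points at itself is on no list. */
static bool toy_list_empty(const struct toy_list *l)
{
	return l->next == l;
}

static void toy_list_add(struct toy_list *entry, struct toy_list *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Unlink and re-initialise, so the emptiness check works afterwards. */
static void toy_list_del_init(struct toy_list *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	toy_list_init(entry);
}

/* Hypothetical inode carrying only its ordered-list linkage. */
struct toy_inode {
	struct toy_list i_ordered;
};

/* Membership is derived from the linkage; no separate flag is needed. */
static bool toy_inode_is_ordered(const struct toy_inode *ip)
{
	return !toy_list_empty(&ip->i_ordered);
}
```

The check only works if entries are always removed with the re-initialising variant, and, as the commit message notes, if the list is only manipulated under the lock that protects it. It also does not depend on which list head the inode is linked to, which is what allows more than one ordered writes list.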
  9. Jul 02, 2020
    • btrfs: reset tree root pointer after error in init_tree_roots · 0465337c
      Josef Bacik authored
      
      
      Eric reported an issue where mounting -o recovery with a fuzzed fs
      resulted in a kernel panic.  This is because we tried to free the tree
      node, except it was an error from the read.  Fix this by properly
      resetting tree_root->node to NULL in this case.  The panic was the
      following:
      
        BTRFS warning (device loop0): failed to read tree root
        BUG: kernel NULL pointer dereference, address: 000000000000001f
        RIP: 0010:free_extent_buffer+0xe/0x90 [btrfs]
        Call Trace:
         free_root_extent_buffers.part.0+0x11/0x30 [btrfs]
         free_root_pointers+0x1a/0xa2 [btrfs]
         open_ctree+0x1776/0x18a5 [btrfs]
         btrfs_mount_root.cold+0x13/0xfa [btrfs]
         ? selinux_fs_context_parse_param+0x37/0x80
         legacy_get_tree+0x27/0x40
         vfs_get_tree+0x25/0xb0
         fc_mount+0xe/0x30
         vfs_kern_mount.part.0+0x71/0x90
         btrfs_mount+0x147/0x3e0 [btrfs]
         ? cred_has_capability+0x7c/0x120
         ? legacy_get_tree+0x27/0x40
         legacy_get_tree+0x27/0x40
         vfs_get_tree+0x25/0xb0
         do_mount+0x735/0xa40
         __x64_sys_mount+0x8e/0xd0
         do_syscall_64+0x4d/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Nik says: this is problematic only if we fail on the last iteration of
      the loop as this results in init_tree_roots returning err value with
      tree_root->node = -ERR. Subsequently the caller does: fail_tree_roots
      which calls free_root_pointers on the bogus value.
      
      Reported-by: Eric Sandeen <sandeen@redhat.com>
      Fixes: b8522a1e ("btrfs: Factor out tree roots initialization during mount")
      CC: stable@vger.kernel.org # 5.5+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add details how the pointer gets dereferenced ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      0465337c
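Why a stored error value must be reset before cleanup runs can be sketched in userspace. The helpers below imitate the kernel's error-pointer idiom but are hypothetical re-creations (`toy_err_ptr()`, `toy_is_err()`, `toy_ptr_err()`), as are `toy_root`, `toy_read_block()`, `toy_init_root()` and `toy_free_root()`.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace imitation of encoding small negative errnos in a pointer. */
#define TOY_MAX_ERRNO 4095

static void *toy_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static int toy_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

static long toy_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* Hypothetical root holding one buffer pointer. */
struct toy_root {
	void *node;
};

/* fail != 0 simulates a failed read that returns an encoded error. */
static void *toy_read_block(int fail)
{
	return fail ? toy_err_ptr(-EIO) : malloc(16);
}

static int toy_init_root(struct toy_root *root, int fail)
{
	root->node = toy_read_block(fail);
	if (toy_is_err(root->node)) {
		long err = toy_ptr_err(root->node);

		/* Never leave an error value where cleanup expects a pointer. */
		root->node = NULL;
		return (int)err;
	}
	return 0;
}

/* Cleanup that is safe to call after a failed init. */
static void toy_free_root(struct toy_root *root)
{
	free(root->node);	/* free(NULL) is a no-op */
	root->node = NULL;
}
```

Without the reset, `toy_free_root()` would hand the encoded error to `free()`, the userspace analogue of free_extent_buffer() dereferencing the bogus value in the trace above.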
    • btrfs: fix reclaim_size counter leak after stealing from global reserve · 6d548b9e
      Filipe Manana authored
      
      
      Commit 7f9fe614 ("btrfs: improve global reserve stealing logic"),
      added in the 5.8 merge window, introduced another leak for the space_info's
      reclaim_size counter. This is very often triggered by the test cases
      generic/269 and generic/416 from fstests, producing a stack trace like the
      following during unmount:
      
      [37079.155499] ------------[ cut here ]------------
      [37079.156844] WARNING: CPU: 2 PID: 2000423 at fs/btrfs/block-group.c:3422 btrfs_free_block_groups+0x2eb/0x300 [btrfs]
      [37079.158090] Modules linked in: dm_snapshot btrfs dm_thin_pool (...)
      [37079.164440] CPU: 2 PID: 2000423 Comm: umount Tainted: G        W         5.7.0-rc7-btrfs-next-62 #1
      [37079.165422] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), (...)
      [37079.167384] RIP: 0010:btrfs_free_block_groups+0x2eb/0x300 [btrfs]
      [37079.168375] Code: bd 58 ff ff ff 00 4c 8d (...)
      [37079.170199] RSP: 0018:ffffaa53875c7de0 EFLAGS: 00010206
      [37079.171120] RAX: ffff98099e701cf8 RBX: ffff98099e2d4000 RCX: 0000000000000000
      [37079.172057] RDX: 0000000000000001 RSI: ffffffffc0acc5b1 RDI: 00000000ffffffff
      [37079.173002] RBP: ffff98099e701cf8 R08: 0000000000000000 R09: 0000000000000000
      [37079.173886] R10: 0000000000000000 R11: 0000000000000000 R12: ffff98099e701c00
      [37079.174730] R13: ffff98099e2d5100 R14: dead000000000122 R15: dead000000000100
      [37079.175578] FS:  00007f4d7d0a5840(0000) GS:ffff9809ec600000(0000) knlGS:0000000000000000
      [37079.176434] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [37079.177289] CR2: 0000559224dcc000 CR3: 000000012207a004 CR4: 00000000003606e0
      [37079.178152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [37079.178935] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [37079.179675] Call Trace:
      [37079.180419]  close_ctree+0x291/0x2d1 [btrfs]
      [37079.181162]  generic_shutdown_super+0x6c/0x100
      [37079.181898]  kill_anon_super+0x14/0x30
      [37079.182641]  btrfs_kill_super+0x12/0x20 [btrfs]
      [37079.183371]  deactivate_locked_super+0x31/0x70
      [37079.184012]  cleanup_mnt+0x100/0x160
      [37079.184650]  task_work_run+0x68/0xb0
      [37079.185284]  exit_to_usermode_loop+0xf9/0x100
      [37079.185920]  do_syscall_64+0x20d/0x260
      [37079.186556]  entry_SYSCALL_64_after_hwframe+0x49/0xb3
      [37079.187197] RIP: 0033:0x7f4d7d2d9357
      [37079.187836] Code: eb 0b 00 f7 d8 64 89 01 48 (...)
      [37079.189180] RSP: 002b:00007ffee4e0d368 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [37079.189845] RAX: 0000000000000000 RBX: 00007f4d7d3fb224 RCX: 00007f4d7d2d9357
      [37079.190515] RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000559224dc5c90
      [37079.191173] RBP: 0000559224dc1970 R08: 0000000000000000 R09: 00007ffee4e0c0e0
      [37079.191815] R10: 0000559224dc7b00 R11: 0000000000000246 R12: 0000000000000000
      [37079.192451] R13: 0000559224dc5c90 R14: 0000559224dc1a80 R15: 0000559224dc1ba0
      [37079.193096] irq event stamp: 0
      [37079.193729] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
      [37079.194379] hardirqs last disabled at (0): [<ffffffff97ab8935>] copy_process+0x755/0x1ea0
      [37079.195033] softirqs last  enabled at (0): [<ffffffff97ab8935>] copy_process+0x755/0x1ea0
      [37079.195700] softirqs last disabled at (0): [<0000000000000000>] 0x0
      [37079.196318] ---[ end trace b32710d864dea887 ]---
      
      In the past, commit d611add4 ("btrfs: fix reclaim counter leak of
      space_info objects") fixed similar cases. That commit has a more recent
      date (April 7 2020) than the commit mentioned before (March 13 2020),
      but it was merged in kernel 5.7, while the older commit, which
      introduces a new leak, was merged only in the 5.8 merge window. So the
      leak sneaked in unnoticed.
      
      Fix this by making steal_from_global_rsv() remove the ticket using the
      helper remove_ticket(), which decrements the reclaim_size counter of the
      space_info object.
      
      Fixes: 7f9fe614 ("btrfs: improve global reserve stealing logic")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6d548b9e
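The idea behind the fix above (route every ticket removal through one helper that also adjusts the counter) can be sketched with a toy model. The structures and functions are hypothetical simplifications; the ticket list is reduced to a count and a per-ticket flag.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical space_info: a byte counter plus the number of queued tickets. */
struct toy_space_info {
	uint64_t reclaim_size;
	int nr_tickets;
};

struct toy_ticket {
	uint64_t bytes;
	int queued;
};

static void toy_add_ticket(struct toy_space_info *si, struct toy_ticket *t)
{
	t->queued = 1;
	si->nr_tickets++;
	si->reclaim_size += t->bytes;
}

/* The one place that unlinks a ticket and fixes up the counter with it. */
static void toy_remove_ticket(struct toy_space_info *si, struct toy_ticket *t)
{
	if (!t->queued)
		return;
	t->queued = 0;
	si->nr_tickets--;
	assert(si->reclaim_size >= t->bytes);
	si->reclaim_size -= t->bytes;
}

/* Satisfies a ticket from another reserve; removal goes through the helper. */
static void toy_steal_from_global_rsv(struct toy_space_info *si,
				      struct toy_ticket *t)
{
	toy_remove_ticket(si, t);
	t->bytes = 0;
}
```

Unlinking the ticket by hand inside `toy_steal_from_global_rsv()` would leave `reclaim_size` non-zero once all tickets are gone, which is the kind of leftover the warning at unmount reports.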
    • btrfs: fix fatal extent_buffer readahead vs releasepage race · 6bf9cd2e
      Boris Burkov authored
      
      
      Under somewhat convoluted conditions, it is possible to attempt to
      release an extent_buffer that is under io, which triggers a BUG_ON in
      btrfs_release_extent_buffer_pages.
      
      This relies on a few different factors. First, extent_buffer reads done
      as readahead for searching use WAIT_NONE, so they free the local extent
      buffer reference while the io is outstanding. They should still be
      protected by TREE_REF. However, if the system is doing significant
      reclaim, and simultaneously heavily accessing the extent_buffers, it is
      possible for releasepage to race with two concurrent readahead attempts
      in a way that leaves TREE_REF unset when the readahead extent buffer is
      released.
      
      Essentially, if two tasks race to allocate a new extent_buffer, but the
      winner who attempts the first io is rebuffed by a page being locked
      (likely by the reclaim itself) then the loser will still go ahead with
      issuing the readahead. The loser's call to find_extent_buffer must also
      race with the reclaim task reading the extent_buffer's refcount as 1 in
      a way that allows the reclaim to re-clear the TREE_REF checked by
      find_extent_buffer.
      
      The following represents an example execution demonstrating the race:
      
                  CPU0                                                         CPU1                                           CPU2
      reada_for_search                                            reada_for_search
        readahead_tree_block                                        readahead_tree_block
          find_create_tree_block                                      find_create_tree_block
            alloc_extent_buffer                                         alloc_extent_buffer
                                                                        find_extent_buffer // not found
                                                                        allocates eb
                                                                        lock pages
                                                                        associate pages to eb
                                                                        insert eb into radix tree
                                                                        set TREE_REF, refs == 2
                                                                        unlock pages
                                                                    read_extent_buffer_pages // WAIT_NONE
                                                                      not uptodate (brand new eb)
                                                                                                                  lock_page
                                                                      if !trylock_page
                                                                        goto unlock_exit // not an error
                                                                    free_extent_buffer
                                                                      release_extent_buffer
                                                                        atomic_dec_and_test refs to 1
              find_extent_buffer // found
                                                                                                                  try_release_extent_buffer
                                                                                                                    take refs_lock
                                                                                                                    reads refs == 1; no io
                atomic_inc_not_zero refs to 2
                mark_buffer_accessed
                  check_buffer_tree_ref
                    // not STALE, won't take refs_lock
                    refs == 2; TREE_REF set // no action
          read_extent_buffer_pages // WAIT_NONE
                                                                                                                    clear TREE_REF
                                                                                                                    release_extent_buffer
                                                                                                                      atomic_dec_and_test refs to 1
                                                                                                                      unlock_page
            still not uptodate (CPU1 read failed on trylock_page)
            locks pages
            set io_pages > 0
            submit io
            return
          free_extent_buffer
            release_extent_buffer
              dec refs to 0
              delete from radix tree
              btrfs_release_extent_buffer_pages
                BUG_ON(io_pages > 0)!!!
      
      We observe this at a very low rate in production and were also able to
      reproduce it in a test environment by introducing some spurious delays
      and by introducing probabilistic trylock_page failures.
      
      To fix it, we apply check_buffer_tree_ref at a point where TREE_REF
      could not possibly be unset by a competing task: after io_pages has been
      incremented. All the codepaths that clear TREE_REF check for io, so they
      would not be able to clear it after this point until the io is done.
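      
      The interleaving above can be replayed deterministically in a small
      userspace model. This is only a sketch under stated assumptions: struct
      eb_model and both functions are hypothetical stand-ins, the three CPUs
      are flattened into one sequential replay, and locking and page state
      are left out. It is meant to show why re-checking TREE_REF after
      io_pages is incremented keeps the last reference alive, not to mirror
      the actual patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-in for struct extent_buffer. */
struct eb_model {
    int refs;       /* models the atomic refs counter */
    bool tree_ref;  /* models the TREE_REF flag */
    int io_pages;   /* models pages with io in flight */
};

/* Models check_buffer_tree_ref: make sure the tree holds a reference. */
static void model_check_buffer_tree_ref(struct eb_model *eb)
{
    if (eb->refs >= 2 && eb->tree_ref)
        return;                 /* fast path: nothing to do */
    if (!eb->tree_ref) {
        eb->tree_ref = true;    /* re-take the tree reference */
        eb->refs++;
    }
}

/*
 * Replays the interleaving from the trace in program order.
 * Returns true if the final put would reach BUG_ON(io_pages > 0).
 */
static bool replay_race(bool apply_fix)
{
    /* CPU1: brand new eb inserted into the tree, TREE_REF set, refs == 2 */
    struct eb_model eb = { .refs = 2, .tree_ref = true, .io_pages = 0 };

    /* CPU1: trylock_page failed, free_extent_buffer drops refs to 1 */
    eb.refs--;
    assert(eb.refs == 1);

    /* CPU2: try_release_extent_buffer samples refs == 1 and no io */
    bool cpu2_will_release = (eb.refs == 1 && eb.io_pages == 0);

    /* CPU0: find_extent_buffer takes a reference; TREE_REF looks fine */
    eb.refs++;
    model_check_buffer_tree_ref(&eb);

    /* CPU2: acts on its earlier observation, clears TREE_REF, drops a ref */
    if (cpu2_will_release) {
        eb.tree_ref = false;
        eb.refs--;
    }

    /* CPU0: read_extent_buffer_pages submits io */
    eb.io_pages = 1;
    if (apply_fix)
        model_check_buffer_tree_ref(&eb);   /* the re-check from the fix */

    /* CPU0: free_extent_buffer drops its reference */
    eb.refs--;
    return eb.refs == 0 && eb.io_pages > 0;
}
```

      In this model, replay_race(false) ends with refs == 0 while io is still
      pending, and replay_race(true) ends with the tree reference still held.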
      
      Stack trace, for reference:
      [1417839.424739] ------------[ cut here ]------------
      [1417839.435328] kernel BUG at fs/btrfs/extent_io.c:4841!
      [1417839.447024] invalid opcode: 0000 [#1] SMP
      [1417839.502972] RIP: 0010:btrfs_release_extent_buffer_pages+0x20/0x1f0
      [1417839.517008] Code: ed e9 ...
      [1417839.558895] RSP: 0018:ffffc90020bcf798 EFLAGS: 00010202
      [1417839.570816] RAX: 0000000000000002 RBX: ffff888102d6def0 RCX: 0000000000000028
      [1417839.586962] RDX: 0000000000000002 RSI: ffff8887f0296482 RDI: ffff888102d6def0
      [1417839.603108] RBP: ffff88885664a000 R08: 0000000000000046 R09: 0000000000000238
      [1417839.619255] R10: 0000000000000028 R11: ffff88885664af68 R12: 0000000000000000
      [1417839.635402] R13: 0000000000000000 R14: ffff88875f573ad0 R15: ffff888797aafd90
      [1417839.651549] FS:  00007f5a844fa700(0000) GS:ffff88885f680000(0000) knlGS:0000000000000000
      [1417839.669810] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [1417839.682887] CR2: 00007f7884541fe0 CR3: 000000049f609002 CR4: 00000000003606e0
      [1417839.699037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [1417839.715187] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [1417839.731320] Call Trace:
      [1417839.737103]  release_extent_buffer+0x39/0x90
      [1417839.746913]  read_block_for_search.isra.38+0x2a3/0x370
      [1417839.758645]  btrfs_search_slot+0x260/0x9b0
      [1417839.768054]  btrfs_lookup_file_extent+0x4a/0x70
      [1417839.778427]  btrfs_get_extent+0x15f/0x830
      [1417839.787665]  ? submit_extent_page+0xc4/0x1c0
      [1417839.797474]  ? __do_readpage+0x299/0x7a0
      [1417839.806515]  __do_readpage+0x33b/0x7a0
      [1417839.815171]  ? btrfs_releasepage+0x70/0x70
      [1417839.824597]  extent_readpages+0x28f/0x400
      [1417839.833836]  read_pages+0x6a/0x1c0
      [1417839.841729]  ? startup_64+0x2/0x30
      [1417839.849624]  __do_page_cache_readahead+0x13c/0x1a0
      [1417839.860590]  filemap_fault+0x6c7/0x990
      [1417839.869252]  ? xas_load+0x8/0x80
      [1417839.876756]  ? xas_find+0x150/0x190
      [1417839.884839]  ? filemap_map_pages+0x295/0x3b0
      [1417839.894652]  __do_fault+0x32/0x110
      [1417839.902540]  __handle_mm_fault+0xacd/0x1000
      [1417839.912156]  handle_mm_fault+0xaa/0x1c0
      [1417839.921004]  __do_page_fault+0x242/0x4b0
      [1417839.930044]  ? page_fault+0x8/0x30
      [1417839.937933]  page_fault+0x1e/0x30
      [1417839.945631] RIP: 0033:0x33c4bae
      [1417839.952927] Code: Bad RIP value.
      [1417839.960411] RSP: 002b:00007f5a844f7350 EFLAGS: 00010206
      [1417839.972331] RAX: 000000000000006e RBX: 1614b3ff6a50398a RCX: 0000000000000000
      [1417839.988477] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
      [1417840.004626] RBP: 00007f5a844f7420 R08: 000000000000006e R09: 00007f5a94aeccb8
      [1417840.020784] R10: 00007f5a844f7350 R11: 0000000000000000 R12: 00007f5a94aecc79
      [1417840.036932] R13: 00007f5a94aecc78 R14: 00007f5a94aecc90 R15: 00007f5a94aecc40
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6bf9cd2e
    • btrfs: convert comments to fallthrough annotations · c730ae0c
      Marcos Paulo de Souza authored
      
      
      Convert fall through comments to the pseudo-keyword which is now the
      preferred way.
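      
      As an illustration of the conversion, here is a minimal userspace
      sketch. The macro definition approximates how the kernel provides the
      fallthrough pseudo-keyword (an attribute where the compiler supports
      it, an empty statement otherwise); privileges_for and its values are
      hypothetical and do not come from the btrfs code.

```c
#include <assert.h>

/* Userspace approximation of the kernel's fallthrough pseudo-keyword. */
#if defined(__has_attribute)
# if __has_attribute(__fallthrough__)
#  define fallthrough __attribute__((__fallthrough__))
# endif
#endif
#ifndef fallthrough
# define fallthrough do {} while (0) /* fallthrough */
#endif

/* Hypothetical example: each level also grants everything below it. */
static int privileges_for(int level)
{
    int privs = 0;

    switch (level) {
    case 2:
        privs += 4;
        fallthrough;    /* previously written as a "fall through" comment */
    case 1:
        privs += 2;
        fallthrough;
    case 0:
        privs += 1;
        break;
    default:
        privs = -1;     /* unknown level */
        break;
    }
    return privs;
}
```

      Unlike a comment, the annotation is a statement the compiler can
      check, so an accidental fall-through without it can still be flagged
      by -Wimplicit-fallthrough.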
      
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c730ae0c
    • cifs: prevent truncation from long to int in wait_for_free_credits · 19e88867
      Ronnie Sahlberg authored
      
      
      The wait_event_... macros evaluate to long, so we should not assign the
      result to an int, as this may truncate the value.
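      
      The effect of such a narrowing can be sketched in userspace C.
      fake_wait_remaining is a hypothetical stand-in for a macro that yields
      a long; it is not the cifs code. The narrowing result is
      implementation-defined: on a typical LP64 Linux build with GCC or
      Clang the low 32 bits are kept, so LONG_MAX becomes -1 and a positive
      result would look like an error. Where long and int have the same
      width, nothing is lost.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-in for a wait_event_...-style macro yielding a long. */
static long fake_wait_remaining(void)
{
    return LONG_MAX;    /* a large positive result */
}

/* Problematic pattern: the long result is stored in an int. */
static int result_stored_in_int(void)
{
    int rc = (int)fake_wait_remaining();    /* may drop the high bits */
    return rc;
}

/* Preferred pattern: keep the result in a long. */
static long result_stored_in_long(void)
{
    long rc = fake_wait_remaining();
    return rc;
}
```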
      
      Reported-by: Marshall Midden <marshallmidden@gmail.com>
      Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
      19e88867
    • cifs: Fix the target file was deleted when rename failed. · 9ffad926
      Zhang Xiaoxu authored
      
      
      When running xfstest generic/035, we found that the target file was
      deleted if the rename returned -EACCES.
      
      In cifs_rename2, we unlink the positive target dentry if the rename
      failed with EACCES or EEXIST, even if the target dentry was already
      positive before the rename. As a result, the existing file was deleted.
      
      We should only delete a target file that was created during the
      rename.
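      
      The rule stated above can be written as a small predicate. This is a
      sketch of the commit message's intent only: should_unlink_target and
      its parameters are hypothetical, and the actual change in cifs_rename2
      may be structured differently.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical predicate: after a failed rename, may the target be unlinked?
 * rc uses the kernel convention of 0 or a negative errno value.
 */
static bool should_unlink_target(int rc, bool target_existed_before_rename)
{
    assert(rc <= 0);

    /* Only these two failures are candidates for cleaning up the target. */
    if (rc != -EACCES && rc != -EEXIST)
        return false;

    /* Never remove a file that was already there before the rename. */
    return !target_existed_before_rename;
}
```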
      
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Steve French <stfrench@microsoft.com>
      Reviewed-by: Aurelien Aptel <aaptel@suse.com>
      9ffad926