Skip to content
  1. Mar 27, 2018
  2. Mar 26, 2018
  3. Mar 23, 2018
  4. Mar 20, 2018
  5. Mar 19, 2018
  6. Mar 16, 2018
  7. Mar 15, 2018
    • Eric W. Biederman's avatar
      fs: Teach path_connected to handle nfs filesystems with multiple roots. · 95dd7758
      Eric W. Biederman authored
      
      
      On nfsv2 and nfsv3 the nfs server can export subsets of the same
      filesystem and report the same filesystem identifier, so that the nfs
      client can know they are the same filesystem.  The subsets can be from
      disjoint directory trees.  The nfsv2 and nfsv3 filesystems provides no
      way to find the common root of all directory trees exported form the
      server with the same filesystem identifier.
      
      The practical result is that in struct super s_root for nfs s_root is
      not necessarily the root of the filesystem.  The nfs mount code sets
      s_root to the root of the first subset of the nfs filesystem that the
      kernel mounts.
      
      This effects the dcache invalidation code in generic_shutdown_super
      currently called shrunk_dcache_for_umount and that code for years
      has gone through an additional list of dentries that might be dentry
      trees that need to be freed to accomodate nfs.
      
      When I wrote path_connected I did not realize nfs was so special, and
      it's hueristic for avoiding calling is_subdir can fail.
      
      The practical case where this fails is when there is a move of a
      directory from the subtree exposed by one nfs mount to the subtree
      exposed by another nfs mount.  This move can happen either locally or
      remotely.  With the remote case requiring that the move directory be cached
      before the move and that after the move someone walks the path
      to where the move directory now exists and in so doing causes the
      already cached directory to be moved in the dcache through the magic
      of d_splice_alias.
      
      If someone whose working directory is in the move directory or a
      subdirectory and now starts calling .. from the initial mount of nfs
      (where s_root == mnt_root), then path_connected as a heuristic will
      not bother with the is_subdir check.  As s_root really is not the root
      of the nfs filesystem this heuristic is wrong, and the path may
      actually not be connected and path_connected can fail.
      
      The is_subdir function might be cheap enough that we can call it
      unconditionally.  Verifying that will take some benchmarking and
      the result may not be the same on all kernels this fix needs
      to be backported to.  So I am avoiding that for now.
      
      Filesystems with snapshots such as nilfs and btrfs do something
      similar.  But as the directory tree of the snapshots are disjoint
      from one another and from the main directory tree rename won't move
      things between them and this problem will not occur.
      
      Cc: stable@vger.kernel.org
      Reported-by: default avatarAl Viro <viro@ZenIV.linux.org.uk>
      Fixes: 397d425d ("vfs: Test for and handle paths that are unreachable from their mnt_root")
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      95dd7758
  8. Mar 14, 2018
    • Edmund Nadolski's avatar
      btrfs: add missing initialization in btrfs_check_shared · 18bf591b
      Edmund Nadolski authored
      
      
      This patch addresses an issue that causes fiemap to falsely
      report a shared extent.  The test case is as follows:
      
      xfs_io -f -d -c "pwrite -b 16k 0 64k" -c "fiemap -v" /media/scratch/file5
      sync
      xfs_io  -c "fiemap -v" /media/scratch/file5
      
      which gives the resulting output:
      
      wrote 65536/65536 bytes at offset 0
      64 KiB, 4 ops; 0.0000 sec (121.359 MiB/sec and 7766.9903 ops/sec)
      /media/scratch/file5:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..127]:        24576..24703       128 0x2001
      /media/scratch/file5:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..127]:        24576..24703       128   0x1
      
      This is because btrfs_check_shared calls find_parent_nodes
      repeatedly in a loop, passing a share_check struct to report
      the count of shared extent. But btrfs_check_shared does not
      re-initialize the count value to zero for subsequent calls
      from the loop, resulting in a false share count value. This
      is a regressive behavior from 4.13.
      
      With proper re-initialization the test result is as follows:
      
      wrote 65536/65536 bytes at offset 0
      64 KiB, 4 ops; 0.0000 sec (110.035 MiB/sec and 7042.2535 ops/sec)
      /media/scratch/file5:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..127]:        24576..24703       128   0x1
      /media/scratch/file5:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..127]:        24576..24703       128   0x1
      
      which corrects the regression.
      
      Fixes: 3ec4d323 ("btrfs: allow backref search checks for shared extents")
      Signed-off-by: default avatarEdmund Nadolski <enadolski@suse.com>
      [ add text from cover letter to changelog ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      18bf591b
    • Dmitriy Gorokh's avatar
      btrfs: Fix NULL pointer exception in find_bio_stripe · 047fdea6
      Dmitriy Gorokh authored
      
      
      On detaching of a disk which is a part of a RAID6 filesystem, the
      following kernel OOPS may happen:
      
      [63122.680461] BTRFS error (device sdo): bdev /dev/sdo errs: wr 0, rd 0, flush 1, corrupt 0, gen 0
      [63122.719584] BTRFS warning (device sdo): lost page write due to IO error on /dev/sdo
      [63122.719587] BTRFS error (device sdo): bdev /dev/sdo errs: wr 1, rd 0, flush 1, corrupt 0, gen 0
      [63122.803516] BTRFS warning (device sdo): lost page write due to IO error on /dev/sdo
      [63122.803519] BTRFS error (device sdo): bdev /dev/sdo errs: wr 2, rd 0, flush 1, corrupt 0, gen 0
      [63122.863902] BTRFS critical (device sdo): fatal error on device /dev/sdo
      [63122.935338] BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
      [63122.946554] IP: fail_bio_stripe+0x58/0xa0 [btrfs]
      [63122.958185] PGD 9ecda067 P4D 9ecda067 PUD b2b37067 PMD 0
      [63122.971202] Oops: 0000 [#1] SMP
      [63123.006760] CPU: 0 PID: 3979 Comm: kworker/u8:9 Tainted: G W 4.14.2-16-scst34x+ #8
      [63123.007091] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      [63123.007402] Workqueue: btrfs-worker btrfs_worker_helper [btrfs]
      [63123.007595] task: ffff880036ea4040 task.stack: ffffc90006384000
      [63123.007796] RIP: 0010:fail_bio_stripe+0x58/0xa0 [btrfs]
      [63123.007968] RSP: 0018:ffffc90006387ad8 EFLAGS: 00010287
      [63123.008140] RAX: 0000000000000002 RBX: ffff88004beaa0b8 RCX: ffff8800b2bd5690
      [63123.008359] RDX: 0000000000000000 RSI: ffff88007bb43500 RDI: ffff88004beaa000
      [63123.008621] RBP: ffffc90006387ae8 R08: 0000000099100000 R09: ffff8800b2bd5600
      [63123.008840] R10: 0000000000000004 R11: 0000000000010000 R12: ffff88007bb43500
      [63123.009059] R13: 00000000fffffffb R14: ffff880036fc5180 R15: 0000000000000004
      [63123.009278] FS: 0000000000000000(0000) GS:ffff8800b7000000(0000) knlGS:0000000000000000
      [63123.009564] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [63123.009748] CR2: 0000000000000080 CR3: 00000000b0866000 CR4: 00000000000406f0
      [63123.009969] Call Trace:
      [63123.010085] raid_write_end_io+0x7e/0x80 [btrfs]
      [63123.010251] bio_endio+0xa1/0x120
      [63123.010378] generic_make_request+0x218/0x270
      [63123.010921] submit_bio+0x66/0x130
      [63123.011073] finish_rmw+0x3fc/0x5b0 [btrfs]
      [63123.011245] full_stripe_write+0x96/0xc0 [btrfs]
      [63123.011428] raid56_parity_write+0x117/0x170 [btrfs]
      [63123.011604] btrfs_map_bio+0x2ec/0x320 [btrfs]
      [63123.011759] ? ___cache_free+0x1c5/0x300
      [63123.011909] __btrfs_submit_bio_done+0x26/0x50 [btrfs]
      [63123.012087] run_one_async_done+0x9c/0xc0 [btrfs]
      [63123.012257] normal_work_helper+0x19e/0x300 [btrfs]
      [63123.012429] btrfs_worker_helper+0x12/0x20 [btrfs]
      [63123.012656] process_one_work+0x14d/0x350
      [63123.012888] worker_thread+0x4d/0x3a0
      [63123.013026] ? _raw_spin_unlock_irqrestore+0x15/0x20
      [63123.013192] kthread+0x109/0x140
      [63123.013315] ? process_scheduled_works+0x40/0x40
      [63123.013472] ? kthread_stop+0x110/0x110
      [63123.013610] ret_from_fork+0x25/0x30
      [63123.014469] RIP: fail_bio_stripe+0x58/0xa0 [btrfs] RSP: ffffc90006387ad8
      [63123.014678] CR2: 0000000000000080
      [63123.016590] ---[ end trace a295ea7259c17880 ]—
      
      This is reproducible in a cycle, where a series of writes is followed by
      SCSI device delete command. The test may take up to few minutes.
      
      Fixes: 74d46992 ("block: replace bi_bdev with a gendisk pointer and partitions index")
      [ no signed-off-by provided ]
      Author: Dmitriy Gorokh <Dmitriy.Gorokh@wdc.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      047fdea6
    • Tejun Heo's avatar
      fs/aio: Use RCU accessors for kioctx_table->table[] · d0264c01
      Tejun Heo authored
      
      
      While converting ioctx index from a list to a table, db446a08
      ("aio: convert the ioctx list to table lookup v3") missed tagging
      kioctx_table->table[] as an array of RCU pointers and using the
      appropriate RCU accessors.  This introduces a small window in the
      lookup path where init and access may race.
      
      Mark kioctx_table->table[] with __rcu and use the approriate RCU
      accessors when using the field.
      
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarJann Horn <jannh@google.com>
      Fixes: db446a08 ("aio: convert the ioctx list to table lookup v3")
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: stable@vger.kernel.org # v3.12+
      d0264c01
    • Tejun Heo's avatar
      fs/aio: Add explicit RCU grace period when freeing kioctx · a6d7cff4
      Tejun Heo authored
      
      
      While fixing refcounting, e34ecee2 ("aio: Fix a trinity splat")
      incorrectly removed explicit RCU grace period before freeing kioctx.
      The intention seems to be depending on the internal RCU grace periods
      of percpu_ref; however, percpu_ref uses a different flavor of RCU,
      sched-RCU.  This can lead to kioctx being freed while RCU read
      protected dereferences are still in progress.
      
      Fix it by updating free_ioctx() to go through call_rcu() explicitly.
      
      v2: Comment added to explain double bouncing.
      
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Reported-by: default avatarJann Horn <jannh@google.com>
      Fixes: e34ecee2 ("aio: Fix a trinity splat")
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: stable@vger.kernel.org # v3.13+
      a6d7cff4
  9. Mar 08, 2018
  10. Mar 07, 2018
  11. Mar 01, 2018
    • Christoph Hellwig's avatar
      xfs: don't block on the ilock for RWF_NOWAIT · ff3d8b9c
      Christoph Hellwig authored
      
      
      Fix xfs_file_iomap_begin to trylock the ilock if IOMAP_NOWAIT is passed,
      so that we don't block io_submit callers.
      
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      ff3d8b9c
    • Christoph Hellwig's avatar
      xfs: don't start out with the exclusive ilock for direct I/O · af5b5afe
      Christoph Hellwig authored
      
      
      There is no reason to take the ilock exclusively at the start of
      xfs_file_iomap_begin for direct I/O, given that it will be demoted
      just before calling xfs_iomap_write_direct anyway.
      
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      af5b5afe
    • Christoph Hellwig's avatar
      xfs: don't allocate COW blocks for zeroing holes or unwritten extents · 172ed391
      Christoph Hellwig authored
      
      
      The iomap zeroing interface is smart enough to skip zeroing holes or
      unwritten extents.  Don't subvert this logic for reflink files.
      
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      172ed391
    • Chengguang Xu's avatar
      ceph: fix potential memory leak in init_caches() · 1c789249
      Chengguang Xu authored
      
      
      There is lack of cache destroy operation for ceph_file_cachep
      when failing from fscache register.
      
      Signed-off-by: default avatarChengguang Xu <cgxu519@icloud.com>
      Reviewed-by: default avatarIlya Dryomov <idryomov@gmail.com>
      Signed-off-by: default avatarIlya Dryomov <idryomov@gmail.com>
      1c789249
    • Filipe Manana's avatar
      Btrfs: fix log replay failure after unlink and link combination · 1f250e92
      Filipe Manana authored
      
      
      If we have a file with 2 (or more) hard links in the same directory,
      remove one of the hard links, create a new file (or link an existing file)
      in the same directory with the name of the removed hard link, and then
      finally fsync the new file, we end up with a log that fails to replay,
      causing a mount failure.
      
      Example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/testdir
        $ touch /mnt/testdir/foo
        $ ln /mnt/testdir/foo /mnt/testdir/bar
      
        $ sync
      
        $ unlink /mnt/testdir/bar
        $ touch /mnt/testdir/bar
        $ xfs_io -c "fsync" /mnt/testdir/bar
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        mount: mount(2) failed: /mnt: No such file or directory
      
      When replaying the log, for that example, we also see the following in
      dmesg/syslog:
      
        [71813.671307] BTRFS info (device dm-0): failed to delete reference to bar, inode 258 parent 257
        [71813.674204] ------------[ cut here ]------------
        [71813.675694] BTRFS: Transaction aborted (error -2)
        [71813.677236] WARNING: CPU: 1 PID: 13231 at fs/btrfs/inode.c:4128 __btrfs_unlink_inode+0x17b/0x355 [btrfs]
        [71813.679669] Modules linked in: btrfs xfs f2fs dm_flakey dm_mod dax ghash_clmulni_intel ppdev pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper evdev psmouse i2c_piix4 parport_pc i2c_core pcspkr sg serio_raw parport button sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel floppy virtio e1000 scsi_mod [last unloaded: btrfs]
        [71813.679669] CPU: 1 PID: 13231 Comm: mount Tainted: G        W        4.15.0-rc9-btrfs-next-56+ #1
        [71813.679669] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
        [71813.679669] RIP: 0010:__btrfs_unlink_inode+0x17b/0x355 [btrfs]
        [71813.679669] RSP: 0018:ffffc90001cef738 EFLAGS: 00010286
        [71813.679669] RAX: 0000000000000025 RBX: ffff880217ce4708 RCX: 0000000000000001
        [71813.679669] RDX: 0000000000000000 RSI: ffffffff81c14bae RDI: 00000000ffffffff
        [71813.679669] RBP: ffffc90001cef7c0 R08: 0000000000000001 R09: 0000000000000001
        [71813.679669] R10: ffffc90001cef5e0 R11: ffffffff8343f007 R12: ffff880217d474c8
        [71813.679669] R13: 00000000fffffffe R14: ffff88021ccf1548 R15: 0000000000000101
        [71813.679669] FS:  00007f7cee84c480(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
        [71813.679669] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [71813.679669] CR2: 00007f7cedc1abf9 CR3: 00000002354b4003 CR4: 00000000001606e0
        [71813.679669] Call Trace:
        [71813.679669]  btrfs_unlink_inode+0x17/0x41 [btrfs]
        [71813.679669]  drop_one_dir_item+0xfa/0x131 [btrfs]
        [71813.679669]  add_inode_ref+0x71e/0x851 [btrfs]
        [71813.679669]  ? __lock_is_held+0x39/0x71
        [71813.679669]  ? replay_one_buffer+0x53/0x53a [btrfs]
        [71813.679669]  replay_one_buffer+0x4a4/0x53a [btrfs]
        [71813.679669]  ? rcu_read_unlock+0x3a/0x57
        [71813.679669]  ? __lock_is_held+0x39/0x71
        [71813.679669]  walk_up_log_tree+0x101/0x1d2 [btrfs]
        [71813.679669]  walk_log_tree+0xad/0x188 [btrfs]
        [71813.679669]  btrfs_recover_log_trees+0x1fa/0x31e [btrfs]
        [71813.679669]  ? replay_one_extent+0x544/0x544 [btrfs]
        [71813.679669]  open_ctree+0x1cf6/0x2209 [btrfs]
        [71813.679669]  btrfs_mount_root+0x368/0x482 [btrfs]
        [71813.679669]  ? trace_hardirqs_on_caller+0x14c/0x1a6
        [71813.679669]  ? __lockdep_init_map+0x176/0x1c2
        [71813.679669]  ? mount_fs+0x64/0x10b
        [71813.679669]  mount_fs+0x64/0x10b
        [71813.679669]  vfs_kern_mount+0x68/0xce
        [71813.679669]  btrfs_mount+0x13e/0x772 [btrfs]
        [71813.679669]  ? trace_hardirqs_on_caller+0x14c/0x1a6
        [71813.679669]  ? __lockdep_init_map+0x176/0x1c2
        [71813.679669]  ? mount_fs+0x64/0x10b
        [71813.679669]  mount_fs+0x64/0x10b
        [71813.679669]  vfs_kern_mount+0x68/0xce
        [71813.679669]  do_mount+0x6e5/0x973
        [71813.679669]  ? memdup_user+0x3e/0x5c
        [71813.679669]  SyS_mount+0x72/0x98
        [71813.679669]  entry_SYSCALL_64_fastpath+0x1e/0x8b
        [71813.679669] RIP: 0033:0x7f7cedf150ba
        [71813.679669] RSP: 002b:00007ffca71da688 EFLAGS: 00000206
        [71813.679669] Code: 7f a0 e8 51 0c fd ff 48 8b 43 50 f0 0f ba a8 30 2c 00 00 02 72 17 41 83 fd fb 74 11 44 89 ee 48 c7 c7 7d 11 7f a0 e8 38 f5 8d e0 <0f> ff 44 89 e9 ba 20 10 00 00 eb 4d 48 8b 4d b0 48 8b 75 88 4c
        [71813.679669] ---[ end trace 83bd473fc5b4663b ]---
        [71813.854764] BTRFS: error (device dm-0) in __btrfs_unlink_inode:4128: errno=-2 No such entry
        [71813.886994] BTRFS: error (device dm-0) in btrfs_replay_log:2307: errno=-2 No such entry (Failed to recover log tree)
        [71813.903357] BTRFS error (device dm-0): cleaner transaction attach returned -30
        [71814.128078] BTRFS error (device dm-0): open_ctree failed
      
      This happens because the log has inode reference items for both inode 258
      (the first file we created) and inode 259 (the second file created), and
      when processing the reference item for inode 258, we replace the
      corresponding item in the subvolume tree (which has two names, "foo" and
      "bar") witht he one in the log (which only has one name, "foo") without
      removing the corresponding dir index keys from the parent directory.
      Later, when processing the inode reference item for inode 259, which has
      a name of "bar" associated to it, we notice that dir index entries exist
      for that name and for a different inode, so we attempt to unlink that
      name, which fails because the inode reference item for inode 258 no longer
      has the name "bar" associated to it, making a call to btrfs_unlink_inode()
      fail with a -ENOENT error.
      
      Fix this by unlinking all the names in an inode reference item from a
      subvolume tree that are not present in the inode reference item found in
      the log tree, before overwriting it with the item from the log tree.
      
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1f250e92
    • Filipe Manana's avatar
      Btrfs: fix log replay failure after linking special file and fsync · 9a6509c4
      Filipe Manana authored
      
      
      If in the same transaction we rename a special file (fifo, character/block
      device or symbolic link), create a hard link for it having its old name
      then sync the log, we will end up with a log that can not be replayed and
      at when attempting to replay it, an EEXIST error is returned and mounting
      the filesystem fails. Example scenario:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
        $ mkdir /mnt/testdir
        $ mkfifo /mnt/testdir/foo
        # Make sure everything done so far is durably persisted.
        $ sync
      
        # Create some unrelated file and fsync it, this is just to create a log
        # tree. The file must be in the same directory as our special file.
        $ touch /mnt/testdir/f1
        $ xfs_io -c "fsync" /mnt/testdir/f1
      
        # Rename our special file and then create a hard link with its old name.
        $ mv /mnt/testdir/foo /mnt/testdir/bar
        $ ln /mnt/testdir/bar /mnt/testdir/foo
      
        # Create some other unrelated file and fsync it, this is just to persist
        # the log tree which was modified by the previous rename and link
        # operations. Alternatively we could have modified file f1 and fsync it.
        $ touch /mnt/f2
        $ xfs_io -c "fsync" /mnt/f2
      
        <power failure>
      
        $ mount /dev/sdc /mnt
        mount: mount /dev/sdc on /mnt failed: File exists
      
      This happens because when both the log tree and the subvolume's tree have
      an entry in the directory "testdir" with the same name, that is, there
      is one key (258 INODE_REF 257) in the subvolume tree and another one in
      the log tree (where 258 is the inode number of our special file and 257
      is the inode for directory "testdir"). Only the data of those two keys
      differs, in the subvolume tree the index field for inode reference has
      a value of 3 while the log tree it has a value of 5. Because the same key
      exists in both trees, but have different index, the log replay fails with
      an -EEXIST error when attempting to replay the inode reference from the
      log tree.
      
      Fix this by setting the last_unlink_trans field of the inode (our special
      file) to the current transaction id when a hard link is created, as this
      forces logging the parent directory inode, solving the conflict at log
      replay time.
      
      A new generic test case for fstests was also submitted.
      
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9a6509c4
    • Filipe Manana's avatar
      Btrfs: send, fix issuing write op when processing hole in no data mode · d4dfc0f4
      Filipe Manana authored
      
      
      When doing an incremental send of a filesystem with the no-holes feature
      enabled, we end up issuing a write operation when using the no data mode
      send flag, instead of issuing an update extent operation. Fix this by
      issuing the update extent operation instead.
      
      Trivial reproducer:
      
        $ mkfs.btrfs -f -O no-holes /dev/sdc
        $ mkfs.btrfs -f /dev/sdd
        $ mount /dev/sdc /mnt/sdc
        $ mount /dev/sdd /mnt/sdd
      
        $ xfs_io -f -c "pwrite -S 0xab 0 32K" /mnt/sdc/foobar
        $ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap1
      
        $ xfs_io -c "fpunch 8K 8K" /mnt/sdc/foobar
        $ btrfs subvolume snapshot -r /mnt/sdc /mnt/sdc/snap2
      
        $ btrfs send /mnt/sdc/snap1 | btrfs receive /mnt/sdd
        $ btrfs send --no-data -p /mnt/sdc/snap1 /mnt/sdc/snap2 \
             | btrfs receive -vv /mnt/sdd
      
      Before this change the output of the second receive command is:
      
        receiving snapshot snap2 uuid=f6922049-8c22-e544-9ff9-fc6755918447...
        utimes
        write foobar, offset 8192, len 8192
        utimes foobar
        BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=f6922049-8c22-e544-9ff9-...
      
      After this change it is:
      
        receiving snapshot snap2 uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64...
        utimes
        update_extent foobar: offset=8192, len=8192
        utimes foobar
        BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=564d36a3-ebc8-7343-aec9-bf6fda278e64...
      
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d4dfc0f4
    • Anand Jain's avatar
      btrfs: use proper endianness accessors for super_copy · 3c181c12
      Anand Jain authored
      
      
      The fs_info::super_copy is a byte copy of the on-disk structure and all
      members must use the accessor macros/functions to obtain the right
      value.  This was missing in update_super_roots and in sysfs readers.
      
      Moving between opposite endianness hosts will report bogus numbers in
      sysfs, and mount may fail as the root will not be restored correctly. If
      the filesystem is always used on a same endian host, this will not be a
      problem.
      
      Fix this by using the btrfs_set_super...() functions to set
      fs_info::super_copy values, and for the sysfs, use the cached
      fs_info::nodesize/sectorsize values.
      
      CC: stable@vger.kernel.org
      Fixes: df93589a ("btrfs: export more from FS_INFO to sysfs")
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3c181c12
    • Hans van Kranenburg's avatar
      btrfs: alloc_chunk: fix DUP stripe size handling · 92e222df
      Hans van Kranenburg authored
      
      
      In case of using DUP, we search for enough unallocated disk space on a
      device to hold two stripes.
      
      The devices_info[ndevs-1].max_avail that holds the amount of unallocated
      space found is directly assigned to stripe_size, while it's actually
      twice the stripe size.
      
      Later on in the code, an unconditional division of stripe_size by
      dev_stripes corrects the value, but in the meantime there's a check to
      see if the stripe_size does not exceed max_chunk_size. Since during this
      check stripe_size is twice the amount as intended, the check will reduce
      the stripe_size to max_chunk_size if the actual correct to be used
      stripe_size is more than half the amount of max_chunk_size.
      
      The unconditional division later tries to correct stripe_size, but will
      actually make sure we can't allocate more than half the max_chunk_size.
      
      Fix this by moving the division by dev_stripes before the max chunk size
      check, so it always contains the right value, instead of putting a duct
      tape division in further on to get it fixed again.
      
      Since in all other cases than DUP, dev_stripes is 1, this change only
      affects DUP.
      
      Other attempts in the past were made to fix this:
      * 37db63a4 "Btrfs: fix max chunk size check in chunk allocator" tried
      to fix the same problem, but still resulted in part of the code acting
      on a wrongly doubled stripe_size value.
      * 86db2578 "Btrfs: fix max chunk size on raid5/6" unintentionally
      broke this fix again.
      
      The real problem was already introduced with the rest of the code in
      73c5de00.
      
      The user visible result however will be that the max chunk size for DUP
      will suddenly double, while it's actually acting according to the limits
      in the code again like it was 5 years ago.
      
      Reported-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Link: https://www.spinics.net/lists/linux-btrfs/msg69752.html
      
      
      Fixes: 73c5de00 ("btrfs: quasi-round-robin for chunk allocation")
      Fixes: 86db2578 ("Btrfs: fix max chunk size on raid5/6")
      Signed-off-by: default avatarHans van Kranenburg <hans.van.kranenburg@mendix.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ update comment ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      92e222df
    • Nikolay Borisov's avatar
      btrfs: Handle btrfs_set_extent_delalloc failure in relocate_file_extent_cluster · 765f3ceb
      Nikolay Borisov authored
      
      
      Essentially duplicate the error handling from the above block which
      handles the !PageUptodate(page) case and additionally clear
      EXTENT_BOUNDARY.
      
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      765f3ceb
    • Nikolay Borisov's avatar
      btrfs: handle failure of add_pending_csums · ac01f26a
      Nikolay Borisov authored
      
      
      add_pending_csums was added as part of the new data=ordered
      implementation in e6dcd2dc ("Btrfs: New data=ordered
      implementation"). Even back then it called the btrfs_csum_file_blocks
      which can fail but it never bothered handling the failure. In ENOMEM
      situation this could lead to the filesystem failing to write the
      checksums for a particular extent and not detect this. On read this
      could lead to the filesystem erroring out due to crc mismatch. Fix it by
      propagating failure from add_pending_csums and handling them.
      
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJosef Bacik <jbacik@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ac01f26a
    • Jeff Mahoney's avatar
      btrfs: use kvzalloc to allocate btrfs_fs_info · a8fd1f71
      Jeff Mahoney authored
      
      
      The srcu_struct in btrfs_fs_info scales in size with NR_CPUS.  On
      kernels built with NR_CPUS=8192, this can result in kmalloc failures
      that prevent mounting.
      
      There is work in progress to try to resolve this for every user of
      srcu_struct but using kvzalloc will work around the failures until
      that is complete.
      
      As an example with NR_CPUS=512 on x86_64: the overall size of
      subvol_srcu is 3460 bytes, fs_info is 6496.
      
      Signed-off-by: default avatarJeff Mahoney <jeffm@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a8fd1f71
  12. Feb 27, 2018
  13. Feb 26, 2018
    • Chengguang Xu's avatar
      xfs: fix potential memory leak in mount option parsing · 5b4c845e
      Chengguang Xu authored
      
      
      When specifying string type mount option (e.g., logdev)
      several times in a mount, current option parsing may
      cause memory leak. Hence, call kfree for previous one
      in this case.
      
      Signed-off-by: default avatarChengguang Xu <cgxu519@icloud.com>
      Reviewed-by: default avatarEric Sandeen <sandeen@redhat.com>
      Reviewed-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: default avatarDarrick J. Wong <darrick.wong@oracle.com>
      5b4c845e
    • Jan Kara's avatar
      blockdev: Avoid two active bdev inodes for one device · 560e7cb2
      Jan Kara authored
      
      
      When blkdev_open() races with device removal and creation it can happen
      that unhashed bdev inode gets associated with newly created gendisk
      like:
      
      CPU0					CPU1
      blkdev_open()
        bdev = bd_acquire()
      					del_gendisk()
      					  bdev_unhash_inode(bdev);
      					remove device
      					create new device with the same number
        __blkdev_get()
          disk = get_gendisk()
            - gets reference to gendisk of the new device
      
      Now another blkdev_open() will not find original 'bdev' as it got
      unhashed, create a new one and associate it with the same 'disk' at
      which point problems start as we have two independent page caches for
      one device.
      
      Fix the problem by verifying that the bdev inode didn't get unhashed
      before we acquired gendisk reference. That way we make sure gendisk can
      get associated only with visible bdev inodes.
      
      Tested-by: default avatarHou Tao <houtao1@huawei.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      560e7cb2
    • Jan Kara's avatar
      genhd: Fix use after free in __blkdev_get() · 89736653
      Jan Kara authored
      
      
      When two blkdev_open() calls race with device removal and recreation,
      __blkdev_get() can use looked up gendisk after it is freed:
      
      CPU0				CPU1			CPU2
      							del_gendisk(disk);
      							  bdev_unhash_inode(inode);
      blkdev_open()			blkdev_open()
        bdev = bd_acquire(inode);
          - creates and returns new inode
      				  bdev = bd_acquire(inode);
      				    - returns the same inode
        __blkdev_get(devt)		  __blkdev_get(devt)
          disk = get_gendisk(devt);
            - got structure of device going away
      							<finish device removal>
      							<new device gets
      							 created under the same
      							 device number>
      				  disk = get_gendisk(devt);
      				    - got new device structure
      				  if (!bdev->bd_openers) {
      				    does the first open
      				  }
          if (!bdev->bd_openers)
            - false
          } else {
            put_disk_and_module(disk)
              - remember this was old device - this was last ref and disk is
                now freed
          }
          disk_unblock_events(disk); -> oops
      
      Fix the problem by making sure we drop reference to disk in
      __blkdev_get() only after we are really done with it.
      
      Reported-by: default avatarHou Tao <houtao1@huawei.com>
      Tested-by: default avatarHou Tao <houtao1@huawei.com>
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      89736653
    • Jan Kara's avatar
      genhd: Add helper put_disk_and_module() · 9df6c299
      Jan Kara authored
      
      
      Add a proper counterpart to get_disk_and_module() -
      put_disk_and_module(). Currently it is opencoded in several places.
      
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      9df6c299
    • Jan Kara's avatar
      direct-io: Fix sleep in atomic due to sync AIO · d9c10e5b
      Jan Kara authored
      
      
      Commit e864f395 "fs: add RWF_DSYNC aand RWF_SYNC" added additional
      way for direct IO to become synchronous and thus trigger fsync from the
      IO completion handler. Then commit 9830f4be "fs: Use RWF_* flags for
      AIO operations" allowed these flags to be set for AIO as well. However
      that commit forgot to update the condition checking whether the IO
      completion handling should be defered to a workqueue and thus AIO DIO
      with RWF_[D]SYNC set will call fsync() from IRQ context resulting in
      sleep in atomic.
      
      Fix the problem by checking directly iocb flags (the same way as it is
      done in dio_complete()) instead of checking all conditions that could
      lead to IO being synchronous.
      
      CC: Christoph Hellwig <hch@lst.de>
      CC: Goldwyn Rodrigues <rgoldwyn@suse.com>
      CC: stable@vger.kernel.org
      Reported-by: default avatarMark Rutland <mark.rutland@arm.com>
      Tested-by: default avatarMark Rutland <mark.rutland@arm.com>
      Fixes: 9830f4be
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      d9c10e5b
    • Vivek Goyal's avatar
      ovl: redirect_dir=nofollow should not follow redirect for opaque lower · d1fe96c0
      Vivek Goyal authored
      
      
      redirect_dir=nofollow should not follow a redirect. But in a specific
      configuration it can still follow it.  For example try this.
      
      $ mkdir -p lower0 lower1/foo upper work merged
      $ touch lower1/foo/lower-file.txt
      $ setfattr -n "trusted.overlay.opaque" -v "y" lower1/foo
      $ mount -t overlay -o lowerdir=lower1:lower0,workdir=work,upperdir=upper,redirect_dir=on none merged
      $ cd merged
      $ mv foo foo-renamed
      $ umount merged
      
      # mount again. This time with redirect_dir=nofollow
      $ mount -t overlay -o lowerdir=lower1:lower0,workdir=work,upperdir=upper,redirect_dir=nofollow none merged
      $ ls merged/foo-renamed/
      # This lists lower-file.txt, while it should not have.
      
      Basically, we are doing redirect check after we check for d.stop. And
      if this is not last lower, and we find an opaque lower, d.stop will be
      set.
      
      ovl_lookup_single()
              if (!d->last && ovl_is_opaquedir(this)) {
                      d->stop = d->opaque = true;
                      goto out;
              }
      
      To fix this, first check redirect is allowed. And after that check if
      d.stop has been set or not.
      
      Signed-off-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Fixes: 438c84c2 ("ovl: don't follow redirects if redirect_dir=off")
      Cc: <stable@vger.kernel.org> #v4.15
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      d1fe96c0
    • Chengguang Xu's avatar
      ceph: fix dentry leak when failing to init debugfs · 18106734
      Chengguang Xu authored
      
      
      When failing from ceph_fs_debugfs_init() in ceph_real_mount(),
      there is lack of dput of root_dentry and it causes slab errors,
      so change the calling order of ceph_fs_debugfs_init() and
      open_root_dentry() and do some cleanups to avoid this issue.
      
      Signed-off-by: default avatarChengguang Xu <cgxu519@icloud.com>
      Reviewed-by: default avatar"Yan, Zheng" <zyan@redhat.com>
      Signed-off-by: default avatarIlya Dryomov <idryomov@gmail.com>
      18106734
    • Chengguang Xu's avatar
      libceph, ceph: avoid memory leak when specifying same option several times · 937441f3
      Chengguang Xu authored
      
      
      When parsing string option, in order to avoid memory leak we need to
      carefully free it first in case of specifying same option several times.
      
      Signed-off-by: default avatarChengguang Xu <cgxu519@icloud.com>
      Reviewed-by: default avatarIlya Dryomov <idryomov@gmail.com>
      Signed-off-by: default avatarIlya Dryomov <idryomov@gmail.com>
      937441f3
Loading