  1. Feb 09, 2019
    • bcache: export backing_dev_name via sysfs · 926d1946
      Coly Li authored
      
      
      This patch exports dc->backing_dev_name via the sysfs file
      /sys/block/bcache<?>/bcache/backing_dev_name, so that people or
      user-space tools can learn the backing device name of a bcache device.
      
      Of course this can also be done by parsing sysfs links, but this method
      makes it much simpler to find the link between a bcache device and its
      backing device.
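      As a rough illustration of the idea only (bcache's real sysfs code uses
      its own SHOW()/STORE() helper macros, and dev_to_cached_dev() below is a
      hypothetical accessor), a read-only string attribute could look like:
      
        /* Generic sketch, not the actual bcache sysfs implementation. */
        static ssize_t backing_dev_name_show(struct device *dev,
                                             struct device_attribute *attr,
                                             char *buf)
        {
                struct cached_dev *dc = dev_to_cached_dev(dev); /* hypothetical */
      
                return snprintf(buf, PAGE_SIZE, "%s\n", dc->backing_dev_name);
        }
        static DEVICE_ATTR_RO(backing_dev_name);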
      
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      926d1946
    • bcache: not use hard coded memset size in bch_cache_accounting_clear() · 83ff9318
      Coly Li authored
      
      
      In stats.c:bch_cache_accounting_clear(), a hard-coded number '7' is
      used in memset(), because struct cache_stats has 7 atomic_t members.
      This is fragile: when new members are added to struct cache_stats, the
      hard-coded number will only clear part of the memory.
      
      This patch replaces 'sizeof(unsigned long) * 7' with the more generic
      'sizeof(struct cache_stats)', to avoid potential errors if new members
      are added to struct cache_stats.
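      The general pattern is shown in the hypothetical sketch below (the
      struct and field names are illustrative, not the actual bcache
      definitions):
      
        /* Hypothetical example of the pattern, not the actual bcache code. */
        struct example_stats {
                atomic_t hits;
                atomic_t misses;
                atomic_t bypassed;      /* more counters may be added later */
        };
      
        static void example_stats_clear(struct example_stats *stats)
        {
                /* Fragile: memset(stats, 0, sizeof(unsigned long) * 3); */
                memset(stats, 0, sizeof(struct example_stats));
        }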
      
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      83ff9318
    • bcache: never writeback a discard operation · 9951379b
      Daniel Axtens authored
      Some users see panics like the following when performing fstrim on a
      bcached volume:
      
      [  529.803060] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      [  530.183928] #PF error: [normal kernel read fault]
      [  530.412392] PGD 8000001f42163067 P4D 8000001f42163067 PUD 1f42168067 PMD 0
      [  530.750887] Oops: 0000 [#1] SMP PTI
      [  530.920869] CPU: 10 PID: 4167 Comm: fstrim Kdump: loaded Not tainted 5.0.0-rc1+ #3
      [  531.290204] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 12/27/2015
      [  531.693137] RIP: 0010:blk_queue_split+0x148/0x620
      [  531.922205] Code: 60 38 89 55 a0 45 31 db 45 31 f6 45 31 c9 31 ff 89 4d 98 85 db 0f 84 7f 04 00 00 44 8b 6d 98 4c 89 ee 48 c1 e6 04 49 03 70 78 <8b> 46 08 44 8b 56 0c 48
      8b 16 44 29 e0 39 d8 48 89 55 a8 0f 47 c3
      [  532.838634] RSP: 0018:ffffb9b708df39b0 EFLAGS: 00010246
      [  533.093571] RAX: 00000000ffffffff RBX: 0000000000046000 RCX: 0000000000000000
      [  533.441865] RDX: 0000000000000200 RSI: 0000000000000000 RDI: 0000000000000000
      [  533.789922] RBP: ffffb9b708df3a48 R08: ffff940d3b3fdd20 R09: 0000000000000000
      [  534.137512] R10: ffffb9b708df3958 R11: 0000000000000000 R12: 0000000000000000
      [  534.485329] R13: 0000000000000000 R14: 0000000000000000 R15: ffff940d39212020
      [  534.833319] FS:  00007efec26e3840(0000) GS:ffff940d1f480000(0000) knlGS:0000000000000000
      [  535.224098] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  535.504318] CR2: 0000000000000008 CR3: 0000001f4e256004 CR4: 00000000001606e0
      [  535.851759] Call Trace:
      [  535.970308]  ? mempool_alloc_slab+0x15/0x20
      [  536.174152]  ? bch_data_insert+0x42/0xd0 [bcache]
      [  536.403399]  blk_mq_make_request+0x97/0x4f0
      [  536.607036]  generic_make_request+0x1e2/0x410
      [  536.819164]  submit_bio+0x73/0x150
      [  536.980168]  ? submit_bio+0x73/0x150
      [  537.149731]  ? bio_associate_blkg_from_css+0x3b/0x60
      [  537.391595]  ? _cond_resched+0x1a/0x50
      [  537.573774]  submit_bio_wait+0x59/0x90
      [  537.756105]  blkdev_issue_discard+0x80/0xd0
      [  537.959590]  ext4_trim_fs+0x4a9/0x9e0
      [  538.137636]  ? ext4_trim_fs+0x4a9/0x9e0
      [  538.324087]  ext4_ioctl+0xea4/0x1530
      [  538.497712]  ? _copy_to_user+0x2a/0x40
      [  538.679632]  do_vfs_ioctl+0xa6/0x600
      [  538.853127]  ? __do_sys_newfstat+0x44/0x70
      [  539.051951]  ksys_ioctl+0x6d/0x80
      [  539.212785]  __x64_sys_ioctl+0x1a/0x20
      [  539.394918]  do_syscall_64+0x5a/0x110
      [  539.568674]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      We have observed it where both:
      1) LVM/devmapper is involved (bcache backing device is LVM volume) and
      2) writeback cache is involved (bcache cache_mode is writeback)
      
      On one machine, we can reliably reproduce it with:
      
       # echo writeback > /sys/block/bcache0/bcache/cache_mode
         (not sure whether above line is required)
       # mount /dev/bcache0 /test
       # for i in {0..10}; do
      	file="$(mktemp /test/zero.XXX)"
      	dd if=/dev/zero of="$file" bs=1M count=256
      	sync
      	rm $file
          done
        # fstrim -v /test
      
      Observing this with tracepoints on, we see the following writes:
      
      fstrim-18019 [022] .... 91107.302026: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 4260112 + 196352 hit 0 bypass 1
      fstrim-18019 [022] .... 91107.302050: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 4456464 + 262144 hit 0 bypass 1
      fstrim-18019 [022] .... 91107.302075: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 4718608 + 81920 hit 0 bypass 1
      fstrim-18019 [022] .... 91107.302094: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 5324816 + 180224 hit 0 bypass 1
      fstrim-18019 [022] .... 91107.302121: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 5505040 + 262144 hit 0 bypass 1
      fstrim-18019 [022] .... 91107.302145: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 5767184 + 81920 hit 0 bypass 1
      fstrim-18019 [022] .... 91107.308777: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 6373392 + 180224 hit 1 bypass 0
      <crash>
      
      Note the final one has different hit/bypass flags.
      
      This is because in should_writeback(), we were hitting a case where
      the partial stripe condition was returning true and so
      should_writeback() was returning true early.
      
      If that hadn't been the case, it would have hit the would_skip test, and
      as would_skip == s->iop.bypass == true, should_writeback() would have
      returned false.
      
      Looking at the git history from 'commit 72c27061 ("bcache: Write out
      full stripes")', it looks like the idea was to optimise for raid5/6:
      
             * If a stripe is already dirty, force writes to that stripe to
      	 writeback mode - to help build up full stripes of dirty data
      
      To fix this issue, make sure that should_writeback() on a discard op
      never returns true.
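      A minimal sketch of the idea (the real should_writeback() in bcache's
      writeback.h takes different parameters and applies several other
      checks; this only shows where the discard bail-out sits relative to the
      partial-stripe and would_skip tests):
      
        /* Sketch only; not the exact bcache code. */
        static bool should_writeback(struct bio *bio, bool would_skip,
                                     bool partial_stripe_dirty)
        {
                if (bio_op(bio) == REQ_OP_DISCARD)
                        return false;   /* never writeback a discard */
      
                if (partial_stripe_dirty)
                        return true;    /* the early-true path hit before */
      
                if (would_skip)
                        return false;
      
                return true;
        }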
      
      More details of debugging:
      https://www.spinics.net/lists/linux-bcache/msg06996.html
      
      Previous reports:
       - https://bugzilla.kernel.org/show_bug.cgi?id=201051
       - https://bugzilla.kernel.org/show_bug.cgi?id=196103
       - https://www.spinics.net/lists/linux-bcache/msg06885.html
      
      
      
      (Coly Li: minor modification to follow maximum 75 chars per line rule)
      
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: stable@vger.kernel.org
      Fixes: 72c27061 ("bcache: Write out full stripes")
      Signed-off-by: Daniel Axtens <dja@axtens.net>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9951379b
  2. Feb 04, 2019
  3. Jan 22, 2019
  4. Jan 21, 2019
    • dm: fix redundant IO accounting for bios that need splitting · a1e1cb72
      Mike Snitzer authored
      
      
      The risk of redundant IO accounting was not taken into consideration
      when commit 18a25da8 ("dm: ensure bio submission follows a
      depth-first tree walk") introduced IO splitting in terms of recursion
      via generic_make_request().
      
      Fix this by subtracting the split bio's payload from the IO stats that
      were already accounted for by start_io_acct() upon dm_make_request()
      entry.  This repeated up-then-down adjustment of the IO accounting
      isn't ideal, but refactoring DM core's IO splitting to pre-split bios
      _before_ they are accounted turned out to be an excessive amount of
      change that will need a full development cycle to refine and verify.
      
      Before this fix:
      
        /dev/mapper/stripe_dev is a 4-way stripe using a 32k chunksize, so
        bios are split on 32k boundaries.
      
        # fio --name=16M --filename=/dev/mapper/stripe_dev --rw=write --bs=64k --size=16M \
          	--iodepth=1 --ioengine=libaio --direct=1 --refill_buffers
      
        with debugging added:
        [103898.310264] device-mapper: core: start_io_acct: dm-2 WRITE bio->bi_iter.bi_sector=0 len=128
        [103898.318704] device-mapper: core: __split_and_process_bio: recursing for following split bio:
        [103898.329136] device-mapper: core: start_io_acct: dm-2 WRITE bio->bi_iter.bi_sector=64 len=64
        ...
      
        16M written yet 136M (278528 * 512b) accounted:
        # cat /sys/block/dm-2/stat | awk '{ print $7 }'
        278528
      
      After this fix:
      
        16M written and 16M (32768 * 512b) accounted:
        # cat /sys/block/dm-2/stat | awk '{ print $7 }'
        32768
      
      Fixes: 18a25da8 ("dm: ensure bio submission follows a depth-first tree walk")
      Cc: stable@vger.kernel.org # 4.16+
      Reported-by: Bryan Gurney <bgurney@redhat.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      a1e1cb72
    • dm: fix clone_bio() to trigger blk_recount_segments() · 57c36519
      Mike Snitzer authored
      
      
      DM's clone_bio() now benefits from using bio_trim(), which fixes the
      fact that clone_bio() wasn't clearing BIO_SEG_VALID the way bio_trim()
      does; clearing that flag triggers blk_recount_segments() via
      bio_phys_segments().
      
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      57c36519
  5. Jan 15, 2019
    • dm thin: fix passdown_double_checking_shared_status() · d445bd9c
      Joe Thornber authored
      
      
      Commit 00a0ea33 ("dm thin: do not queue freed thin mapping for next
      stage processing") changed process_prepared_discard_passdown_pt1() to
      keep the reference counts of all the blocks being discarded incremented
      until after the passdown has completed, to avoid them being prematurely
      reused.
      
      IO issued to a thin device that breaks sharing with a snapshot, followed
      by a discard issued to snapshot(s) that previously shared the block(s),
      results in passdown_double_checking_shared_status() being called to
      iterate through the blocks, double-checking that their reference count
      is zero and issuing the passdown if so.  So a side effect of commit
      00a0ea33 is that passdown_double_checking_shared_status() was broken.
      
      Fix this by checking if the block reference count is greater than 1.
      Also, rename dm_pool_block_is_used() to dm_pool_block_is_shared().
      
      Fixes: 00a0ea33 ("dm thin: do not queue freed thin mapping for next stage processing")
      Cc: stable@vger.kernel.org # 4.9+
      Reported-by: <ryan.p.norwood@gmail.com>
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      d445bd9c
  6. Jan 14, 2019
  7. Jan 10, 2019
    • dm crypt: fix parsing of extended IV arguments · 1856b9f7
      Milan Broz authored
      
      
      The dm-crypt cipher specification in a mapping table is defined as:
        cipher[:keycount]-chainmode-ivmode[:ivopts]
      or (new crypt API format):
        capi:cipher_api_spec-ivmode[:ivopts]
      
      For ESSIV, the parameter includes a hash specification, for example:
      aes-cbc-essiv:sha256
      
      The implementation expected that the additional IV option would never
      include another dash '-' character.
      
      But, with SHA3, there are names like sha3-256; so the mapping table
      parser fails:
      
      dmsetup create test --table "0 8 crypt aes-cbc-essiv:sha3-256 9c1185a5c5e9fc54612808977ee8f5b9e 0 /dev/sdb 0"
        or (new crypt API format)
      dmsetup create test --table "0 8 crypt capi:cbc(aes)-essiv:sha3-256 9c1185a5c5e9fc54612808977ee8f5b9e 0 /dev/sdb 0"
      
        device-mapper: crypt: Ignoring unexpected additional cipher options
        device-mapper: table: 253:0: crypt: Error creating IV
        device-mapper: ioctl: error adding target to table
      
      Fix the dm-crypt constructor to ignore an additional dash in IV options
      and also remove a bogus warning (which is ignored anyway).
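      A rough sketch of the parsing idea (dm-crypt's real constructor handles
      many more cases; this only shows splitting the ivmode[:ivopts] part on
      the first ':' so that dashes inside ivopts, such as sha3-256, survive
      intact):
      
        /* Sketch only: split "essiv:sha3-256" into ivmode and ivopts at the
         * first ':' and keep the remainder verbatim, dashes included. */
        static void parse_ivmode(char *ivmode, char **ivopts)
        {
                *ivopts = strchr(ivmode, ':');
                if (*ivopts) {
                        **ivopts = '\0';        /* terminate ivmode */
                        (*ivopts)++;            /* ivopts = "sha3-256" */
                }
        }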
      
      Cc: stable@vger.kernel.org # 4.12+
      Signed-off-by: Milan Broz <gmazyland@gmail.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      1856b9f7
  8. Dec 28, 2018
  9. Dec 20, 2018
    • md: fix raid10 hang issue caused by barrier · e820d55c
      Guoqing Jiang authored
      
      
      When regular IO and resync IO happen at the same time, and the
      regular IO also needs to be split, tasks can hang due to the
      barrier.
      
      1. resync thread
      [ 1463.757205] INFO: task md1_resync:5215 blocked for more than 480 seconds.
      [ 1463.757207]       Not tainted 4.19.5-1-default #1
      [ 1463.757209] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 1463.757212] md1_resync      D    0  5215      2 0x80000000
      [ 1463.757216] Call Trace:
      [ 1463.757223]  ? __schedule+0x29a/0x880
      [ 1463.757231]  ? raise_barrier+0x8d/0x140 [raid10]
      [ 1463.757236]  schedule+0x78/0x110
      [ 1463.757243]  raise_barrier+0x8d/0x140 [raid10]
      [ 1463.757248]  ? wait_woken+0x80/0x80
      [ 1463.757257]  raid10_sync_request+0x1f6/0x1e30 [raid10]
      [ 1463.757265]  ? _raw_spin_unlock_irq+0x22/0x40
      [ 1463.757284]  ? is_mddev_idle+0x125/0x137 [md_mod]
      [ 1463.757302]  md_do_sync.cold.78+0x404/0x969 [md_mod]
      [ 1463.757311]  ? wait_woken+0x80/0x80
      [ 1463.757336]  ? md_rdev_init+0xb0/0xb0 [md_mod]
      [ 1463.757351]  md_thread+0xe9/0x140 [md_mod]
      [ 1463.757358]  ? _raw_spin_unlock_irqrestore+0x2e/0x60
      [ 1463.757364]  ? __kthread_parkme+0x4c/0x70
      [ 1463.757369]  kthread+0x112/0x130
      [ 1463.757374]  ? kthread_create_worker_on_cpu+0x40/0x40
      [ 1463.757380]  ret_from_fork+0x3a/0x50
      
      2. regular IO
      [ 1463.760679] INFO: task kworker/0:8:5367 blocked for more than 480 seconds.
      [ 1463.760683]       Not tainted 4.19.5-1-default #1
      [ 1463.760684] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 1463.760687] kworker/0:8     D    0  5367      2 0x80000000
      [ 1463.760718] Workqueue: md submit_flushes [md_mod]
      [ 1463.760721] Call Trace:
      [ 1463.760731]  ? __schedule+0x29a/0x880
      [ 1463.760741]  ? wait_barrier+0xdd/0x170 [raid10]
      [ 1463.760746]  schedule+0x78/0x110
      [ 1463.760753]  wait_barrier+0xdd/0x170 [raid10]
      [ 1463.760761]  ? wait_woken+0x80/0x80
      [ 1463.760768]  raid10_write_request+0xf2/0x900 [raid10]
      [ 1463.760774]  ? wait_woken+0x80/0x80
      [ 1463.760778]  ? mempool_alloc+0x55/0x160
      [ 1463.760795]  ? md_write_start+0xa9/0x270 [md_mod]
      [ 1463.760801]  ? try_to_wake_up+0x44/0x470
      [ 1463.760810]  raid10_make_request+0xc1/0x120 [raid10]
      [ 1463.760816]  ? wait_woken+0x80/0x80
      [ 1463.760831]  md_handle_request+0x121/0x190 [md_mod]
      [ 1463.760851]  md_make_request+0x78/0x190 [md_mod]
      [ 1463.760860]  generic_make_request+0x1c6/0x470
      [ 1463.760870]  raid10_write_request+0x77a/0x900 [raid10]
      [ 1463.760875]  ? wait_woken+0x80/0x80
      [ 1463.760879]  ? mempool_alloc+0x55/0x160
      [ 1463.760895]  ? md_write_start+0xa9/0x270 [md_mod]
      [ 1463.760904]  raid10_make_request+0xc1/0x120 [raid10]
      [ 1463.760910]  ? wait_woken+0x80/0x80
      [ 1463.760926]  md_handle_request+0x121/0x190 [md_mod]
      [ 1463.760931]  ? _raw_spin_unlock_irq+0x22/0x40
      [ 1463.760936]  ? finish_task_switch+0x74/0x260
      [ 1463.760954]  submit_flushes+0x21/0x40 [md_mod]
      
      So the resync IO is waiting for the regular write IO to complete in
      order to decrease nr_pending (conf->barrier++ is called before waiting).
      The regular write IO splits off another bio after calling wait_barrier,
      which does nr_pending++; the split bio then continues with
      raid10_write_request -> wait_barrier, so it has to wait for the barrier
      to drop to zero, and the deadlock happens as follows.
      
      	resync io		regular io
      
      	raise_barrier
      				wait_barrier
      				generic_make_request
      				wait_barrier
      
      To resolve the issue, call allow_barrier to decrease nr_pending before
      generic_make_request, since at that point the regular IO has not yet
      been issued to the underlying devices, and call wait_barrier again
      afterwards to ensure no internal IO is in flight.
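      The shape of the fix in the request path is roughly the following (a
      sketch, not the exact raid10 code):
      
        /* Sketch: when the request must be split, drop our barrier count
         * before re-submitting the remainder, then re-take it. */
        if (sectors > max_sectors) {
                struct bio *split = bio_split(bio, max_sectors, GFP_NOIO,
                                              &conf->bio_split);
                bio_chain(split, bio);
                allow_barrier(conf);            /* nr_pending-- */
                generic_make_request(bio);      /* re-queue the remainder */
                wait_barrier(conf);             /* nr_pending++ again */
                bio = split;
        }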
      
      Fixes: fc9977dd ("md/raid10: simplify the splitting of requests.")
      Reported-and-tested-by: Siniša Bandin <sinisa@4net.rs>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      e820d55c
    • raid10: refactor common wait code from regular read/write request · caea3c47
      Guoqing Jiang authored
      
      
      Both raid10_read_request and raid10_write_request share the same code
      at the beginning, so introduce regular_request_wait to clean up the
      code, and call it in both request functions.
      
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      caea3c47
    • md: remove redundant condition check · 37b22c28
      Chengguang Xu authored
      
      
      mempool_destroy() can handle NULL pointer correctly,
      so there is no need to check NULL pointer before calling
      mempool_destroy().
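      The pattern being removed looks roughly like this (the pool field name
      here is hypothetical):
      
        /* Before: redundant NULL check (conf->pool is a hypothetical field). */
        if (conf->pool)
                mempool_destroy(conf->pool);
      
        /* After: mempool_destroy() already returns early on NULL. */
        mempool_destroy(conf->pool);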
      
      Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      37b22c28
    • md: remove set but not used variable 'bi_rdev' · f91389c8
      Yue Haibing authored
      
      
      Fixes gcc '-Wunused-but-set-variable' warning:
      
      drivers/md/md.c: In function 'md_integrity_add_rdev':
      drivers/md/md.c:2149:24: warning:
       variable 'bi_rdev' set but not used [-Wunused-but-set-variable]
      
      It has not been used since commit
        1501efad ("md/raid: only permit hot-add of compatible integrity profiles")
      
      Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      f91389c8
  10. Dec 19, 2018
    • dm: don't reuse bio for flushes · dbe3ece1
      Jens Axboe authored
      
      
      DM currently has a statically allocated bio that it uses to issue empty
      flushes. It doesn't submit this bio, it just uses it for maintaining
      state while setting up clones. Multiple users can access this bio at the
      same time. This wasn't previously an issue, even if it was a bit iffy,
      but with the blkg associations it can become one.
      
      We set up the blkg association, then clone bios and submit, then remove
      the blkg association again. But since we can have multiple tasks doing
      this at the same time, against multiple blkgs, we can either lose
      references to a blkg, or put it twice. The latter causes complaints about
      the percpu ref being <= 0 when released, and can cause use-after-free as
      well. Ming reports that xfstests generic/475 triggers this:
      
      ------------[ cut here ]------------
      percpu ref (blkg_release) <= 0 (0) after switching to atomic
      WARNING: CPU: 13 PID: 0 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x2c9/0x4a0
      
      Switch to just using an on-stack bio for this, and get rid of the
      embedded bio.
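      The core of the change is roughly the following (a sketch of the
      on-stack bio setup, not the full flush path):
      
        /* Sketch: an on-stack bio used only as the basis for the clones,
         * so concurrent flushes no longer share one static bio. */
        struct bio flush_bio;
      
        bio_init(&flush_bio, NULL, 0);
        flush_bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC;
        ci->bio = &flush_bio;           /* used while setting up clones */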
      
      Fixes: 5cdf2e3f ("blkcg: associate blkg when associating a device")
      Reported-by: Ming Lei <ming.lei@redhat.com>
      Tested-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      dbe3ece1
  11. Dec 18, 2018
    • dm: do not allow readahead to limit IO size · c6d6e9b0
      Jaegeuk Kim authored
      
      
      Update DM to set the bdi's io_pages.  With this, reads are capped at
      the device's max request size (even if the user's read IO exceeds the
      established readahead setting) instead of being limited by readahead.
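      The change presumably boils down to deriving io_pages from the queue
      limits when DM configures the queue, along these lines (a sketch; the
      exact location and limit fields are assumptions):
      
        /* Sketch: let reads grow up to the device's max request size rather
         * than the readahead window (max_sectors is in 512-byte sectors). */
        q->backing_dev_info->io_pages =
                limits->max_sectors >> (PAGE_SHIFT - 9);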
      
      Fixes: 9491ae4a ("mm: don't cap request size based on read-ahead setting")
      Cc: stable@vger.kernel.org
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      c6d6e9b0
    • dm raid: fix false -EBUSY when handling check/repair message · 74694bcb
      Heinz Mauelshagen authored
      
      
      Sending a check/repair message infrequently leads to -EBUSY instead of
      properly identifying an active resync.  This occurs because
      raid_message() is testing recovery bits in a racy way.
      
      Fix by calling decipher_sync_action() from raid_message() to properly
      identify the idle state of the RAID device.
      
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      74694bcb
    • dm rq: cleanup leftover code from recently removed q->mq_ops branching · 34743bfd
      Mike Snitzer authored
      
      
      When commit 6a23e05c ("dm: remove legacy request-based IO path")
      removed some q->mq_ops branching from map_request() it left in place a
      goto that was only needed if that branching (and conditional 'r'
      assignment) existed.  Now that the branching is gone map_request()'s
      goto can be removed too.
      
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      34743bfd
    • dm verity: log the hash algorithm implementation · bbf6a566
      Eric Biggers authored
      
      
      Log the hash algorithm's driver name when a dm-verity target is created.
      This will help people determine whether the expected implementation is
      being used.  It can make an enormous difference; e.g., SHA-256 on ARM
      can be 8x faster with the crypto extensions than without.  It can also
      be useful to know if an implementation using an external crypto
      accelerator is being used instead of a software implementation.
      
      Example message:
      
      [   35.281945] device-mapper: verity: sha256 using implementation "sha256-ce"
      
      We've already found the similar message in fs/crypto/keyinfo.c to be
      very useful.
      
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      bbf6a566
    • dm crypt: log the encryption algorithm implementation · af331eba
      Eric Biggers authored
      
      
      Log the encryption algorithm's driver name when a dm-crypt target is
      created.  This will help people determine whether the expected
      implementation is being used.  In some cases we've seen people do
      benchmarks and reject using encryption for performance reasons, when in
      fact they used a much slower implementation than was possible on the
      hardware.  It can make an enormous difference; e.g., AES-XTS on ARM can
      be over 10x faster with the crypto extensions than without.  It can also
      be useful to know if an implementation using an external crypto
      accelerator is being used instead of a software implementation.
      
      Example message:
      
      [   29.307629] device-mapper: crypt: xts(aes) using implementation "xts-aes-ce"
      
      We've already found the similar message in fs/crypto/keyinfo.c to be
      very useful.
      
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      af331eba
    • dm integrity: fix spelling mistake in workqueue name · e8c2566f
      Colin Ian King authored
      
      
      Rename the workqueue from dm-intergrity-recalc to dm-integrity-recalc.
      
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      e8c2566f
    • dm flakey: Properly corrupt multi-page bios. · a00f5276
      Sweet Tea authored
      
      
      The flakey target is documented to be able to corrupt the Nth byte in
      a bio, but does not corrupt byte indices after the first biovec in the
      bio. Change the corrupting function to actually corrupt the Nth byte
      no matter in which biovec that index falls.
      
      A test device generating two-page bios, atop a flakey device configured
      to corrupt a byte index on the second page, verified both the failure
      to corrupt before this patch and the expected corruption after this
      change.
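      The corruption walk across biovecs looks roughly like this (a sketch;
      it assumes the data pages are directly mappable and uses illustrative
      names for the byte index and value):
      
        /* Sketch: find the Nth byte of the bio's data, on whichever biovec
         * (and page) it lands, and overwrite it. */
        static void corrupt_nth_byte(struct bio *bio, unsigned int nth, u8 value)
        {
                struct bio_vec bvec;
                struct bvec_iter iter;
      
                bio_for_each_segment(bvec, bio, iter) {
                        if (nth < bvec.bv_len) {
                                char *data = page_address(bvec.bv_page) +
                                             bvec.bv_offset;
      
                                data[nth] = value;
                                break;
                        }
                        nth -= bvec.bv_len;
                }
        }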
      
      Signed-off-by: John Dorminy <jdorminy@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      a00f5276
    • dm: Check for device sector overflow if CONFIG_LBDAF is not set · ef87bfc2
      Milan Broz authored
      
      
      A reference to a device in a device-mapper table contains an offset in
      sectors.
      
      If sector_t is a 32-bit integer (CONFIG_LBDAF is not set), then several
      device-mapper targets can overflow this offset; the validity check is
      then performed on a wrong offset and a wrong table is activated.
      
      See for example (on 32bit without CONFIG_LBDAF) this overflow:
      
        # dmsetup create test --table "0 2048 linear /dev/sdg 4294967297"
        # dmsetup table test
        0 2048 linear 8:96 1
      
      This patch adds an explicit overflow check if the offset is of sector_t
      type.
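      The check itself is along these lines (a sketch of the pattern used in
      the target constructors; variable names are illustrative):
      
        /* Sketch: parse the offset as a 64-bit value and reject it if it
         * does not survive a round-trip through sector_t (32-bit w/o LBDAF). */
        unsigned long long start;
        char dummy;
      
        if (sscanf(argv[1], "%llu%c", &start, &dummy) != 1 ||
            start != (sector_t)start) {
                ti->error = "Invalid device sector";
                return -EINVAL;
        }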
      
      Signed-off-by: Milan Broz <gmazyland@gmail.com>
      Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      ef87bfc2
    • dm crypt: use u64 instead of sector_t to store iv_offset · 8d683dcd
      AliOS system security authored
      
      
      The iv_offset in the mapping table of the crypt target is a 64-bit
      number when the IV algorithm is plain64, plain64be, essiv or benbi. It
      will be assigned to iv_offset of struct crypt_config, cc_sector of
      struct convert_context and iv_sector of struct dm_crypt_request. These
      structure members are defined as sector_t. But sector_t is 32 bits when
      CONFIG_LBDAF is not set in a 32-bit kernel. In this situation sector_t
      is not big enough to store the 64-bit iv_offset.
      
      Here is a reproducer.
      Prepare test image and device (loop is automatically allocated by cryptsetup):
      
        # dd if=/dev/zero of=tst.img bs=1M count=1
        # echo "tst"|cryptsetup open --type plain -c aes-xts-plain64 \
        --skip 500000000000000000 tst.img test
      
      On a 32-bit system (using an IV offset value that overflows sector_t;
      CONFIG_LBDAF is off), the table and the device checksum are wrong:
      
        # dmsetup table test --showkeys
        0 2048 crypt aes-xts-plain64 dfa7cfe3c481f2239155739c42e539ae8f2d38f304dcc89d20b26f69daaf0933 3551657984 7:0 0
      
        # sha256sum /dev/mapper/test
        533e25c09176632b3794f35303488c4a8f3f965dffffa6ec2df347c168cb6c19 /dev/mapper/test
      
      On a 64-bit system (and on a 32-bit system with the patch), the table and checksum are now correct:
      
        # dmsetup table test --showkeys
        0 2048 crypt aes-xts-plain64 dfa7cfe3c481f2239155739c42e539ae8f2d38f304dcc89d20b26f69daaf0933 500000000000000000 7:0 0
      
        # sha256sum /dev/mapper/test
        5d16160f9d5f8c33d8051e65fdb4f003cc31cd652b5abb08f03aa6fce0df75fc /dev/mapper/test
      
      Signed-off-by: AliOS system security <alios_sys_security@linux.alibaba.com>
      Tested-and-Reviewed-by: Milan Broz <gmazyland@gmail.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      8d683dcd
    • dm kcopyd: Fix bug causing workqueue stalls · d7e6b8df
      Nikos Tsironis authored
      When using kcopyd to run callbacks through dm_kcopyd_do_callback() or
      submitting copy jobs with a source size of 0, the jobs are pushed
      directly to the complete_jobs list, which could be under processing by
      the kcopyd thread. As a result, the kcopyd thread can continue running
      completed jobs indefinitely, without releasing the CPU, as long as
      someone keeps submitting new completed jobs through the aforementioned
      paths. Processing of work items, queued for execution on the same CPU as
      the currently running kcopyd thread, is thus stalled for excessive
      amounts of time, hurting performance.
      
      Running the following test, from the device mapper test suite [1],
      
        dmtest run --suite snapshot -n parallel_io_to_many_snaps_N
      
      , with 8 active snapshots, we get, in dmesg, messages like the
      following:
      
      [68899.948523] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 95s!
      [68899.949282] Showing busy workqueues and worker pools:
      [68899.949288] workqueue events: flags=0x0
      [68899.949295]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
      [68899.949306]     pending: vmstat_shepherd, cache_reap
      [68899.949331] workqueue mm_percpu_wq: flags=0x8
      [68899.949337]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
      [68899.949345]     pending: vmstat_update
      [68899.949387] workqueue dm_bufio_cache: flags=0x8
      [68899.949392]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
      [68899.949400]     pending: work_fn [dm_bufio]
      [68899.949423] workqueue kcopyd: flags=0x8
      [68899.949429]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
      [68899.949437]     pending: do_work [dm_mod]
      [68899.949452] workqueue kcopyd: flags=0x8
      [68899.949458]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
      [68899.949466]     in-flight: 13:do_work [dm_mod]
      [68899.949474]     pending: do_work [dm_mod]
      [68899.949487] workqueue kcopyd: flags=0x8
      [68899.949493]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
      [68899.949501]     pending: do_work [dm_mod]
      [68899.949515] workqueue kcopyd: flags=0x8
      [68899.949521]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
      [68899.949529]     pending: do_work [dm_mod]
      [68899.949541] workqueue kcopyd: flags=0x8
      [68899.949547]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
      [68899.949555]     pending: do_work [dm_mod]
      [68899.949568] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=95s workers=4 idle: 27130 27223 1084
      
      Fix this by splitting the complete_jobs list into two parts: A user
      facing part, named callback_jobs, and one used internally by kcopyd,
      retaining the name complete_jobs. dm_kcopyd_do_callback() and
      dispatch_job() now push their jobs to the callback_jobs list, which is
      spliced to the complete_jobs list once, every time the kcopyd thread
      wakes up. This prevents kcopyd from hogging the CPU indefinitely and
      causing workqueue stalls.
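      The splice in the kcopyd worker looks roughly like this (a sketch of
      the idea; locking and job processing follow the description above):
      
        /* Sketch: at each wakeup, move externally queued completed jobs onto
         * the internal complete_jobs list exactly once, so user-facing
         * callers can no longer keep the worker spinning forever. */
        unsigned long flags;
      
        spin_lock_irqsave(&kc->job_lock, flags);
        list_splice_tail_init(&kc->callback_jobs, &kc->complete_jobs);
        spin_unlock_irqrestore(&kc->job_lock, flags);
      
        process_jobs(&kc->complete_jobs, kc, run_complete_job);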
      
      Re-running the aforementioned test:
      
        * Workqueue stalls are eliminated
        * The maximum writing time among all targets is reduced from 09m37.10s
          to 06m04.85s and the total run time of the test is reduced from
          10m43.591s to 7m19.199s
      
      [1] https://github.com/jthornber/device-mapper-test-suite
      
      
      
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      d7e6b8df
    • dm snapshot: Fix excessive memory usage and workqueue stalls · 721b1d98
      Nikos Tsironis authored
      kcopyd has no upper limit to the number of jobs one can allocate and
      issue. Under certain workloads this can lead to excessive memory usage
      and workqueue stalls. For example, when creating multiple dm-snapshot
      targets with a 4K chunk size and then writing to the origin through the
      page cache. Syncing the page cache causes a large number of BIOs to be
      issued to the dm-snapshot origin target, which itself issues an even
      larger (because of the BIO splitting taking place) number of kcopyd
      jobs.
      
      Running the following test, from the device mapper test suite [1],
      
        dmtest run --suite snapshot -n many_snapshots_of_same_volume_N
      
      , with 8 active snapshots, results in the kcopyd job slab cache growing
      to 10G. Depending on the available system RAM this can lead to the OOM
      killer killing user processes:
      
      [463.492878] kthreadd invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP),
                    nodemask=(null), order=1, oom_score_adj=0
      [463.492894] kthreadd cpuset=/ mems_allowed=0
      [463.492948] CPU: 7 PID: 2 Comm: kthreadd Not tainted 4.19.0-rc7 #3
      [463.492950] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
      [463.492952] Call Trace:
      [463.492964]  dump_stack+0x7d/0xbb
      [463.492973]  dump_header+0x6b/0x2fc
      [463.492987]  ? lockdep_hardirqs_on+0xee/0x190
      [463.493012]  oom_kill_process+0x302/0x370
      [463.493021]  out_of_memory+0x113/0x560
      [463.493030]  __alloc_pages_slowpath+0xf40/0x1020
      [463.493055]  __alloc_pages_nodemask+0x348/0x3c0
      [463.493067]  cache_grow_begin+0x81/0x8b0
      [463.493072]  ? cache_grow_begin+0x874/0x8b0
      [463.493078]  fallback_alloc+0x1e4/0x280
      [463.493092]  kmem_cache_alloc_node+0xd6/0x370
      [463.493098]  ? copy_process.part.31+0x1c5/0x20d0
      [463.493105]  copy_process.part.31+0x1c5/0x20d0
      [463.493115]  ? __lock_acquire+0x3cc/0x1550
      [463.493121]  ? __switch_to_asm+0x34/0x70
      [463.493129]  ? kthread_create_worker_on_cpu+0x70/0x70
      [463.493135]  ? finish_task_switch+0x90/0x280
      [463.493165]  _do_fork+0xe0/0x6d0
      [463.493191]  ? kthreadd+0x19f/0x220
      [463.493233]  kernel_thread+0x25/0x30
      [463.493235]  kthreadd+0x1bf/0x220
      [463.493242]  ? kthread_create_on_cpu+0x90/0x90
      [463.493248]  ret_from_fork+0x3a/0x50
      [463.493279] Mem-Info:
      [463.493285] active_anon:20631 inactive_anon:4831 isolated_anon:0
      [463.493285]  active_file:80216 inactive_file:80107 isolated_file:435
      [463.493285]  unevictable:0 dirty:51266 writeback:109372 unstable:0
      [463.493285]  slab_reclaimable:31191 slab_unreclaimable:3483521
      [463.493285]  mapped:526 shmem:4903 pagetables:1759 bounce:0
      [463.493285]  free:33623 free_pcp:2392 free_cma:0
      ...
      [463.493489] Unreclaimable slab info:
      [463.493513] Name                      Used          Total
      [463.493522] bio-6                   1028KB       1028KB
      [463.493525] bio-5                   1028KB       1028KB
      [463.493528] dm_snap_pending_exception     236783KB     243789KB
      [463.493531] dm_exception              41KB         42KB
      [463.493534] bio-4                   1216KB       1216KB
      [463.493537] bio-3                 439396KB     439396KB
      [463.493539] kcopyd_job           6973427KB    6973427KB
      ...
      [463.494340] Out of memory: Kill process 1298 (ruby2.3) score 1 or sacrifice child
      [463.494673] Killed process 1298 (ruby2.3) total-vm:435740kB, anon-rss:20180kB, file-rss:4kB, shmem-rss:0kB
      [463.506437] oom_reaper: reaped process 1298 (ruby2.3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      
      Moreover, issuing a large number of kcopyd jobs results in kcopyd
      hogging the CPU, while processing them. As a result, processing of work
      items, queued for execution on the same CPU as the currently running
      kcopyd thread, is stalled for long periods of time, hurting performance.
      Running the aforementioned test we get, in dmesg, messages like the
      following:
      
      [67501.194592] BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 27s!
      [67501.195586] Showing busy workqueues and worker pools:
      [67501.195591] workqueue events: flags=0x0
      [67501.195597]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195611]     pending: cache_reap
      [67501.195641] workqueue mm_percpu_wq: flags=0x8
      [67501.195645]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195656]     pending: vmstat_update
      [67501.195682] workqueue kblockd: flags=0x18
      [67501.195687]   pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=1/256
      [67501.195698]     pending: blk_timeout_work
      [67501.195753] workqueue kcopyd: flags=0x8
      [67501.195757]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195768]     pending: do_work [dm_mod]
      [67501.195802] workqueue kcopyd: flags=0x8
      [67501.195806]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195817]     pending: do_work [dm_mod]
      [67501.195834] workqueue kcopyd: flags=0x8
      [67501.195838]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195848]     pending: do_work [dm_mod]
      [67501.195881] workqueue kcopyd: flags=0x8
      [67501.195885]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
      [67501.195896]     pending: do_work [dm_mod]
      [67501.195920] workqueue kcopyd: flags=0x8
      [67501.195924]   pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=2/256
      [67501.195935]     in-flight: 67:do_work [dm_mod]
      [67501.195945]     pending: do_work [dm_mod]
      [67501.195961] pool 8: cpus=4 node=0 flags=0x0 nice=0 hung=27s workers=3 idle: 129 23765
      
      The root cause for these issues is the way dm-snapshot uses kcopyd. In
      particular, the lack of an explicit or implicit limit to the maximum
      number of in-flight COW jobs. The merging path is not affected because
      it implicitly limits the in-flight kcopyd jobs to one.
      
      Fix these issues by using a semaphore to limit the maximum number of
      in-flight kcopyd jobs. We grab the semaphore before allocating a new
      kcopyd job in start_copy() and start_full_bio() and release it after the
      job finishes in copy_callback().
      
      The initial semaphore value is configurable through a module parameter,
      to allow fine tuning the maximum number of in-flight COW jobs. Setting
      this parameter to zero initializes the semaphore to INT_MAX.
      
      A default value of 2048 maximum in-flight kcopyd jobs was chosen. This
      value was decided experimentally as a trade-off between memory
      consumption, stalling the kernel's workqueues and maintaining a high
      enough throughput.
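      Schematically, the throttling looks like this (a sketch under the
      assumption of a semaphore member named cow_count and a module parameter
      as described above; not the literal patch):
      
        /* Sketch: bound the number of in-flight COW kcopyd jobs. */
        #define DEFAULT_COW_THRESHOLD 2048
        static unsigned cow_threshold = DEFAULT_COW_THRESHOLD;
        module_param_named(snapshot_cow_threshold, cow_threshold, uint, 0644);
      
        /* In the snapshot constructor: 0 means "effectively unlimited". */
        sema_init(&s->cow_count, cow_threshold ? cow_threshold : INT_MAX);
      
        /* start_copy() / start_full_bio(): block if too many jobs in flight. */
        down(&s->cow_count);
        /* ... allocate and submit the kcopyd job ... */
      
        /* copy_callback(): a job finished, let another one in. */
        up(&s->cow_count);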
      
      Re-running the aforementioned test:
      
        * Workqueue stalls are eliminated
        * kcopyd's job slab cache uses a maximum of 130MB
        * The time taken by the test to write to the snapshot-origin target is
          reduced from 05m20.48s to 03m26.38s
      
      [1] https://github.com/jthornber/device-mapper-test-suite
      
      
      
      Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      721b1d98
    • dm bufio: update comment in dm-bufio.c · ef992373
      Shenghui Wang authored
      
      
      * The hashtable has been replaced by an rbtree to manage buffers;
        update the comment.
      * Fix a typo in the comment for dm_bufio_issue_flush.
      
      Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      ef992373
    • dm writecache: fix typo in error msg for creating writecache_flush_thread · e8ea141a
      Shenghui Wang authored
      
      
      The error msg should be "flush thread" instead of "endio thread"
      for writecache_flush_thread.
      
      Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      e8ea141a
    • dm: remove indirect calls from __send_changing_extent_only() · 53b47168
      Mike Snitzer authored
      
      
      No need to be so fancy.
      
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      53b47168
    • dm mpath: only flush workqueue when needed · 935fcc56
      wuzhouhui authored
      
      
      The workqueues are shared by many multipath devices, so only flush the
      whole workqueue when necessary.  Otherwise, just flush individual works
      as needed.
      
      Signed-off-by: wuzhouhui <wuzhouhui14@mails.ucas.ac.cn>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      935fcc56
    • 2adc5c55
    • dm: avoid indirect call in __dm_make_request · 24113d48
      Mikulas Patocka authored
      
      
      Indirect calls are inefficient because of the retpolines used as a
      Spectre workaround. This patch replaces an indirect call with a
      condition (which can be predicted by the branch predictor).
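      The general shape of the change is the following (an illustrative
      sketch with hypothetical helper names, not the literal dm.c code):
      
        /* Before (sketch): an indirect call through a function pointer,
         * which costs a retpoline when Spectre mitigations are enabled. */
        ret = process_bio(md, map, bio);
      
        /* After (sketch): a plain branch the CPU can predict. */
        if (dm_is_nvme_bio_based(md))           /* hypothetical predicate */
                ret = __process_bio(md, map, bio);
        else
                ret = __split_and_process_bio(md, map, bio);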
      
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      24113d48
    • blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight() · 3c94d83c
      Jens Axboe authored
      
      
      There's a single user of this function, dm, and dm just wants
      to check if IO is inflight, not that it's just allocated.
      
      This fixes a hang with srp/002 in blktests with dm, where it tries
      to suspend but waits for inflight IO to finish first. As it checks
      for just allocated requests, this fails.
      
      Tested-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3c94d83c
  12. Dec 13, 2018
    • bcache: print number of keys in trace_bcache_journal_write · e78bd0d2
      Guoju Fang authored
      
      
      Sometimes journal flushes may be very frequent, so it's useful to dump
      the number of keys every time the journal is written.
      
      Signed-off-by: Guoju Fang <fangguoju@gmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e78bd0d2
    • bcache: set writeback_percent in a flexible range · cc38ca7e
      Coly Li authored
      
      
      Because CUTOFF_WRITEBACK is defined as 40, before the dynamic cutoff
      writeback changes writeback_percent was limited to [0,
      CUTOFF_WRITEBACK]; any value larger than CUTOFF_WRITEBACK was fixed up
      to 40.
      
      Now the cutoff writeback limit is a dynamic value, bch_cutoff_writeback,
      so writeback_percent can cover the more flexible range [0,
      bch_cutoff_writeback]. The flexibility is that the range can be expanded
      to something larger or smaller than [0, 40], depending on how
      bch_cutoff_writeback is specified.
      
      The default value is still strongly recommended to most users for most
      workloads. But people who want to do research on bcache writeback
      performance tuning now have the chance to specify a more flexible
      writeback_percent in the range [0, 70].
      
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      cc38ca7e
    • bcache: make cutoff_writeback and cutoff_writeback_sync tunable · 9aaf5165
      Coly Li authored
      
      
      Currently the cutoff writeback and cutoff writeback sync thresholds are
      defined by CUTOFF_WRITEBACK (40) and CUTOFF_WRITEBACK_SYNC (70) as
      static values. Most of the time they work fine, but when people want to
      do research on bcache writeback mode performance tuning, there is no
      way to modify the soft and hard cutoff writeback values.
      
      This patch introduces two module parameters, bch_cutoff_writeback_sync
      and bch_cutoff_writeback, which permit people to tune the values when
      loading bcache.ko. If they are not specified at module load time, the
      current values CUTOFF_WRITEBACK_SYNC and CUTOFF_WRITEBACK are used as
      defaults and nothing changes.
      
      When people want to tune these two values,
      - cutoff_writeback can be set in range [1, 70]
      - cutoff_writeback_sync can be set in range [1, 90]
      - cutoff_writeback always <= cutoff_writeback_sync
      
      The default values are strongly recommended to most users for most
      workloads. But if people want to take their own risk and do research on
      new writeback cutoff tuning for their own workloads, now they can.
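      A sketch of what such module parameters look like (the declarations and
      clamping below are illustrative, not the literal bcache patch):
      
        /* Sketch: tunable soft/hard writeback cutoffs, set at module load. */
        static unsigned int bch_cutoff_writeback = CUTOFF_WRITEBACK;
        static unsigned int bch_cutoff_writeback_sync = CUTOFF_WRITEBACK_SYNC;
      
        module_param(bch_cutoff_writeback, uint, 0);
        MODULE_PARM_DESC(bch_cutoff_writeback, "threshold to cutoff writeback");
      
        module_param(bch_cutoff_writeback_sync, uint, 0);
        MODULE_PARM_DESC(bch_cutoff_writeback_sync,
                         "hard threshold to cutoff writeback");
      
        /* At init: keep the values inside the documented ranges, and keep
         * cutoff_writeback <= cutoff_writeback_sync. */
        bch_cutoff_writeback = clamp_val(bch_cutoff_writeback, 1, 70);
        bch_cutoff_writeback_sync = clamp_val(bch_cutoff_writeback_sync,
                                              bch_cutoff_writeback, 90);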
      
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9aaf5165
    • bcache: add MODULE_DESCRIPTION information · 009673d0
      Coly Li authored
      
      
      This patch moves MODULE_AUTHOR and MODULE_LICENSE to the end of
      super.c, and adds MODULE_DESCRIPTION("Bcache: a Linux block layer cache").
      
      This is preparation for adding module parameters.
      
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      009673d0