Skip to content
  1. Feb 09, 2019
    • Coly Li's avatar
      bcache: export backing_dev_name via sysfs · 926d1946
      Coly Li authored
      
      
      This patch export dc->backing_dev_name to sysfs file
      /sys/block/bcache<?>/bcache/backing_dev_name, then people or user space
      tools may know the backing device name of this bcache device.
      
      Of cause it can be done by parsing sysfs links, but this method can be
      much simpler to find the link between bcache device and backing device.
      
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      926d1946
    • Coly Li's avatar
      bcache: not use hard coded memset size in bch_cache_accounting_clear() · 83ff9318
      Coly Li authored
      
      
      In stats.c:bch_cache_accounting_clear(), a hard coded number '7' is
      used in memset(). It is because in struct cache_stats, there are 7
      atomic_t type members. This is not good when new members added into
      struct stats, the hard coded number will only clear part of memory.
      
      This patch replaces 'sizeof(unsigned long) * 7' by more generic
      'sizeof(struct cache_stats))', to avoid potential error if new
      member added into struct cache_stats.
      
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      83ff9318
    • Daniel Axtens's avatar
      bcache: never writeback a discard operation · 9951379b
      Daniel Axtens authored
      Some users see panics like the following when performing fstrim on a
      bcached volume:
      
      [  529.803060] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      [  530.183928] #PF error: [normal kernel read fault]
      [  530.412392] PGD 8000001f42163067 P4D 8000001f42163067 PUD 1f42168067 PMD 0
      [  530.750887] Oops: 0000 [#1] SMP PTI
      [  530.920869] CPU: 10 PID: 4167 Comm: fstrim Kdump: loaded Not tainted 5.0.0-rc1+ #3
      [  531.290204] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 12/27/2015
      [  531.693137] RIP: 0010:blk_queue_split+0x148/0x620
      [  531.922205] Code: 60 38 89 55 a0 45 31 db 45 31 f6 45 31 c9 31 ff 89 4d 98 85 db 0f 84 7f 04 00 00 44 8b 6d 98 4c 89 ee 48 c1 e6 04 49 03 70 78 <8b> 46 08 44 8b 56 0c 48
      8b 16 44 29 e0 39 d8 48 89 55 a8 0f 47 c3
      [  532.838634] RSP: 0018:ffffb9b708df39b0 EFLAGS: 00010246
      [  533.093571] RAX: 00000000ffffffff RBX: 0000000000046000 RCX: 0000000000000000
      [  533.441865] RDX: 0000000000000200 RSI: 0000000000000000 RDI: 0000000000000000
      [  533.789922] RBP: ffffb9b708df3a48 R08: ffff940d3b3fdd20 R09: 0000000000000000
      [  534.137512] R10: ffffb9b708df3958 R11: 0000000000000000 R12: 0000000000000000
      [  534.485329] R13: 0000000000000000 R14: 0000000000000000 R15: ffff940d39212020
      [  534.833319] FS:  00007efec26e3840(0000) GS:ffff940d1f480000(0000) knlGS:0000000000000000
      [  535.224098] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  535.504318] CR2: 0000000000000008 CR3: 0000001f4e256004 CR4: 00000000001606e0
      [  535.851759] Call Trace:
      [  535.970308]  ? mempool_alloc_slab+0x15/0x20
      [  536.174152]  ? bch_data_insert+0x42/0xd0 [bcache]
      [  536.403399]  blk_mq_make_request+0x97/0x4f0
      [  536.607036]  generic_make_request+0x1e2/0x410
      [  536.819164]  submit_bio+0x73/0x150
      [  536.980168]  ? submit_bio+0x73/0x150
      [  537.149731]  ? bio_associate_blkg_from_css+0x3b/0x60
      [  537.391595]  ? _cond_resched+0x1a/0x50
      [  537.573774]  submit_bio_wait+0x59/0x90
      [  537.756105]  blkdev_issue_discard+0x80/0xd0
      [  537.959590]  ext4_trim_fs+0x4a9/0x9e0
      [  538.137636]  ? ext4_trim_fs+0x4a9/0x9e0
      [  538.324087]  ext4_ioctl+0xea4/0x1530
      [  538.497712]  ? _copy_to_user+0x2a/0x40
      [  538.679632]  do_vfs_ioctl+0xa6/0x600
      [  538.853127]  ? __do_sys_newfstat+0x44/0x70
      [  539.051951]  ksys_ioctl+0x6d/0x80
      [  539.212785]  __x64_sys_ioctl+0x1a/0x20
      [  539.394918]  do_syscall_64+0x5a/0x110
      [  539.568674]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      We have observed it where both:
      1) LVM/devmapper is involved (bcache backing device is LVM volume) and
      2) writeback cache is involved (bcache cache_mode is writeback)
      
      On one machine, we can reliably reproduce it with:
      
       # echo writeback > /sys/block/bcache0/bcache/cache_mode
         (not sure whether above line is required)
       # mount /dev/bcache0 /test
       # for i in {0..10}; do
      	file="$(mktemp /test/zero.XXX)"
      	dd if=/dev/zero of="$file" bs=1M count=256
      	sync
      	rm $file
          done
        # fstrim -v /test
      
      Observing this with tracepoints on, we see the following writes:
      
      fstrim-18019 [022] .... 91107.302026: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 4260112 + 196352 hit 0 bypass 1
      fstrim-18019 [022] .... 91107.302050: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 4456464 + 262144 hit 0 bypass 1
      fstrim-18019 [022] .... 91107.302075: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 4718608 + 81920 hit 0 bypass 1
      fstrim-18019 [022] .... 91107.302094: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 5324816 + 180224 hit 0 bypass 1
      fstrim-18019 [022] .... 91107.302121: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 5505040 + 262144 hit 0 bypass 1
      fstrim-18019 [022] .... 91107.302145: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 5767184 + 81920 hit 0 bypass 1
      fstrim-18019 [022] .... 91107.308777: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 6373392 + 180224 hit 1 bypass 0
      <crash>
      
      Note the final one has different hit/bypass flags.
      
      This is because in should_writeback(), we were hitting a case where
      the partial stripe condition was returning true and so
      should_writeback() was returning true early.
      
      If that hadn't been the case, it would have hit the would_skip test, and
      as would_skip == s->iop.bypass == true, should_writeback() would have
      returned false.
      
      Looking at the git history from 'commit 72c27061 ("bcache: Write out
      full stripes")', it looks like the idea was to optimise for raid5/6:
      
             * If a stripe is already dirty, force writes to that stripe to
      	 writeback mode - to help build up full stripes of dirty data
      
      To fix this issue, make sure that should_writeback() on a discard op
      never returns true.
      
      More details of debugging:
      https://www.spinics.net/lists/linux-bcache/msg06996.html
      
      Previous reports:
       - https://bugzilla.kernel.org/show_bug.cgi?id=201051
       - https://bugzilla.kernel.org/show_bug.cgi?id=196103
       - https://www.spinics.net/lists/linux-bcache/msg06885.html
      
      
      
      (Coly Li: minor modification to follow maximum 75 chars per line rule)
      
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: stable@vger.kernel.org
      Fixes: 72c27061 ("bcache: Write out full stripes")
      Signed-off-by: default avatarDaniel Axtens <dja@axtens.net>
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      9951379b
  2. Dec 13, 2018
    • Guoju Fang's avatar
      bcache: print number of keys in trace_bcache_journal_write · e78bd0d2
      Guoju Fang authored
      
      
      Sometimes flush journal may be very frequent, so it's useful to dump
      number of keys every time write journal.
      
      Signed-off-by: default avatarGuoju Fang <fangguoju@gmail.com>
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e78bd0d2
    • Coly Li's avatar
      bcache: set writeback_percent in a flexible range · cc38ca7e
      Coly Li authored
      
      
      Because CUTOFF_WRITEBACK is defined as 40, so before the changes of
      dynamic cutoff writeback values, writeback_percent is limited to [0,
      CUTOFF_WRITEBACK]. Any value larger than CUTOFF_WRITEBACK will be fixed
      up to 40.
      
      Now cutof writeback limit is a dynamic value bch_cutoff_writeback, so
      the range of writeback_percent can be a more flexible range as [0,
      bch_cutoff_writeback]. The flexibility is, it can be expended to a
      larger or smaller range than [0, 40], depends on how value
      bch_cutoff_writeback is specified.
      
      The default value is still strongly recommended to most of users for
      most of workloads. But for people who want to do research on bcache
      writeback perforamnce tuning, they may have chance to specify more
      flexible writeback_percent in range [0, 70].
      
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      cc38ca7e
    • Coly Li's avatar
      bcache: make cutoff_writeback and cutoff_writeback_sync tunable · 9aaf5165
      Coly Li authored
      
      
      Currently the cutoff writeback and cutoff writeback sync thresholds are
      defined by CUTOFF_WRITEBACK (40) and CUTOFF_WRITEBACK_SYNC (70) as
      static values. Most of time these they work fine, but when people want
      to do research on bcache writeback mode performance tuning, there is no
      chance to modify the soft and hard cutoff writeback values.
      
      This patch introduces two module parameters bch_cutoff_writeback_sync
      and bch_cutoff_writeback which permit people to tune the values when
      loading bcache.ko. If they are not specified by module loading, current
      values CUTOFF_WRITEBACK_SYNC and CUTOFF_WRITEBACK will be used as
      default and nothing changes.
      
      When people want to tune this two values,
      - cutoff_writeback can be set in range [1, 70]
      - cutoff_writeback_sync can be set in range [1, 90]
      - cutoff_writeback always <= cutoff_writeback_sync
      
      The default values are strongly recommended to most of users for most of
      workloads. Anyway, if people wants to take their own risk to do research
      on new writeback cutoff tuning for their own workload, now they can make
      it.
      
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      9aaf5165
    • Coly Li's avatar
      bcache: add MODULE_DESCRIPTION information · 009673d0
      Coly Li authored
      
      
      This patch moves MODULE_AUTHOR and MODULE_LICENSE to end of super.c, and
      add MODULE_DESCRIPTION("Bcache: a Linux block layer cache").
      
      This is preparation for adding module parameters.
      
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      009673d0
    • Coly Li's avatar
      bcache: option to automatically run gc thread after writeback · 7a671d8e
      Coly Li authored
      
      
      The option gc_after_writeback is disabled by default, because garbage
      collection will discard SSD data which drops cached data.
      
      Echo 1 into /sys/fs/bcache/<UUID>/internal/gc_after_writeback will
      enable this option, which wakes up gc thread when writeback accomplished
      and all cached data is clean.
      
      This option is helpful for people who cares writing performance more. In
      heavy writing workload, all cached data can be clean only happens when
      writeback thread cleans all cached data in I/O idle time. In such
      situation a following gc running may help to shrink bcache B+ tree and
      discard more clean data, which may be helpful for future writing
      requests.
      
      If you are not sure whether this is helpful for your own workload,
      please leave it as disabled by default.
      
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      7a671d8e
    • Coly Li's avatar
      bcache: introduce force_wake_up_gc() · cb07ad63
      Coly Li authored
      
      
      Garbage collection thread starts to work when c->sectors_to_gc is
      negative value, otherwise nothing will happen even the gc thread is
      woken up by wake_up_gc().
      
      force_wake_up_gc() sets c->sectors_to_gc to -1 before calling
      wake_up_gc(), then gc thread may have chance to run if no one else sets
      c->sectors_to_gc to a positive value before gc_should_run().
      
      This routine can be called where the gc thread is woken up and required
      to run in force.
      
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      cb07ad63
    • Shenghui Wang's avatar
      bcache: cannot set writeback_running via sysfs if no writeback kthread created · f383ae30
      Shenghui Wang authored
      
      
      "echo 1 > writeback_running" marks writeback_running even if no
      writeback kthread created as "d_strtoul(writeback_running)" will simply
      set dc-> writeback_running without checking the existence of
      dc->writeback_thread.
      
      Add check for setting writeback_running via sysfs: if no writeback
      kthread available, reject setting to 1.
      
      v2 -> v3:
        * Make message on wrong assignment more clear.
        * Print name of bcache device instead of name of backing device.
      
      Signed-off-by: default avatarShenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      f383ae30
    • Shenghui Wang's avatar
      bcache: do not mark writeback_running too early · 79b79146
      Shenghui Wang authored
      
      
      A fresh backing device is not attached to any cache_set, and
      has no writeback kthread created until first attached to some
      cache_set.
      
      But bch_cached_dev_writeback_init run
      "
      	dc->writeback_running		= true;
      	WARN_ON(test_and_clear_bit(BCACHE_DEV_WB_RUNNING,
      			&dc->disk.flags));
      "
      for any newly formatted backing devices.
      
      For a fresh standalone backing device, we can get something like
      following even if no writeback kthread created:
      ------------------------
      /sys/block/bcache0/bcache# cat writeback_running
      1
      /sys/block/bcache0/bcache# cat writeback_rate_debug
      rate:		512.0k/sec
      dirty:		0.0k
      target:		0.0k
      proportional:	0.0k
      integral:	0.0k
      change:		0.0k/sec
      next io:	-15427384ms
      
      The none ZERO fields are misleading as no alive writeback kthread yet.
      
      Set dc->writeback_running false as no writeback thread created in
      bch_cached_dev_writeback_init().
      
      We have writeback thread created and woken up in bch_cached_dev_writeback
      _start(). Set dc->writeback_running true before bch_writeback_queue()
      called, as a writeback thread will check if dc->writeback_running is true
      before writing back dirty data, and hung if false detected.
      
      After the change, we can get the following output for a fresh standalone
      backing device:
      -----------------------
      /sys/block/bcache0/bcache$ cat writeback_running
      0
      /sys/block/bcache0/bcache# cat writeback_rate_debug
      rate:		0.0k/sec
      dirty:		0.0k
      target:		0.0k
      proportional:	0.0k
      integral:	0.0k
      change:		0.0k/sec
      next io:	0ms
      
      v1 -> v2:
        Set dc->writeback_running before bch_writeback_queue() called,
      
      Signed-off-by: default avatarShenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      79b79146
    • Shenghui Wang's avatar
      bcache: update comment in sysfs.c · 4e361e02
      Shenghui Wang authored
      
      
      We have struct cached_dev allocated by kzalloc in register_bcache(),
      which initializes all the fields of cached_dev with 0s. And commit
      ce4c3e19 ("bcache: Replace bch_read_string_list() by
      __sysfs_match_string()") has remove the string "default".
      
      Update the comment.
      
      Signed-off-by: default avatarShenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      4e361e02
    • Shenghui Wang's avatar
      bcache: update comment for bch_data_insert · 3db4d078
      Shenghui Wang authored
      
      
      commit 220bb38c ("bcache: Break up struct search") introduced
      changes to struct search and s->iop. bypass/bio are fields of struct
      data_insert_op now. Update the comment.
      
      Signed-off-by: default avatarShenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      3db4d078
    • Shenghui Wang's avatar
      bcache: do not check if debug dentry is ERR or NULL explicitly on remove · ae171023
      Shenghui Wang authored
      
      
      debugfs_remove and debugfs_remove_recursive will check if the dentry
      pointer is NULL or ERR, and will do nothing in that case.
      
      Remove the check in cache_set_free and bch_debug_init.
      
      Signed-off-by: default avatarShenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      ae171023
    • Shenghui Wang's avatar
      bcache: add comment for cache_set->fill_iter · d2f96f48
      Shenghui Wang authored
      
      
      We have the following define for btree iterator:
      	struct btree_iter {
      		size_t size, used;
      	#ifdef CONFIG_BCACHE_DEBUG
      		struct btree_keys *b;
      	#endif
      		struct btree_iter_set {
      			struct bkey *k, *end;
      		} data[MAX_BSETS];
      	};
      
      We can see that the length of data[] field is static MAX_BSETS, which is
      defined as 4 currently.
      
      But a btree node on disk could have too many bsets for an iterator to fit
      on the stack - maybe far more that MAX_BSETS. Have to dynamically allocate
      space to host more btree_iter_sets.
      
      bch_cache_set_alloc() will make sure the pool cache_set->fill_iter can
      allocate an iterator equipped with enough room that can host
      	(sb.bucket_size / sb.block_size)
      btree_iter_sets, which is more than static MAX_BSETS.
      
      bch_btree_node_read_done() will use that pool to allocate one iterator, to
      host many bsets in one btree node.
      
      Add more comment around cache_set->fill_iter to make code less confusing.
      
      Signed-off-by: default avatarShenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      d2f96f48
  3. Oct 08, 2018
  4. Sep 27, 2018
    • Guoju Fang's avatar
      bcache: add separate workqueue for journal_write to avoid deadlock · 0f843e65
      Guoju Fang authored
      
      
      After write SSD completed, bcache schedules journal_write work to
      system_wq, which is a public workqueue in system, without WQ_MEM_RECLAIM
      flag. system_wq is also a bound wq, and there may be no idle kworker on
      current processor. Creating a new kworker may unfortunately need to
      reclaim memory first, by shrinking cache and slab used by vfs, which
      depends on bcache device. That's a deadlock.
      
      This patch create a new workqueue for journal_write with WQ_MEM_RECLAIM
      flag. It's rescuer thread will work to avoid the deadlock.
      
      Signed-off-by: default avatarGuoju Fang <fangguoju@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      0f843e65
  5. Aug 22, 2018
  6. Aug 11, 2018
Loading