  1. Dec 13, 2018
    • bcache: do not mark writeback_running too early · 79b79146
      Shenghui Wang authored
      
      
      A fresh backing device is not attached to any cache_set, and
      has no writeback kthread created until first attached to some
      cache_set.
      
      But bch_cached_dev_writeback_init() runs
      "
      	dc->writeback_running		= true;
      	WARN_ON(test_and_clear_bit(BCACHE_DEV_WB_RUNNING,
      			&dc->disk.flags));
      "
      for any newly formatted backing devices.
      
      For a fresh standalone backing device, we can get something like the
      following even though no writeback kthread has been created:
      ------------------------
      /sys/block/bcache0/bcache# cat writeback_running
      1
      /sys/block/bcache0/bcache# cat writeback_rate_debug
      rate:		512.0k/sec
      dirty:		0.0k
      target:		0.0k
      proportional:	0.0k
      integral:	0.0k
      change:		0.0k/sec
      next io:	-15427384ms
      
      The non-zero fields are misleading, as there is no live writeback
      kthread yet.
      
      Set dc->writeback_running to false in bch_cached_dev_writeback_init(),
      as no writeback thread has been created at that point.
      
      The writeback thread is created and woken up in
      bch_cached_dev_writeback_start(). Set dc->writeback_running to true
      there before bch_writeback_queue() is called, as the writeback thread
      checks dc->writeback_running before writing back dirty data, and hangs
      if it finds the flag false.
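      
      A sketch of where the flag ends up, based only on the description
      above (not the literal diff):
      
      	/* bch_cached_dev_writeback_init(): no writeback kthread yet */
      	dc->writeback_running = false;
      
      	/* bch_cached_dev_writeback_start(): thread created and woken here */
      	dc->writeback_running = true;	/* must be set before queueing */
      	bch_writeback_queue(dc);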
      
      After the change, we can get the following output for a fresh standalone
      backing device:
      -----------------------
      /sys/block/bcache0/bcache$ cat writeback_running
      0
      /sys/block/bcache0/bcache# cat writeback_rate_debug
      rate:		0.0k/sec
      dirty:		0.0k
      target:		0.0k
      proportional:	0.0k
      integral:	0.0k
      change:		0.0k/sec
      next io:	0ms
      
      v1 -> v2:
        Set dc->writeback_running before bch_writeback_queue() is called.
      
      Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: update comment in sysfs.c · 4e361e02
      Shenghui Wang authored
      
      
      We have struct cached_dev allocated by kzalloc in register_bcache(),
      which initializes all the fields of cached_dev to 0. And commit
      ce4c3e19 ("bcache: Replace bch_read_string_list() by
      __sysfs_match_string()") removed the string "default".
      
      Update the comment.
      
      Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: update comment for bch_data_insert · 3db4d078
      Shenghui Wang authored
      
      
      Commit 220bb38c ("bcache: Break up struct search") introduced changes
      to struct search and s->iop; bypass/bio are now fields of struct
      data_insert_op. Update the comment.
      
      Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: do not check if debug dentry is ERR or NULL explicitly on remove · ae171023
      Shenghui Wang authored
      
      
      debugfs_remove and debugfs_remove_recursive check whether the dentry
      pointer is NULL or an ERR pointer, and do nothing in that case.
      
      Remove the check in cache_set_free and bch_debug_init.
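      
      The simplification follows the usual pattern for NULL/ERR-safe
      teardown helpers; a sketch (the dentry field name is illustrative):
      
      	/* before */
      	if (!IS_ERR_OR_NULL(c->debug))
      		debugfs_remove(c->debug);
      
      	/* after: debugfs_remove() handles NULL/ERR pointers itself */
      	debugfs_remove(c->debug);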
      
      Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: add comment for cache_set->fill_iter · d2f96f48
      Shenghui Wang authored
      
      
      We have the following definition for the btree iterator:
      	struct btree_iter {
      		size_t size, used;
      	#ifdef CONFIG_BCACHE_DEBUG
      		struct btree_keys *b;
      	#endif
      		struct btree_iter_set {
      			struct bkey *k, *end;
      		} data[MAX_BSETS];
      	};
      
      We can see that the data[] field has a static length of MAX_BSETS,
      which is currently defined as 4.
      
      But a btree node on disk could have too many bsets for an iterator to
      fit on the stack - maybe far more than MAX_BSETS. We have to
      dynamically allocate space to hold more btree_iter_sets.
      
      bch_cache_set_alloc() will make sure the pool cache_set->fill_iter can
      allocate an iterator with enough room to hold
      	(sb.bucket_size / sb.block_size)
      btree_iter_sets, which is more than the static MAX_BSETS.
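      
      A sketch of the sizing, with details abridged (see
      bch_cache_set_alloc() for the exact code):
      
      	iter_size = sizeof(struct btree_iter) +
      		    (sb.bucket_size / sb.block_size) *
      		    sizeof(struct btree_iter_set);
      	c->fill_iter = mempool_create_kmalloc_pool(1, iter_size);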
      
      bch_btree_node_read_done() will use that pool to allocate one iterator
      large enough to hold all the bsets in one btree node.
      
      Add more comments around cache_set->fill_iter to make the code less
      confusing.
      
      Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. Dec 11, 2018
    • dm: fix request-based dm's use of dm_wait_for_completion · c4576aed
      Mike Snitzer authored
      
      
      The md->wait waitqueue is used by both bio-based and request-based DM.
      Commit dbd3bbd2 ("dm rq: leverage blk_mq_queue_busy() to check for
      outstanding IO") lost sight of the requirement that
      dm_wait_for_completion() must work with all types of DM devices.
      
      Fix md_in_flight() to call the blk-mq or bio-based method accordingly.
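      
      The shape of the fix, as a sketch (the queue-type check and the
      bio-based helper name are illustrative):
      
      	static bool md_in_flight(struct mapped_device *md)
      	{
      		if (queue_is_mq(md->queue))
      			return blk_mq_queue_busy(md->queue);	/* request-based */
      		return md_in_flight_bios(md);			/* bio-based */
      	}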
      
      Fixes: dbd3bbd2 ("dm rq: leverage blk_mq_queue_busy() to check for outstanding IO")
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • dm: fix inflight IO check · b7934ba4
      Jens Axboe authored
      
      
      After switching to percpu inflight counters, the inflight check is
      totally buggy. It's perfectly valid for some individual counters to be
      non-zero while the total inflight IO count is 0; that's how these
      kinds of counters work (inc on one CPU, dec on another). Fix the
      md_in_flight() check to sum all counters before potentially returning
      a false positive.
      
      While at it, remove the inflight read for IO completion. We don't
      need it, just wake anyone that's waiting for the IO count to drop
      to zero. The caller needs to re-check that value anyway when woken,
      which it does.
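      
      A sketch of the required summing, with illustrative names (counters
      stands in for the per-cpu in-flight counters):
      
      	long sum = 0;
      	int cpu;
      
      	/* an individual entry may be negative; only the total is meaningful */
      	for_each_possible_cpu(cpu)
      		sum += *per_cpu_ptr(counters, cpu);
      
      	return sum != 0;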
      
      Fixes: 6f757231 ("dm: remove the pending IO accounting")
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Reported-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. Oct 25, 2018
    • block: Introduce blk_revalidate_disk_zones() · bf505456
      Damien Le Moal authored
      
      
      Drivers exposing zoned block devices have to initialize and maintain
      correctness (i.e. revalidate) of the device zone bitmaps attached to
      the device request queue (seq_zones_bitmap and seq_zones_wlock).
      
      To simplify coding this, introduce a generic helper function
      blk_revalidate_disk_zones() suitable for most (and likely all) cases.
      This new function always updates the seq_zones_bitmap and
      seq_zones_wlock bitmaps as well as the queue nr_zones field when
      called for a disk using a request-based queue. For a disk using a
      BIO-based queue, only the number of zones is updated since these
      queues do not have schedulers and so do not need the zone bitmaps.
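      
      A sketch of the intended use from a driver's revalidate path (error
      handling abridged):
      
      	if (blk_queue_is_zoned(disk->queue)) {
      		ret = blk_revalidate_disk_zones(disk);
      		if (ret)
      			return ret;
      	}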
      
      With this change, the zone bitmap initialization code in sd_zbc.c can be
      replaced with a call to this function in sd_zbc_read_zones(), which is
      called from the disk revalidate block operation method.
      
      A call to blk_revalidate_disk_zones() is also added to the null_blk
      driver for devices created with the zoned mode enabled.
      
      Finally, to ensure that zoned devices created with dm-linear or
      dm-flakey expose the correct number of zones through sysfs, a call to
      blk_revalidate_disk_zones() is added to dm_table_set_restrictions().
      
      The zone bitmaps allocated and initialized with
      blk_revalidate_disk_zones() are freed automatically from
      __blk_release_queue() using the block internal function
      blk_queue_free_zone_bitmaps().
      
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Reviewed-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: add a report_zones method · e76239a3
      Christoph Hellwig authored
      
      
      Dispatching a report zones command through the request queue is a
      major pain due to the command reply payload rewriting that is
      necessary. Given that blkdev_report_zones() executes everything
      synchronously, implement report zones as a block device file operation
      instead, allowing major simplification of the code in many places.
      
      Since sd, null-blk, dm-linear and dm-flakey are the only block device
      drivers that support exposing zoned block devices, these drivers are
      modified to provide the device-side implementation of the
      report_zones() block device file operation.
      
      For device mappers, a new report_zones() target type operation is
      defined so that upper block layer calls to blkdev_report_zones() can
      be propagated down to the underlying devices of the dm targets.
      Implementation for this new operation is added to the dm-linear and
      dm-flakey targets.
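      
      The new method has roughly the following signature (a sketch; see the
      patch for the exact prototype):
      
      	int (*report_zones)(struct gendisk *disk, sector_t sector,
      			    struct blk_zone *zones, unsigned int *nr_zones,
      			    gfp_t gfp_mask);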
      
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      [Damien]
      * Changed method block_device argument to gendisk
      * Various bug fixes and improvements
      * Added support for null_blk, dm-linear and dm-flakey.
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Reviewed-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Introduce blkdev_nr_zones() helper · a91e1380
      Damien Le Moal authored
      
      
      Introduce the blkdev_nr_zones() helper function to get the total
      number of zones of a zoned block device. This number is always 0 for a
      regular block device (q->limits.zoned == BLK_ZONED_NONE case).
      
      Replace hard-coded number of zones calculation in dmz_get_zoned_device()
      with a call to this helper.
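      
      A sketch of a caller, relying on the 0-for-regular-devices convention
      described above:
      
      	unsigned int nr_zones = blkdev_nr_zones(bdev);
      
      	if (!nr_zones)
      		return -ENXIO;	/* regular (non-zoned) block device */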
      
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. Oct 18, 2018
    • dm zoned: fix various dmz_get_mblock() issues · 3d4e7383
      Damien Le Moal authored
      
      
      dmz_fetch_mblock(), called from dmz_get_mblock(), has a race since the
      allocation of the new metadata block descriptor and its insertion into
      the cache rbtree with the READING state is not atomic. Two different
      contexts requesting the same block may each end up adding a different
      descriptor of the same block to the cache.
      
      Another problem for this function is that the BIO for processing the
      block read is allocated after the metadata block descriptor is inserted
      in the cache rbtree. If the BIO allocation fails, the metadata block
      descriptor is freed without first being removed from the rbtree.
      
      Fix the first problem by checking again, atomically under the
      mblk_lock spinlock, whether the requested block is already in the
      cache right before inserting the newly allocated descriptor. The
      second problem is fixed by simply allocating the BIO before inserting
      the new block in the cache.
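      
      A sketch of the resulting pattern in dmz_get_mblock_slow() (abridged;
      dmz_free_mblock() is illustrative):
      
      	/* BIO was allocated before taking the lock, so no failure path here */
      	spin_lock(&zmd->mblk_lock);
      	mblk = dmz_get_mblock_fast(zmd, mblk_no);
      	if (mblk) {
      		/* lost the race: discard our duplicate descriptor */
      		spin_unlock(&zmd->mblk_lock);
      		dmz_free_mblock(zmd, new_mblk);
      		bio_put(bio);
      		return mblk;
      	}
      	/* still absent: insert new_mblk into the rbtree in READING state */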
      
      Finally, since dmz_fetch_mblock() also increments a block reference
      counter, rename the function to dmz_get_mblock_slow(). To be symmetric
      and clear, also rename dmz_lookup_mblock() to dmz_get_mblock_fast() and
      increment the block reference counter directly in that function rather
      than in dmz_get_mblock().
      
      Fixes: 3b1a94c8 ("dm zoned: drive-managed zoned block device target")
      Cc: stable@vger.kernel.org
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm zoned: fix metadata block ref counting · 33c2865f
      Damien Le Moal authored
      
      
      Since the ref field of struct dmz_mblock is always used with the
      spinlock of struct dmz_metadata locked, there is no need to use an
      atomic_t type. Change the type of the ref field to an unsigned
      integer.
      
      Fixes: 3b1a94c8 ("dm zoned: drive-managed zoned block device target")
      Cc: stable@vger.kernel.org
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm raid: avoid bitmap with raid4/5/6 journal device · d857ad75
      Heinz Mauelshagen authored
      
      
      With raid4/5/6, journal device and write intent bitmap are mutually exclusive.
      
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • md-cluster: remove suspend_info · ea89238c
      Guoqing Jiang authored
      
      
      Previously, we allowed multiple nodes to resync a device, but that was
      changed so that only one node can do resync at a time; however,
      suspend_info is still used.
      
      Now, let's remove the structure and use suspend_lo/hi to record the
      range.
      
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster: send BITMAP_NEEDS_SYNC message if reshaping is interrupted · cb9ee154
      Guoqing Jiang authored
      
      
      We need to continue the reshaping if it was interrupted on the
      original node, so the original node should call resync_bitmap in case
      the reshaping is aborted.
      
      The BITMAP_NEEDS_SYNC message is then broadcast to the other nodes,
      and the node which continues the reshaping should restart the reshape
      from mddev->reshape_position instead of from the very beginning.
      
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster/bitmap: don't call md_bitmap_sync_with_cluster during reshaping stage · cbce6863
      Guoqing Jiang authored
      
      
      When a reshape is happening on one node, other nodes could receive
      lots of RESYNCING messages, and md_bitmap_sync_with_cluster is called
      for each of them.
      
      Since the resync window in these RESYNCING messages is typically
      small, the WARN is always triggered, so we should not call the
      function while a reshape is happening.
      
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster/raid10: don't call remove_and_add_spares during reshaping stage · ca1e98e0
      Guoqing Jiang authored
      
      
      remove_and_add_spares is not needed if a reshape is happening on
      another node, because raid10_add_disk, called inside
      raid10_start_reshape, handles the disk role changes. Plus,
      remove_and_add_spares can't deal with role changes due to reshape.
      
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster/raid10: call update_size in md_reap_sync_thread · aefb2e5f
      Guoqing Jiang authored
      
      
      We need to change the capacity on all nodes after one node finishes a
      reshape. And as we did before, we can't change the capacity directly
      in md_do_sync; instead, the capacity should only be changed in
      update_size or on receiving a CHANGE_CAPACITY message.
      
      So the master node calls update_size after it completes the reshape in
      md_reap_sync_thread, but we need to skip ops->update_size if
      MD_CLOSING is set, since the reshape may not have finished.
      
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster: introduce resync_info_get interface for sanity check · 5ebaf80b
      Guoqing Jiang authored
      
      
      Since a resync region from suspend_info means one node is reshaping
      that area, the position of reshape_progress should be included in the
      area.
      
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster/raid10: support add disk under grow mode · 7564beda
      Guoqing Jiang authored
      
      
      For the clustered raid10 scenario, we need to let all nodes know that
      a new disk has been added to the array. The reshape caused by adding a
      new member only needs to happen on one node, but the other nodes
      should know about the change.
      
      Reshape means reading data from somewhere already in use by the array
      and writing it to an unused region, so it is obviously bad if one node
      is reading data from an address while another node is writing to the
      same address. Considering we have already implemented suspending
      writes in the resyncing area, we can just broadcast the reading
      address to the other nodes to avoid the trouble.
      
      The master node calls reshape_request and then updates the superblock
      during the reshape period. To avoid the above trouble, we call
      resync_info_update to send a RESYNC message in reshape_request.
      
      Then, from the slave node's point of view, it receives two types of messages:
      1. RESYNCING message
      The slave node adds the address (where the master node is reading data
      from) to its suspend list.
      
      2. METADATA_UPDATED message
      Once a slave node knows the reshaping has started on the master node,
      it is time to update the reshape position and call start_reshape to
      follow the master node's steps. After the reshape is done, only the
      reshape position needs to be updated, so the majority of the reshaping
      work happens on the master node.
      
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster/raid10: resize all the bitmaps before start reshape · afd75628
      Guoqing Jiang authored
      
      
      To support adding a disk under grow mode, we need to resize all the
      bitmaps on each node before the reshape, so that we can ensure all
      nodes have the same view of the bitmap of the clustered raid.
      
      So after the master node has resized the bitmap, it broadcasts a
      message to the other slave nodes, and it checks whether the size of
      each bitmap is the same by comparing pages. We can only continue the
      reshaping after all nodes have updated the bitmap to the same size (by
      checking the pages); otherwise, the bitmap size is reverted to the
      previous value.
      
      The resize_bitmaps interface and BITMAP_RESIZE message are
      introduced in md-cluster.c for the purpose.
      
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • dm crypt: make workqueue names device-specific · ed0302e8
      Michał Mirosław authored
      
      
      Make cpu-usage debugging easier by naming workqueues per device.
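      
      A sketch of the naming change (flags abridged; devname is the dm
      device name, e.g. "253:0"):
      
      	cc->io_queue = alloc_workqueue("kcryptd_io/%s",
      				       WQ_HIGHPRI | WQ_MEM_RECLAIM, 1, devname);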
      
      Example ps output:
      
      root       413  0.0  0.0      0     0 ?        I<   paź02   0:00  [kcryptd_io/253:0]
      root       414  0.0  0.0      0     0 ?        I<   paź02   0:00  [kcryptd/253:0]
      root       415  0.0  0.0      0     0 ?        S    paź02   1:10  [dmcrypt_write/253:0]
      root       465  0.0  0.0      0     0 ?        I<   paź02   0:00  [kcryptd_io/253:2]
      root       466  0.0  0.0      0     0 ?        I<   paź02   0:00  [kcryptd/253:2]
      root       467  0.0  0.0      0     0 ?        S    paź02   2:06  [dmcrypt_write/253:2]
      root     15359  0.2  0.0      0     0 ?        I<   19:43   0:25  [kworker/u17:8-kcryptd/253:0]
      root     16563  0.2  0.0      0     0 ?        I<   20:10   0:18  [kworker/u17:0-kcryptd/253:2]
      root     23205  0.1  0.0      0     0 ?        I<   21:21   0:04  [kworker/u17:4-kcryptd/253:0]
      root     13383  0.1  0.0      0     0 ?        I<   21:32   0:02  [kworker/u17:2-kcryptd/253:2]
      root      2610  0.1  0.0      0     0 ?        I<   21:42   0:01  [kworker/u17:12-kcryptd/253:2]
      root     20124  0.1  0.0      0     0 ?        I<   21:56   0:01  [kworker/u17:1-kcryptd/253:2]
      
      Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm: add dm_table_device_name() · f349b0a3
      Michał Mirosław authored
      
      
      Add a shortcut for dm_device_name(dm_table_get_md(t)).
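      
      Going by the description, the helper is essentially:
      
      	const char *dm_table_device_name(struct dm_table *t)
      	{
      		return dm_device_name(dm_table_get_md(t));
      	}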
      
      Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm ioctl: harden copy_params()'s copy_from_user() from malicious users · 800a7340
      Wenwen Wang authored
      
      
      In copy_params(), the struct 'dm_ioctl' is first copied from the user
      space buffer 'user' to 'param_kernel' and the field 'data_size' is
      checked against 'minimum_data_size' (size of 'struct dm_ioctl' payload
      up to its 'data' member).  If the check fails, an error code EINVAL will be
      returned.  Otherwise, param_kernel->data_size is used to do a second copy,
      which copies from the same user-space buffer to 'dmi'.  After the second
      copy, only 'dmi->data_size' is checked against 'param_kernel->data_size'.
      Given that the buffer 'user' resides in the user space, a malicious
      user-space process can race to change the content in the buffer between
      the two copies.  This way, the attacker can inject inconsistent data
      into 'dmi' (versus previously validated 'param_kernel').
      
      Fix redundant copying of 'minimum_data_size' from user-space buffer by
      using the first copy stored in 'param_kernel'.  Also remove the
      'data_size' check after the second copy because it is now unnecessary.
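      
      A sketch of the hardened sequence (variable names follow the
      description above):
      
      	/* reuse the already-validated header instead of re-reading it */
      	memcpy(dmi, &param_kernel, minimum_data_size);
      
      	/* fetch only the payload beyond the header from user space */
      	if (copy_from_user((char *)dmi + minimum_data_size,
      			   (char __user *)user + minimum_data_size,
      			   param_kernel.data_size - minimum_data_size))
      		goto bad;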
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Wenwen Wang <wang6495@umn.edu>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>