- Dec 13, 2018
-
-
Shenghui Wang authored
A fresh backing device is not attached to any cache_set, and has no writeback kthread created until it is first attached to some cache_set. But bch_cached_dev_writeback_init() runs

    dc->writeback_running = true;
    WARN_ON(test_and_clear_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags));

for any newly formatted backing device. For a fresh standalone backing device we can therefore get something like the following even though no writeback kthread has been created:
------------------------
/sys/block/bcache0/bcache# cat writeback_running
1
/sys/block/bcache0/bcache# cat writeback_rate_debug
rate:           512.0k/sec
dirty:          0.0k
target:         0.0k
proportional:   0.0k
integral:       0.0k
change:         0.0k/sec
next io:        -15427384ms

The non-zero fields are misleading, as no writeback kthread is alive yet. Set dc->writeback_running to false in bch_cached_dev_writeback_init(), since no writeback thread is created there. The writeback thread is created and woken up in bch_cached_dev_writeback_start(); set dc->writeback_running to true there, before bch_writeback_queue() is called, because the writeback thread checks dc->writeback_running before writing back dirty data and would hang if it is false. After the change, we get the following output for a fresh standalone backing device:
-----------------------
/sys/block/bcache0/bcache$ cat writeback_running
0
/sys/block/bcache0/bcache# cat writeback_rate_debug
rate:           0.0k/sec
dirty:          0.0k
target:         0.0k
proportional:   0.0k
integral:       0.0k
change:         0.0k/sec
next io:        0ms

v1 -> v2: Set dc->writeback_running before bch_writeback_queue() is called. Signed-off-by:
Shenghui Wang <shhuiw@foxmail.com> Signed-off-by:
Coly Li <colyli@suse.de> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
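A minimal sketch of the resulting shape, assuming the field and function names quoted in the message (bodies heavily simplified, not the verbatim patch):

    void bch_cached_dev_writeback_init(struct cached_dev *dc)
    {
            /* ... other initialisation ... */
            /* no writeback kthread exists yet, so leave the flag clear */
            dc->writeback_running = false;
            WARN_ON(test_and_clear_bit(BCACHE_DEV_WB_RUNNING, &dc->disk.flags));
    }

    int bch_cached_dev_writeback_start(struct cached_dev *dc)
    {
            dc->writeback_thread = kthread_create(bch_writeback_thread, dc,
                                                  "bcache_writeback");
            if (IS_ERR(dc->writeback_thread))
                    return PTR_ERR(dc->writeback_thread);

            /* set before queueing: the thread checks it before writing back */
            dc->writeback_running = true;
            bch_writeback_queue(dc);
            return 0;
    }

With this ordering, sysfs reports writeback_running as 0 until the device is attached and the thread actually exists.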
-
Shenghui Wang authored
We have struct cached_dev allocated by kzalloc() in register_bcache(), which initializes all the fields of cached_dev with 0s. And commit ce4c3e19 ("bcache: Replace bch_read_string_list() by __sysfs_match_string()") has removed the string "default". Update the comment. Signed-off-by:
Shenghui Wang <shhuiw@foxmail.com> Signed-off-by:
Coly Li <colyli@suse.de> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Shenghui Wang authored
commit 220bb38c ("bcache: Break up struct search") introduced changes to struct search and s->iop. bypass/bio are fields of struct data_insert_op now. Update the comment. Signed-off-by:
Shenghui Wang <shhuiw@foxmail.com> Signed-off-by:
Coly Li <colyli@suse.de> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Shenghui Wang authored
debugfs_remove() and debugfs_remove_recursive() will check whether the dentry pointer is NULL or an error pointer, and will do nothing in that case. Remove the redundant check in cache_set_free() and bch_debug_init(). Signed-off-by:
Shenghui Wang <shhuiw@foxmail.com> Signed-off-by:
Coly Li <colyli@suse.de> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
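An illustrative before/after of the caller pattern being removed (the dentry field name is taken from bcache, but the exact call sites may differ):

    /* before: each caller guarded the call itself */
    if (!IS_ERR_OR_NULL(c->debug))
            debugfs_remove(c->debug);

    /* after: debugfs_remove() already ignores NULL and ERR_PTR dentries */
    debugfs_remove(c->debug);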
-
Shenghui Wang authored
We have the following definition for the btree iterator:

    struct btree_iter {
            size_t size, used;
    #ifdef CONFIG_BCACHE_DEBUG
            struct btree_keys *b;
    #endif
            struct btree_iter_set {
                    struct bkey *k, *end;
            } data[MAX_BSETS];
    };

We can see that the length of the data[] field is the static MAX_BSETS, which is currently defined as 4. But a btree node on disk could have too many bsets for an iterator to fit on the stack - maybe far more than MAX_BSETS. We have to dynamically allocate space to hold more btree_iter_sets. bch_cache_set_alloc() makes sure the pool cache_set->fill_iter can allocate an iterator with enough room to hold (sb.bucket_size / sb.block_size) btree_iter_sets, which is more than the static MAX_BSETS. bch_btree_node_read_done() uses that pool to allocate one iterator, to hold the many bsets in one btree node. Add more comments around cache_set->fill_iter to make the code less confusing. Signed-off-by:
Shenghui Wang <shhuiw@foxmail.com> Signed-off-by:
Coly Li <colyli@suse.de> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
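A sketch of the sizing idea, using the names from the message (allocation flags and error handling omitted, details approximate):

    /* room for every bset a btree node on disk can contain */
    unsigned int iter_size = sizeof(struct btree_iter) +
                             (sb.bucket_size / sb.block_size) *
                             sizeof(struct btree_iter_set);

    c->fill_iter = mempool_create_kmalloc_pool(1, iter_size);

    /* in bch_btree_node_read_done(): one iterator big enough for the node */
    struct btree_iter *iter = mempool_alloc(b->c->fill_iter, GFP_NOIO);
    iter->size = b->c->sb.bucket_size / b->c->sb.block_size;
    iter->used = 0;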
-
- Dec 11, 2018
-
-
Igor Konopko authored
When using pblk with 0-sized metadata, both the ppa list and the meta list point to the same memory, since pblk_dma_meta_size() returns 0 in that case. This patch fixes that issue by ensuring that pblk_dma_meta_size() always returns space equal to sizeof(struct pblk_sec_meta), so that the ppa list and the meta list point to different memory addresses. Even though in that case the drive does not really care about the meta_list pointer, this is the easiest way to fix the issue without introducing changes in many places in the code just for the 0-sized metadata case. The same approach is also needed for pblk_get_sec_meta(), since we also cannot point to the same memory address in the meta buffer when using it for the pblk recovery process. Reported-by:
Hans Holmberg <hans.holmberg@cnexlabs.com> Tested-by:
Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by:
Igor Konopko <igor.j.konopko@intel.com> Signed-off-by:
Matias Bjørling <mb@lightnvm.io> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
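A hedged sketch of the guarantee described above; the helper name is from the message, while the body and the pblk field it reads are assumptions:

    /* never report 0: keeps the ppa list and meta list at distinct offsets */
    static inline int pblk_dma_meta_size(struct pblk *pblk)
    {
            return max_t(int, sizeof(struct pblk_sec_meta),
                         pblk->oob_meta_size) * NVM_MAX_VLBA;
    }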
-
Igor Konopko authored
pblk performs recovery of open lines by storing the LBA in the per-LBA metadata field. Recovery therefore only works for drives that have this field. This patch adds support for packed metadata, which stores the l2p mapping for open lines in the last sector of every write unit, and enables drives without per-IO metadata to recover open lines. After this patch, drives with an OOB size <16B will use packed metadata, and a metadata size larger than 16B will continue to use the device per-IO metadata. Reviewed-by:
Javier González <javier@cnexlabs.com> Signed-off-by:
Igor Konopko <igor.j.konopko@intel.com> Signed-off-by:
Matias Bjørling <mb@lightnvm.io> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Igor Konopko authored
Currently pblk only checks the size of the I/O metadata and does not take into account whether this metadata is in a separate buffer or interleaved in a single metadata buffer. In reality only the first scenario is supported, and the second mode would break pblk functionality during any I/O operation. This patch prevents pblk from being instantiated when the device only supports interleaved metadata. Reviewed-by:
Javier González <javier@cnexlabs.com> Signed-off-by:
Igor Konopko <igor.j.konopko@intel.com> Signed-off-by:
Matias Bjørling <mb@lightnvm.io> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Igor Konopko authored
Currently lightnvm and pblk use a single DMA pool, for which the entry size is always equal to PAGE_SIZE. The contents of each entry allocated from the DMA pool consist of a PPA list (8 bytes * 64), leaving 56 bytes * 64 of space for metadata. Since the metadata field can be bigger, such as 128 bytes, the static size does not cover this use case. This patch adds support for I/O metadata above 56 bytes by changing the DMA pool size based on the device meta size, and allows pblk to use OOB metadata >=16B. Reviewed-by:
Javier González <javier@cnexlabs.com> Signed-off-by:
Igor Konopko <igor.j.konopko@intel.com> Signed-off-by:
Matias Bjørling <mb@lightnvm.io> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
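A sketch of the sizing rule the message describes, using the lightnvm geometry field for per-sector OOB size (geo->sos); the pool handle and other variable names are illustrative:

    /* one PPA (8 bytes) plus the device-reported OOB metadata per sector */
    size_t entry_size = NVM_MAX_VLBA * (sizeof(u64) + geo->sos);

    /* keep at least the old PAGE_SIZE entries for small-metadata devices */
    entry_size = max_t(size_t, entry_size, PAGE_SIZE);

    dev->dma_pool = dma_pool_create("ppa_list", dev->parent, entry_size,
                                    PAGE_SIZE, 0);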
-
Igor Konopko authored
pblk currently assumes that the size of the OOB metadata on the drive is always equal to the size of struct pblk_sec_meta. This commit adds helpers that will allow handling different sizes of OOB metadata on the drive in the future. After this patch, only OOB metadata equal to 16 bytes is supported. Reviewed-by:
Javier González <javier@cnexlabs.com> Signed-off-by:
Igor Konopko <igor.j.konopko@intel.com> Signed-off-by:
Matias Bjørling <mb@lightnvm.io> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
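A sketch of the kind of accessor the message refers to; both the helper and the stride function names here are illustrative, not the actual patch:

    /* after this patch only the 16-byte pblk_sec_meta layout is supported */
    static inline size_t pblk_oob_meta_stride(struct pblk *pblk)
    {
            return sizeof(struct pblk_sec_meta);
    }

    static inline struct pblk_sec_meta *pblk_get_meta(struct pblk *pblk,
                                                      void *meta, int index)
    {
            return meta + pblk_oob_meta_stride(pblk) * index;
    }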
-
Igor Konopko authored
Currently DMA-allocated memory is reused on partial read for the lba_list_mem and lba_list_media arrays. In preparation for dynamic DMA pool sizes we need to move these arrays into the pblk_pr_ctx structure. Reviewed-by:
Javier González <javier@cnexlabs.com> Signed-off-by:
Igor Konopko <igor.j.konopko@intel.com> Signed-off-by:
Matias Bjørling <mb@lightnvm.io> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Javier González authored
The current kref implementation around the pblk global caches triggers a false positive on refcount_inc_checked() (when called), as the kref is initialized to 0. Instead of using kref_get() on a 0 reference, which is in principle correct, use kref_init() to avoid the check. This is also more explicit about what actually happens on cache creation. In the process, do a small refactoring to use the kref helpers. Fixes: 1864de94 "lightnvm: pblk: stop recreating global caches" Signed-off-by:
Javier González <javier@cnexlabs.com> Reviewed-by:
Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by:
Matias Bjørling <mb@lightnvm.io> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
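A before/after sketch of the refcount initialisation change; the global cache object name is illustrative:

    /* before: the kref starts at 0 and the first user bumps it,
     * which trips the refcount checking on an increment-from-zero */
    kref_get(&pblk_caches.kref);

    /* after: creation explicitly owns the first reference */
    kref_init(&pblk_caches.kref);   /* sets the refcount to 1 */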
-
Matias Bjørling authored
Currently the geometry of an OCSSD is enumerated using a two-step approach: first, nvm_register is called and the OCSSD identify command is issued; second, the geometry sos and csecs values are read either from the OCSSD identify data if it is a 1.2 drive, or from the NVMe namespace data structure if it is a 2.0 device. This patch recombines it into a single step, such that nvm_register can use the csecs and sos fields independently of which version is used. This enables dynamically sizing the lightnvm subsystem DMA pool. Reviewed-by:
Igor Konopko <igor.j.konopko@intel.com> Reviewed-by:
Javier González <javier@cnexlabs.com> Signed-off-by:
Matias Bjørling <mb@lightnvm.io> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Javier González authored
pblk's recovery path is single threaded and therefore a number of assumptions regarding concurrency can be made. To avoid confusion, make this explicit with a couple of comments in the code. Signed-off-by:
Javier González <javier@cnexlabs.com> Signed-off-by:
Matias Bjørling <mb@lightnvm.io> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Hua Su authored
Protect the list_add on the pblk_line_init_bb() error path in case this code is used for some other purpose in the future. Signed-off-by:
Hua Su <suhua.tanke@gmail.com> Reviewed-by:
Javier González <javier@cnexlabs.com> Signed-off-by:
Matias Bjørling <mb@lightnvm.io> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Hua Su authored
Signed-off-by:
Hua Su <suhua.tanke@gmail.com> Updated description. Signed-off-by:
Matias Bjørling <mb@lightnvm.io> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Hans Holmberg authored
Remove the call to pblk_line_replace_data as it returns directly because we have not set l_mg->data_next yet. Signed-off-by:
Hans Holmberg <hans.holmberg@cnexlabs.com> Reviewed-by:
Javier González <javier@javigon.com> Signed-off-by:
Matias Bjørling <mb@lightnvm.io> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Hans Holmberg authored
The chunk metadata is allocated with vmalloc, so we need to use vfree to free it. Fixes: 090ee26f ("lightnvm: use internal allocation for chunk log page") Signed-off-by:
Hans Holmberg <hans.holmberg@cnexlabs.com> Reviewed-by:
Javier González <javier@javigon.com> Signed-off-by:
Matias Bjørling <mb@lightnvm.io> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
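The allocator/free pairing being fixed, as a minimal sketch (variable name illustrative):

    chunk_meta = vmalloc(size);
    if (!chunk_meta)
            return -ENOMEM;

    /* ... use the chunk log page ... */

    vfree(chunk_meta);      /* was kfree(), which must not be used on vmalloc memory */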
-
Hans Holmberg authored
ADDR_POOL_SIZE is not used anymore, so remove the macro. Signed-off-by:
Hans Holmberg <hans.holmberg@cnexlabs.com> Reviewed-by:
Javier González <javier@javigon.com> Signed-off-by:
Matias Bjørling <mb@lightnvm.io> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Hans Holmberg authored
In a worst-case scenario (random writes), OP% of the sectors in each line will be invalid, and we will then need to move data out of 100/OP% lines to free a single line. So, to prevent the possibility of running out of lines, temporarily block user writes when there are fewer than 100/OP% free lines. Also ensure that pblk creation does not produce instances with insufficient over-provisioning. Insufficient over-provisioning is not a problem on real hardware, but it is often an issue when running QEMU simulations (with few lines). 100 lines is enough to create a sane instance with the standard (11%) over-provisioning. Signed-off-by:
Hans Holmberg <hans.holmberg@cnexlabs.com> Reviewed-by:
Javier González <javier@javigon.com> Signed-off-by:
Matias Bjørling <mb@lightnvm.io> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
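A worked sketch of the threshold implied by the reasoning above (helper name and rounding choice are illustrative): with 11% over-provisioning, user writes are blocked once fewer than roughly 100/11 ≈ 10 lines are free.

    /* minimum number of free lines before user writes are stalled */
    static unsigned int pblk_stall_threshold(unsigned int op_pct)
    {
            return DIV_ROUND_UP(100, op_pct);       /* e.g. 100/11 -> 10 */
    }

    if (free_lines < pblk_stall_threshold(op_pct))
            stop_user_writes();     /* illustrative; resume once GC frees lines */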
-
Hans Holmberg authored
If mapping fails (i.e. when running out of lines), handle the error and stop writing. Signed-off-by:
Hans Holmberg <hans.holmberg@cnexlabs.com> Reviewed-by:
Javier González <javier@javigon.com> Signed-off-by:
Matias Bjørling <mb@lightnvm.io> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Hans Holmberg authored
Lines inflicted with write errors might be recovered if they have not been recycled after write error garbage collection. Ensure that the emeta accounting of valid lbas is correct for such lines to avoid recovery inconsistencies. Signed-off-by:
Hans Holmberg <hans.holmberg@cnexlabs.com> Reviewed-by:
Javier González <javier@javigon.com> Signed-off-by:
Matias Bjørling <mb@lightnvm.io> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Hans Holmberg authored
Make sure we only look up valid lba addresses on the resubmission path. If an lba is invalidated in the write buffer, that sector will be submitted to disk (as it is already mapped to a ppa), and that write might fail, resulting in a crash when trying to look up the lba in the mapping table (as the lba is marked as invalid). Signed-off-by:
Hans Holmberg <hans.holmberg@cnexlabs.com> Reviewed-by:
Javier González <javier@javigon.com> Signed-off-by:
Matias Bjørling <mb@lightnvm.io> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
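A sketch of the guard on the resubmission path; ADDR_EMPTY is pblk's marker for an invalidated mapping, while the surrounding loop and variable names are illustrative:

    for (i = 0; i < nr_entries; i++) {
            u64 lba = le64_to_cpu(lba_list[i]);

            if (lba == ADDR_EMPTY)
                    continue;       /* invalidated in the write buffer: no L2P lookup */

            /* only now is it safe to look the lba up in the mapping table */
    }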
-
Hans Holmberg authored
The check for chunk closes suffers from an off-by-one issue, leading to chunk close events not being traced. Fixes: 4c44abf4 ("lightnvm: pblk: add trace events for chunk states") Signed-off-by:
Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by:
Matias Bjørling <mb@lightnvm.io> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Geert Uytterhoeven authored
With gcc 4.1:

    drivers/lightnvm/core.c: In function ‘nvm_get_bb_meta’:
    drivers/lightnvm/core.c:977: warning: ‘ret’ may be used uninitialized in this function

and

    drivers/nvme/host/lightnvm.c: In function ‘nvme_nvm_get_chk_meta’:
    drivers/nvme/host/lightnvm.c:580: warning: ‘ret’ may be used uninitialized in this function

Indeed, if (for the former) the number of channels or LUNs is zero, or (for both) the passed number of chunks is zero, ret will be returned uninitialized. Fix this by preinitializing ret to zero. Fixes: aff3fb18 ("lightnvm: move bad block and chunk state logic to core") Fixes: a294c199 ("lightnvm: implement get log report chunk helpers") Signed-off-by:
Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by:
Matias Bjørling <mb@lightnvm.io> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
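A minimal sketch of the fix for both warnings: the loops may run zero times, so the return value must have a defined initial value (the callee name here is illustrative).

    int ret = 0;    /* preinitialise: the loop below may not execute at all */

    for (i = 0; i < nchks; i++) {
            ret = fetch_chunk_meta(i);      /* illustrative callee */
            if (ret)
                    break;
    }

    return ret;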
-
Zhoujie Wu authored
The smeta area l2p mapping is empty, and the recovery procedure actually only needs to restore the data sectors' l2p mapping. So ignore the smeta OOB scan. Signed-off-by:
Zhoujie Wu <zjwu@marvell.com> Reviewed-by:
Javier González <javier@javigon.com> Reviewed-by:
Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by:
Matias Bjørling <mb@lightnvm.io> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Mike Snitzer authored
The md->wait waitqueue is used by both bio-based and request-based DM. Commit dbd3bbd2 ("dm rq: leverage blk_mq_queue_busy() to check for outstanding IO") lost sight of the requirement that dm_wait_for_completion() must work with all types of DM devices. Fix md_in_flight() to call the blk-mq or bio-based method accordingly. Fixes: dbd3bbd2 ("dm rq: leverage blk_mq_queue_busy() to check for outstanding IO") Signed-off-by:
Mike Snitzer <snitzer@redhat.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
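A sketch of the dispatch the fix restores; md_in_flight() and blk_mq_queue_busy() are named in the message, while the mq predicate and the bio-based helper name are assumptions:

    static bool md_in_flight(struct mapped_device *md)
    {
            if (queue_is_mq(md->queue))
                    return blk_mq_queue_busy(md->queue);    /* request-based */

            return md_in_flight_bios(md);   /* bio-based percpu counters */
    }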
-
Jens Axboe authored
Guenter reported a boot hang issue on HPPA after we defaulted to 0 poll queues. We have two issues in the queue count calculations:

1) We don't separate the poll queues from the read/write queues. This is important, since the former don't need interrupts.

2) The adjust logic is broken.

Adjust the poll queue count before calling nvme_calc_io_queues(). The poll queue count is only limited by the IO queue count we were able to get from the controller, not by failures in the IRQ allocation loop. This leaves nvme_calc_io_queues() just adjusting the read/write queue map. Reported-by:
Guenter Roeck <linux@roeck-us.net> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Sagi Grimberg <sagi@grimberg.me> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
After switching to percpu inflight counters, the inflight check is totally buggy. It's perfectly valid for some counters to be non-zero while the total inflight IO count is 0; that's how these kinds of counters work (inc on one CPU, dec on another). Fix the md_in_flight() check to sum all counters before potentially returning a false positive. While at it, remove the inflight read for IO completion. We don't need it; just wake anyone that's waiting for the IO count to drop to zero. The caller needs to re-check that value anyway when woken, which it does. Fixes: 6f757231 ("dm: remove the pending IO accounting") Acked-by:
Mike Snitzer <snitzer@redhat.com> Reported-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
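A sketch of why summation is needed, with an assumed percpu counter field: each per-cpu value on its own can be negative (completions land on a different CPU than submissions), so only the total across CPUs is meaningful.

    static bool md_in_flight_bios(struct mapped_device *md)
    {
            int cpu;
            long sum = 0;

            for_each_possible_cpu(cpu)
                    sum += *per_cpu_ptr(md->pending_io, cpu);   /* field name assumed */

            return sum != 0;
    }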
-
- Dec 10, 2018
-
-
Jens Axboe authored
For cases where we can only fail with IO in-flight, we should be using BLK_STS_DEV_RESOURCE instead of BLK_STS_RESOURCE. The latter refers to system wide resource constraints. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Arnd Bergmann authored
The "cmd_slot_unal" semaphore is never used in a blocking way but only as an atomic counter. Change the code to using atomic_dec_if_positive() as a better API. Signed-off-by:
Arnd Bergmann <arnd@arndb.de> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
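A before/after sketch of the pattern swap; the cmd_slot_unal field is from the message, while the surrounding acquire/release sites are illustrative:

    /* before: a semaphore used purely as a counter, never blocked on */
    if (down_trylock(&dd->port->cmd_slot_unal))
            return BLK_STS_RESOURCE;        /* no unaligned slot available */
    /* ... issue the command, then: */
    up(&dd->port->cmd_slot_unal);

    /* after: a plain atomic with the same semantics */
    if (atomic_dec_if_positive(&dd->port->cmd_slot_unal) < 0)
            return BLK_STS_RESOURCE;
    /* ... issue the command, then: */
    atomic_inc(&dd->port->cmd_slot_unal);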
-
Mikulas Patocka authored
Remove the "pending" atomic counters, that duplicate block-core's in_flight counters, and update md_in_flight() to look at percpu in_flight counters. Signed-off-by:
Mikulas Patocka <mpatocka@redhat.com> Signed-off-by:
Mike Snitzer <snitzer@redhat.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Mike Snitzer authored
All of the part_stat_* and related methods are used with preemption disabled, so there is no need to pass the cpu around to all of them. Just call smp_processor_id() as needed. Suggested-by:
Jens Axboe <axboe@kernel.dk> Signed-off-by:
Mike Snitzer <snitzer@redhat.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
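A sketch of the call-site change (the exact macro arguments before and after are approximate): the cpu argument disappears and the macros derive it internally via smp_processor_id().

    /* before: the caller fetched the cpu and threaded it through */
    cpu = part_stat_lock();
    part_stat_inc(cpu, part, ios[rw]);
    part_stat_unlock();

    /* after: preemption is already disabled, so the cpu is implicit */
    part_stat_lock();
    part_stat_inc(part, ios[rw]);
    part_stat_unlock();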
-
Mike Snitzer authored
Now that request-based dm-multipath only supports blk-mq, make use of the newly introduced blk_mq_queue_busy() to check for outstanding IO -- rather than (ab)using the block core's in_flight counters. Signed-off-by:
Mike Snitzer <snitzer@redhat.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Mikulas Patocka authored
generic_start_io_acct and generic_end_io_acct already update the variable in_flight using atomic operations, so we don't have to overwrite them again. Signed-off-by:
Mikulas Patocka <mpatocka@redhat.com> Signed-off-by:
Mike Snitzer <snitzer@redhat.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Dec 09, 2018
-
-
Michael Chan authored
The CP rings are accounted differently on the new 57500 chips. There must be enough CP rings for the sum of RX and TX rings on the new chips. The current logic may be over-estimating the RX and TX rings. The output parameter max_cp should be the maximum NQs capped by MSIX vectors available for networking in the context of 57500 chips. The existing code which uses CMPL rings capped by the MSIX vectors works most of the time but is not always correct. Signed-off-by:
Michael Chan <michael.chan@broadcom.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Michael Chan authored
The new 57500 chips have introduced the NQ structure in addition to the existing CP rings in all chips. We need to introduce a new bnxt_nq_rings_in_use(). On legacy chips, the 2 functions are the same and one will just call the other. On the new chips, they refer to the 2 separate ring structures. The new function is now called to determine the resource (NQ or CP rings) associated with MSIX that are in use. On 57500 chips, the RDMA driver does not use the CP rings so we don't need to do the subtraction adjustment. Fixes: 41e8d798 ("bnxt_en: Modify the ring reservation functions for 57500 series chips.") Signed-off-by:
Michael Chan <michael.chan@broadcom.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Michael Chan authored
The new 57500 chips use 1 NQ per MSIX vector, whereas legacy chips use 1 CP ring per MSIX vector. To better unify this, add a resv_irqs field to struct bnxt_hw_resc. On legacy chips, we initialize resv_irqs with resv_cp_rings. On new chips, we initialize it with the allocated MSIX resources. Signed-off-by:
Michael Chan <michael.chan@broadcom.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Michael Chan authored
Recent changes to support the 57500 devices have created this regression. The bnxt_hwrm_queue_qportcfg() call was moved to be called earlier before the RDMA support was determined, causing the CoS queues configuration to be set before knowing whether RDMA was supported or not. Fix it by moving it to the right place right after RDMA support is determined. Fixes: 98f04cf0 ("bnxt_en: Check context memory requirements from firmware.") Signed-off-by:
Michael Chan <michael.chan@broadcom.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Tarick Bedeir authored
rx_ppp and tx_ppp can be set between 0 and 255, so don't clamp to 1. Fixes: 6e8814ce ("net/mlx4_en: Fix mixed PFC and Global pause user control requests") Signed-off-by:
Tarick Bedeir <tarick@google.com> Reviewed-by:
Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-