- Apr 16, 2017
-
-
Dan Carpenter authored
WARN_ON() takes a condition, not an error message. I slightly tweaked some conditions so hopefully it's more clear. Signed-off-by:
Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by:
Matias Bjørling <matias@cnexlabs.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Dan Carpenter authored
These labels are reversed so we could end up dereferencing an error pointer or leaking. Fixes: 7f347ba6bb3a ("lightnvm: physical block device (pblk) target") Signed-off-by:
Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by:
Matias Bjørling <matias@cnexlabs.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Javier González authored
This patch introduces pblk, a host-side translation layer for Open-Channel SSDs to expose them like block devices. The translation layer allows data placement decisions, and I/O scheduling to be managed by the host, enabling users to optimize the SSD for their specific workloads. An open-channel SSD has a set of LUNs (parallel units) and a collection of blocks. Each block can be read in any order, but writes must be sequential. Writes may also fail, and if a block requires it, must also be reset before new writes can be applied. To manage the constraints, pblk maintains a logical to physical address (L2P) table, write cache, garbage collection logic, recovery scheme, and logic to rate-limit user I/Os versus garbage collection I/Os. The L2P table is fully-associative and manages sectors at a 4KB granularity. Pblk stores the L2P table in two places, in the out-of-band area of the media and on the last page of a line. In the cause of a power failure, pblk will perform a scan to recover the L2P table. The user data is organized into lines. A line is data striped across blocks and LUNs. The lines enable the host to reduce the amount of metadata to maintain besides the user data and makes it easier to implement RAID or erasure coding in the future. pblk implements multi-tenant support and can be instantiated multiple times on the same drive. Each instance owns a portion of the SSD - both regarding I/O bandwidth and capacity - providing I/O isolation for each case. Finally, pblk also exposes a sysfs interface that allows user-space to peek into the internals of pblk. The interface is available at /dev/block/*/pblk/ where * is the block device name exposed. This work also contains contributions from: Matias Bjørling <matias@cnexlabs.com> Simon A. F. Lund <slund@cnexlabs.com> Young Tack Jin <youngtack.jin@gmail.com> Huaicheng Li <huaicheng@cs.uchicago.edu> Signed-off-by:
Javier González <javier@cnexlabs.com> Signed-off-by:
Matias Bjørling <matias@cnexlabs.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Javier González authored
Convert sprintf calls to strlcpy in order to make possible buffer overflow more obvious. Signed-off-by:
Javier González <javier@cnexlabs.com> Signed-off-by:
Matias Bjørling <matias@cnexlabs.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Javier González authored
sector_t is always unsigned, therefore avoid < 0 checks on it. Signed-off-by:
Javier González <javier@cnexlabs.com> Signed-off-by:
Matias Bjørling <matias@cnexlabs.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Javier González authored
Clean unused variable on lightnvm core. Signed-off-by:
Javier González <javier@cnexlabs.com> Signed-off-by:
Matias Bjørling <matias@cnexlabs.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Javier González authored
Prefix the nvm_free static function with a missing static keyword. Signed-off-by:
Javier González <javier@cnexlabs.com> Signed-off-by:
Matias Bjørling <matias@cnexlabs.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Javier González authored
Target initialization has two responsibilities: creating the target partition and instantiating the target. This patch enables to create a factory partition (e.g., do not trigger recovery on the given target). This is useful for target development and for being able to restore the device state at any moment in time without requiring a full-device erase. Signed-off-by:
Javier González <javier@cnexlabs.com> Signed-off-by:
Matias Bjørling <matias@cnexlabs.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Javier González authored
The NVMe I/O command control bits are 16 bytes, but is interpreted as 32 bytes in the lightnvm user I/O data path. Signed-off-by:
Javier González <javier@cnexlabs.com> Signed-off-by:
Matias Bjørling <matias@cnexlabs.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Javier González authored
Reorder disk allocation such that the disk structure can be put safely. Signed-off-by:
Javier González <javier@cnexlabs.com> Signed-off-by:
Matias Bjørling <matias@cnexlabs.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Javier González authored
The dev->lun_map bits are cleared twice if an target init error occurs. First in the target clean routine, and then next in the nvm_tgt_create error function. Make sure that it is only cleared once by extending nvm_remove_tgt_devi() with a clear bit, such that clearing of bits can ignored when cleaning up a successful initialized target. Signed-off-by:
Javier González <javier@cnexlabs.com> Fix style. Signed-off-by:
Matias Bjørling <matias@cnexlabs.com> Signed-off-by:
Matias Bjørling <matias@cnexlabs.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
NeilBrown authored
mempool_alloc() cannot fail if the gfp flags allow it to sleep, and both GFP_KERNEL and GFP_NOIO allows for sleeping. So rrpc_move_valid_pages() and rrpc_make_rq() don't need to test the return value. Signed-off-by:
NeilBrown <neilb@suse.com> Signed-off-by:
Matias Bjørling <matias@cnexlabs.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Matias Bjørling authored
The asserts in _nvme_nvm_check_size are not compiled due to the function not begin called. Make sure that it is called, and also fix the wrong sizes of asserts for nvme_nvm_addr_format, and nvme_nvm_bb_tbl, which checked for number of bits instead of bytes. Reported-by:
Scott Bauer <scott.bauer@intel.com> Signed-off-by:
Matias Bjørling <matias@cnexlabs.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Javier González authored
Free the reverse mapping table correctly on target tear down Signed-off-by:
Javier González <javier@cnexlabs.com> Signed-off-by:
Matias Bjørling <matias@cnexlabs.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Javier González authored
Until now erases have been submitted as synchronous commands through a dedicated erase function. In order to enable targets implementing asynchronous erases, refactor the erase path so that it uses the normal async I/O submission functions. If a target requires sync I/O, it can implement it internally. Also, adapt rrpc to use the new erase path. Signed-off-by:
Javier González <javier@cnexlabs.com> Fixed spelling error. Signed-off-by:
Matias Bjørling <matias@cnexlabs.com> Signed-off-by:
Matias Bjørling <matias@cnexlabs.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Scott Bauer authored
There are two closely named structs in lightnvm: struct nvme_nvm_addr_format and struct nvme_addr_format. The first struct has 4 reserved bytes at the end, the second does not. (gdb) p sizeof(struct nvme_nvm_addr_format) $1 = 16 (gdb) p sizeof(struct nvm_addr_format) $2 = 12 In the nvme_nvm_identify function we memcpy from the larger struct to the smaller struct. We incorrectly pass the length of the larger struct and overflow by 4 bytes, lets not do that. Signed-off-by:
Scott Bauer <scott.bauer@intel.com> Signed-off-by:
Matias Bjørling <matias@cnexlabs.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Christophe JAILLET authored
According to error handling in this function, it is likely that going to 'out' was expected here. Signed-off-by:
Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by:
Matias Bjørling <matias@cnexlabs.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
- Apr 14, 2017
-
-
Christoph Hellwig authored
This drivers was added in 2008, but as far as a I can tell we never had a single platform that actually registered resources for the platform driver. It's also been unmaintained for a long time and apparently has a ATA mode that can be driven using the IDE/libata subsystem. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Hannes Reinecke <hare@suse.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
- Apr 08, 2017
-
-
Martin K. Petersen authored
Separating discards and zeroout operations allows us to remove the LBPRZ block zeroing constraints from discards and honor the device preferences for UNMAP commands. If supported by the device, we'll also choose UNMAP over one of the WRITE SAME variants for discards. Signed-off-by:
Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Hannes Reinecke <hare@suse.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Martin K. Petersen authored
Now that zeroout and discards are distinct operations we need to separate the policy of choosing the appropriate command. Create a zeroing_mode which can be one of: write: Zeroout assist not present, use regular WRITE writesame: Allow WRITE SAME(10/16) with a zeroed payload writesame_16_unmap: Allow WRITE SAME(16) with UNMAP writesame_10_unmap: Allow WRITE SAME(10) with UNMAP The last two are conditional on the device being thin provisioned with LBPRZ=1 and LBPWS=1 or LBPWS10=1 respectively. Whether to set the UNMAP bit or not depends on the REQ_NOUNMAP flag. And if none of the _unmap variants are supported, regular WRITE SAME will be used if the device supports it. The zeroout_mode is exported in sysfs and the detected mode for a given device can be overridden using the string constants above. With this change in place we can now issue WRITE SAME(16) with UNMAP set for block zeroing applications that require hard guarantees and logical_block_size granularity. And at the same time use the UNMAP command with the device's preferred granulary and alignment for discard operations. Signed-off-by:
Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Hannes Reinecke <hare@suse.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can kill this hack. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by:
Hannes Reinecke <hare@suse.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
It seems like DRBD assumes its on the wire TRIM request always zeroes data. Use that fact to implement REQ_OP_WRITE_ZEROES. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Hannes Reinecke <hare@suse.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
drbd always wants its discard wire operations to zero the blocks, so use blkdev_issue_zeroout with the BLKDEV_ZERO_UNMAP flag instead of reinventing it poorly. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Hannes Reinecke <hare@suse.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
mmc only supports discarding on large alignments, so the zeroing code would always fall back to explicit writings of zeroes. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Hannes Reinecke <hare@suse.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
rsxx only supports discarding on large alignments, so the zeroing code would always fall back to explicit writings of zeroes. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Hannes Reinecke <hare@suse.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
rbd only supports discarding on large alignments, so the zeroing code would always fall back to explicit writings of zeroes. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Hannes Reinecke <hare@suse.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
It's just a in-driver reimplementation of writing zeroes to the pages, which fails if the discards aren't page aligned. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Hannes Reinecke <hare@suse.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
It's identical to discard as hole punches will always leave us with zeroes on reads. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Hannes Reinecke <hare@suse.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
Just the same as discard if the block size equals the system page size. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Hannes Reinecke <hare@suse.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
But now for the real NVMe Write Zeroes yet, just to get rid of the discard abuse for zeroing. Also rename the quirk flag to be a bit more self-explanatory. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by:
Hannes Reinecke <hare@suse.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
Try to use a write same with unmap bit variant if the device supports it and the caller allows for it. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by:
Hannes Reinecke <hare@suse.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
Turn the existing discard flag into a new BLKDEV_ZERO_UNMAP flag with similar semantics, but without referring to diѕcard. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by:
Hannes Reinecke <hare@suse.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
It seems like the code currently passes whatever it was using for writes to WRITE SAME. Just switch it to WRITE ZEROES, although that doesn't need any payload. Untested, and confused by the code, maybe someone who understands it better than me can help.. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Hannes Reinecke <hare@suse.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
Copy & paste from the REQ_OP_WRITE_SAME code. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Hannes Reinecke <hare@suse.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
Fix up do_region to not allocate a bio_vec for discards. We've got rid of the discard payload allocated by the caller years ago. Obviously this wasn't actually harmful given how long it's been there, but it's still good to avoid the pointless allocation. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Hannes Reinecke <hare@suse.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
Copy & paste from the REQ_OP_WRITE_SAME code. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Hannes Reinecke <hare@suse.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by:
Hannes Reinecke <hare@suse.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
Split sd_setup_discard_cmnd into one function per provisioning type. While this creates some very slight duplication of boilerplate code it keeps the code modular for additions of new provisioning types, and for reusing the write same functions for the upcoming scsi implementation of the Write Zeroes operation. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by:
Hannes Reinecke <hare@suse.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
- Apr 07, 2017
-
-
Bart Van Assche authored
While running the srp-test software I noticed that request processing stalls sporadically at the beginning of a test, namely when mkfs is run against a dm-mpath device. Every time when that happened the following command was sufficient to resume request processing: echo run >/sys/kernel/debug/block/dm-0/state This patch avoids that such request processing stalls occur. The test I ran is as follows: while srp-test/run_tests -d -r 30 -t 02-mq; do :; done Signed-off-by:
Bart Van Assche <bart.vanassche@sandisk.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Bart Van Assche authored
If a .queue_rq() function returns BLK_MQ_RQ_QUEUE_BUSY then the block driver that implements that function is responsible for rerunning the hardware queue once requests can be queued again successfully. commit 52d7f1b5 ("blk-mq: Avoid that requeueing starts stopped queues") removed the blk_mq_stop_hw_queue() call from scsi_queue_rq() for the BLK_MQ_RQ_QUEUE_BUSY case. Hence change all calls to functions that are intended to rerun a busy queue such that these examine all hardware queues instead of only stopped queues. Since no other functions than scsi_internal_device_block() and scsi_internal_device_unblock() should ever stop or restart a SCSI queue, change the blk_mq_delay_queue() call into a blk_mq_delay_run_hw_queue() call. Fixes: commit 52d7f1b5 ("blk-mq: Avoid that requeueing starts stopped queues") Fixes: commit 7e79dadc ("blk-mq: stop hardware queue in blk_mq_delay_queue()") Signed-off-by:
Bart Van Assche <bart.vanassche@sandisk.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Long Li <longli@microsoft.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-