- Nov 25, 2015
-
-
Jens Axboe authored
This reverts commit 1b2ff19e. Jan writes: -- Thanks for report! After some investigation I found out we allocate elevator specific data in __get_request() only for non-flush requests. And this is actually required since the flush machinery uses the space in struct request for something else. Doh. So my patch is just wrong and not easy to fix since at the time __get_request() is called we are not sure whether the flush machinery will be used in the end. Jens, please revert 1b2ff19e. Thanks! I'm somewhat surprised that you can reliably hit the race where flushing gets disabled for the device just while the request is in flight. But I guess during boot it makes some sense. -- So let's just revert it, we can fix the queue run manually after the fact. This race is rare enough that it didn't trigger in testing, it requires the specific disable-while-in-flight scenario to trigger.
-
- Nov 24, 2015
-
-
Christoph Hellwig authored
We only added the request to the request list for the !blk-mq case, so we should only delete it in that case as well. Signed-off-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Ming Lei authored
We had seen lots of reports of this kind issue, so add one warnning in blk-merge, then it can be triggered easily and avoid to depend on warning/bug from drivers. Signed-off-by:
Ming Lei <ming.lei@canonical.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Ming Lei authored
Commit bdced438(block: setup bi_phys_segments after splitting) introduces function of computing bio->bi_phys_segments during bio splitting. Unfortunately both bio->bi_seg_front_size and bio->bi_seg_back_size arn't computed, so too many physical segments may be obtained for one request since both the two are used to check if one segment across two bios can be possible. This patch fixes the issue by computing the two variables in blk_bio_segment_split(). Fixes: bdced438(block: setup bi_phys_segments after splitting) Reported-by:
Michael Ellerman <mpe@ellerman.id.au> Reported-by:
Mark Salter <msalter@redhat.com> Tested-by:
Laurent Dufour <ldufour@linux.vnet.ibm.com> Tested-by:
Mark Salter <msalter@redhat.com> Signed-off-by:
Ming Lei <ming.lei@canonical.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Ming Lei authored
Inside blk_bio_segment_split(), previous bvec pointer(bvprvp) always points to the iterator local variable, which is obviously wrong, so fix it by pointing to the local variable of 'bvprv'. Fixes: 5014c311(block: fix bogus compiler warnings in blk-merge.c) Cc: stable@kernel.org #4.3 Reported-by:
Michael Ellerman <mpe@ellerman.id.au> Reported-by:
Mark Salter <msalter@redhat.com> Tested-by:
Laurent Dufour <ldufour@linux.vnet.ibm.com> Tested-by:
Mark Salter <msalter@redhat.com> Signed-off-by:
Ming Lei <ming.lei@canonical.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
- Nov 21, 2015
-
-
Jens Axboe authored
Liu reported that running certain parts of xfstests threw the following error: BUG: sleeping function called from invalid context at mm/page_alloc.c:3190 in_atomic(): 1, irqs_disabled(): 0, pid: 6, name: kworker/u16:0 3 locks held by kworker/u16:0/6: #0: ("writeback"){++++.+}, at: [<ffffffff8107f083>] process_one_work+0x173/0x730 #1: ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff8107f083>] process_one_work+0x173/0x730 #2: (&type->s_umount_key#44){+++++.}, at: [<ffffffff811e6805>] trylock_super+0x25/0x60 CPU: 5 PID: 6 Comm: kworker/u16:0 Tainted: G OE 4.3.0+ #3 Hardware name: Red Hat KVM, BIOS Bochs 01/01/2011 Workqueue: writeback wb_workfn (flush-btrfs-108) ffffffff81a3abab ffff88042e282ba8 ffffffff8130191b ffffffff81a3abab 0000000000000c76 ffff88042e282ba8 ffff88042e27c180 ffff88042e282bd8 ffffffff8108ed95 ffff880400000004 0000000000000000 0000000000000c76 Call Trace: [<ffffffff8130191b>] dump_stack+0x4f/0x74 [<ffffffff8108ed95>] ___might_sleep+0x185/0x240 [<ffffffff8108eea2>] __might_sleep+0x52/0x90 [<ffffffff811817e8>] __alloc_pages_nodemask+0x268/0x410 [<ffffffff8109a43c>] ? sched_clock_local+0x1c/0x90 [<ffffffff8109a6d1>] ? local_clock+0x21/0x40 [<ffffffff810b9eb0>] ? __lock_release+0x420/0x510 [<ffffffff810b534c>] ? __lock_acquired+0x16c/0x3c0 [<ffffffff811ca265>] alloc_pages_current+0xc5/0x210 [<ffffffffa0577105>] ? rbio_is_full+0x55/0x70 [btrfs] [<ffffffff810b7ed8>] ? mark_held_locks+0x78/0xa0 [<ffffffff81666d50>] ? _raw_spin_unlock_irqrestore+0x40/0x60 [<ffffffffa0578c0a>] full_stripe_write+0x5a/0xc0 [btrfs] [<ffffffffa0578ca9>] __raid56_parity_write+0x39/0x60 [btrfs] [<ffffffffa0578deb>] run_plug+0x11b/0x140 [btrfs] [<ffffffffa0578e33>] btrfs_raid_unplug+0x23/0x70 [btrfs] [<ffffffff812d36c2>] blk_flush_plug_list+0x82/0x1f0 [<ffffffff812e0349>] blk_sq_make_request+0x1f9/0x740 [<ffffffff812ceba2>] ? generic_make_request_checks+0x222/0x7c0 [<ffffffff812cf264>] ? blk_queue_enter+0x124/0x310 [<ffffffff812cf1d2>] ? blk_queue_enter+0x92/0x310 [<ffffffff812d0ae2>] generic_make_request+0x172/0x2c0 [<ffffffff812d0ad4>] ? generic_make_request+0x164/0x2c0 [<ffffffff812d0ca0>] submit_bio+0x70/0x140 [<ffffffffa0577b29>] ? rbio_add_io_page+0x99/0x150 [btrfs] [<ffffffffa0578a89>] finish_rmw+0x4d9/0x600 [btrfs] [<ffffffffa0578c4c>] full_stripe_write+0x9c/0xc0 [btrfs] [<ffffffffa057ab7f>] raid56_parity_write+0xef/0x160 [btrfs] [<ffffffffa052bd83>] btrfs_map_bio+0xe3/0x2d0 [btrfs] [<ffffffffa04fbd6d>] btrfs_submit_bio_hook+0x8d/0x1d0 [btrfs] [<ffffffffa05173c4>] submit_one_bio+0x74/0xb0 [btrfs] [<ffffffffa0517f55>] submit_extent_page+0xe5/0x1c0 [btrfs] [<ffffffffa0519b18>] __extent_writepage_io+0x408/0x4c0 [btrfs] [<ffffffffa05179c0>] ? alloc_dummy_extent_buffer+0x140/0x140 [btrfs] [<ffffffffa051dc88>] __extent_writepage+0x218/0x3a0 [btrfs] [<ffffffff810b7ed8>] ? mark_held_locks+0x78/0xa0 [<ffffffffa051e2c9>] extent_write_cache_pages.clone.0+0x2f9/0x400 [btrfs] [<ffffffffa051e422>] extent_writepages+0x52/0x70 [btrfs] [<ffffffffa05001f0>] ? btrfs_set_inode_index+0x70/0x70 [btrfs] [<ffffffffa04fcc17>] btrfs_writepages+0x27/0x30 [btrfs] [<ffffffff81184df3>] do_writepages+0x23/0x40 [<ffffffff81212229>] __writeback_single_inode+0x89/0x4d0 [<ffffffff81212a60>] ? writeback_sb_inodes+0x260/0x480 [<ffffffff81212a60>] ? writeback_sb_inodes+0x260/0x480 [<ffffffff8121295f>] ? writeback_sb_inodes+0x15f/0x480 [<ffffffff81212ad2>] writeback_sb_inodes+0x2d2/0x480 [<ffffffff810b1397>] ? down_read_trylock+0x57/0x60 [<ffffffff811e6805>] ? trylock_super+0x25/0x60 [<ffffffff810d629f>] ? rcu_read_lock_sched_held+0x4f/0x90 [<ffffffff81212d0c>] __writeback_inodes_wb+0x8c/0xc0 [<ffffffff812130b5>] wb_writeback+0x2b5/0x500 [<ffffffff810b7ed8>] ? mark_held_locks+0x78/0xa0 [<ffffffff810660a8>] ? __local_bh_enable_ip+0x68/0xc0 [<ffffffff81213362>] ? wb_do_writeback+0x62/0x310 [<ffffffff812133c1>] wb_do_writeback+0xc1/0x310 [<ffffffff8107c3d9>] ? set_worker_desc+0x79/0x90 [<ffffffff81213842>] wb_workfn+0x92/0x330 [<ffffffff8107f133>] process_one_work+0x223/0x730 [<ffffffff8107f083>] ? process_one_work+0x173/0x730 [<ffffffff8108035f>] ? worker_thread+0x18f/0x430 [<ffffffff810802ed>] worker_thread+0x11d/0x430 [<ffffffff810801d0>] ? maybe_create_worker+0xf0/0xf0 [<ffffffff810801d0>] ? maybe_create_worker+0xf0/0xf0 [<ffffffff810858df>] kthread+0xef/0x110 [<ffffffff8108f74e>] ? schedule_tail+0x1e/0xd0 [<ffffffff810857f0>] ? __init_kthread_worker+0x70/0x70 [<ffffffff816673bf>] ret_from_fork+0x3f/0x70 [<ffffffff810857f0>] ? __init_kthread_worker+0x70/0x70 The issue is that we've got the software context pinned while calling blk_flush_plug_list(), which flushes callbacks that are allowed to sleep. btrfs and raid has such callbacks. Flip the checks around a bit, so we can enable preempt a bit earlier and flush plugs without having preempt disabled. This only affects blk-mq driven devices, and only those that register a single queue. Reported-by:
Liu Bo <bo.li.liu@oracle.com> Tested-by:
Liu Bo <bo.li.liu@oracle.com> Cc: stable@kernel.org Signed-off-by:
Jens Axboe <axboe@fb.com>
-
- Nov 20, 2015
-
-
Kees Cook authored
If md->signature == MAC_DRIVER_MAGIC and md->block_size == 1023, a single 512 byte sector would be read (secsize / 512). However the partition structure would be located past the end of the buffer (secsize % 512). Signed-off-by:
Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Signed-off-by:
Jens Axboe <axboe@fb.com>
-
- Nov 19, 2015
-
-
Dan Williams authored
Fix use after free crashes like the following: general protection fault: 0000 [#1] SMP Call Trace: [<ffffffffa0050216>] ? pmem_do_bvec.isra.12+0xa6/0xf0 [nd_pmem] [<ffffffffa0050ba2>] pmem_rw_page+0x42/0x80 [nd_pmem] [<ffffffff8128fd90>] bdev_read_page+0x50/0x60 [<ffffffff812972f0>] do_mpage_readpage+0x510/0x770 [<ffffffff8128fd20>] ? I_BDEV+0x20/0x20 [<ffffffff811d86dc>] ? lru_cache_add+0x1c/0x50 [<ffffffff81297657>] mpage_readpages+0x107/0x170 [<ffffffff8128fd20>] ? I_BDEV+0x20/0x20 [<ffffffff8128fd20>] ? I_BDEV+0x20/0x20 [<ffffffff8129058d>] blkdev_readpages+0x1d/0x20 [<ffffffff811d615f>] __do_page_cache_readahead+0x28f/0x310 [<ffffffff811d6039>] ? __do_page_cache_readahead+0x169/0x310 [<ffffffff811c5abd>] ? pagecache_get_page+0x2d/0x1d0 [<ffffffff811c76f6>] filemap_fault+0x396/0x530 [<ffffffff811f816e>] __do_fault+0x4e/0xf0 [<ffffffff811fce7d>] handle_mm_fault+0x11bd/0x1b50 Cc: <stable@vger.kernel.org> Cc: Jens Axboe <axboe@fb.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Reported-by:
kbuild test robot <lkp@intel.com> Acked-by:
Matthew Wilcox <willy@linux.intel.com> [willy: symmetry fixups] Signed-off-by:
Dan Williams <dan.j.williams@intel.com>
-
- Nov 16, 2015
-
-
Jan Kara authored
Currently blk_insert_flush() just adds flush request to q->queue_head when flush is not required. That completely bypasses IO scheduler so e.g. CFQ can be idling waiting for new request to arrive and will idle through the whole window unnecessarily. Luckily this only happens in rare cases as usually checks in generic_make_request_checks() clear FLUSH and FUA flags early if they are not needed. When no flushing is actually required, we can easily fix the problem by properly queueing the request through the IO scheduler. Ideally IO scheduler should be also made aware of requests queued via blk_flush_queue_rq(). However inserting flush request through IO scheduler can have unwanted side-effects since due to flush batching delaying the flush request in IO scheduler will delay all flush requests possibly coming from other processes. So we keep adding the request directly to q->queue_head. Signed-off-by:
Jan Kara <jack@suse.com> Reviewed-by:
Jeff Moyer <jmoyer@redhat.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Geliang Tang authored
To make the intention clearer, use list_{first,prev,next}_entry instead of list_entry. Signed-off-by:
Geliang Tang <geliangtang@163.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
- Nov 11, 2015
-
-
Randy Dunlap authored
Fix kernel-doc warning in blk-core.c: Warning(..//block/blk-core.c:1549): No description found for parameter 'same_queue_rq' Signed-off-by:
Randy Dunlap <rdunlap@infradead.org> Reviewed-by:
Jeff Moyer <jmoyer@redhat.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Jens Axboe authored
It's no longer used outside of blk-mq core. Signed-off-by:
Jens Axboe <axboe@fb.com>
-
- Nov 07, 2015
-
-
Jens Axboe authored
Add basic support for polling for specific IO to complete. This uses the cookie that blk-mq passes back, which enables the block layer to pass this cookie to the driver to spin for a specific request. This will be combined with request latency tracking, so we can make qualified decisions about when to poll and when not to. For now, for benchmark purposes, we add a sysfs file that controls whether polling is enabled or not. Signed-off-by:
Jens Axboe <axboe@fb.com> Acked-by:
Christoph Hellwig <hch@lst.de> Acked-by:
Keith Busch <keith.busch@intel.com>
-
Jens Axboe authored
Return a cookie, blk_qc_t, from the blk-mq make request functions, that allows a later caller to uniquely identify a specific IO. The cookie doesn't mean anything to the caller, but the caller can use it to later pass back to the block layer. The block layer can then identify the hardware queue and request from that cookie. Signed-off-by:
Jens Axboe <axboe@fb.com> Acked-by:
Christoph Hellwig <hch@lst.de> Acked-by:
Keith Busch <keith.busch@intel.com>
-
Jens Axboe authored
No functional changes in this patch, but it prepares us for returning a more useful cookie related to the IO that was queued up. Signed-off-by:
Jens Axboe <axboe@fb.com> Acked-by:
Christoph Hellwig <hch@lst.de> Acked-by:
Keith Busch <keith.busch@intel.com>
-
Ben Segall authored
setpriority(PRIO_USER, 0, x) will change the priority of tasks outside of the current pid namespace. This is in contrast to both the other modes of setpriority and the example of kill(-1). Fix this. getpriority and ioprio have the same failure mode, fix them too. Eric said: : After some more thinking about it this patch sounds justifiable. : : My goal with namespaces is not to build perfect isolation mechanisms : as that can get into ill defined territory, but to build well defined : mechanisms. And to handle the corner cases so you can use only : a single namespace with well defined results. : : In this case you have found the two interfaces I am aware of that : identify processes by uid instead of by pid. Which quite frankly is : weird. Unfortunately the weird unexpected cases are hard to handle : in the usual way. : : I was hoping for a little more information. Changes like this one we : have to be careful of because someone might be depending on the current : behavior. I don't think they are and I do think this make sense as part : of the pid namespace. Signed-off-by:
Ben Segall <bsegall@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ambrose Feinstein <ambrose@google.com> Acked-by:
"Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Mel Gorman authored
__GFP_WAIT was used to signal that the caller was in atomic context and could not sleep. Now it is possible to distinguish between true atomic context and callers that are not willing to sleep. The latter should clear __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing __GFP_WAIT behaves differently, there is a risk that people will clear the wrong flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly indicate what it does -- setting it allows all reclaim activity, clearing them prevents it. [akpm@linux-foundation.org: fix build] [akpm@linux-foundation.org: coding-style fixes] Signed-off-by:
Mel Gorman <mgorman@techsingularity.net> Acked-by:
Michal Hocko <mhocko@suse.com> Acked-by:
Vlastimil Babka <vbabka@suse.cz> Acked-by:
Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Acked-by:
David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Mel Gorman authored
mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by:
Mel Gorman <mgorman@techsingularity.net> Acked-by:
Vlastimil Babka <vbabka@suse.cz> Acked-by:
Michal Hocko <mhocko@suse.com> Acked-by:
Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- Nov 03, 2015
-
-
Jeff Moyer authored
Hi, Zhangqing Luo reported long boot times on a system with thousands of LUNs when scsi-mq was enabled. He narrowed the problem down to blk_mq_add_queue_tag_set, where every queue is frozen in order to set the BLK_MQ_F_TAG_SHARED flag. Each added device will freeze all queues added before it in sequence, which involves waiting for an RCU grace period for each one. We don't need to do this. After the second queue is added, only new queues need to be initialized with the shared tag. We can do that by percolating the flag up to the blk_mq_tag_set, and updating the newly added queue's hctxs if the flag is set. This problem was introduced by commit 0d2602ca (blk-mq: improve support for shared tags maps). Reported-and-tested-by:
Jason Luo <zhangqing.luo@oracle.com> Reviewed-by:
Ming Lei <ming.lei@canonical.com> Signed-off-by:
Jeff Moyer <jmoyer@redhat.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
- Oct 28, 2015
-
-
Ming Lin authored
In commit b49a0871("block: remove split code in blkdev_issue_{discard,write_same}"), discard_granularity and alignment checks were removed. Ideally, with bio late splitting, the upper layers shouldn't need to depend on device's limits. Christoph reported a discard regression on the HGST Ultrastar SN100 NVMe device when mkfs.xfs. We have not found the root cause yet. This patch re-adds discard_granularity and alignment checks by reverting the related changes in commit b49a0871. The good thing is now we can remove the 2G discard size cap and just use UINT_MAX to avoid bi_size overflow. Reviewed-by:
Christoph Hellwig <hch@lst.de> Tested-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Ming Lin <ming.l@ssi.samsung.com> Reviewed-by:
Mike Snitzer <snitzer@redhat.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
- Oct 22, 2015
-
-
Tejun Heo authored
The stat files on the root cgroup shows stats for the whole system and usually don't contain any information which isn't available through the usual system monitoring mechanisms. Some controllers skip collecting these duplicate stats to optimize cases where cgroup isn't used and later try to emulate the result on demand. This leads to complexities and subtle differences in the information shown through different channels. This is entirely unnecessary and cgroup v2 is dropping stat files which are duplicate from all controllers. This patch removes "io.stat" from the root hierarchy. Signed-off-by:
Tejun Heo <tj@kernel.org> Acked-by:
Jens Axboe <axboe@kernel.dk> Cc: Vivek Goyal <vgoyal@redhat.com>
-
- Oct 21, 2015
-
-
Ming Lei authored
Most of times, flush plug should be the hottest I/O path, so mark ctx as pending after all requests in the list are inserted. Reviewed-by:
Jeff Moyer <jmoyer@redhat.com> Signed-off-by:
Ming Lei <ming.lei@canonical.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Ming Lei authored
The trace point is for tracing plug event of each request queue instead of each task, so we should check the request count in the plug list from current queue instead of current task. Signed-off-by:
Ming Lei <ming.lei@canonical.com> Reviewed-by:
Jeff Moyer <jmoyer@redhat.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Ming Lei authored
After bio splitting is introduced, one bio can be splitted and it is marked as NOMERGE because it is too fat to be merged, so check bio_mergeable() earlier to avoid to try to merge it unnecessarily. Signed-off-by:
Ming Lei <ming.lei@canonical.com> Reviewed-by:
Jeff Moyer <jmoyer@redhat.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Ming Lei authored
It isn't necessary to try to merge the bio which is marked as NOMERGE. Reviewed-by:
Jeff Moyer <jmoyer@redhat.com> Signed-off-by:
Ming Lei <ming.lei@canonical.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Ming Lei authored
The splitted bio has been already too fat to merge, so mark it as NOMERGE. Reviewed-by:
Jeff Moyer <jmoyer@redhat.com> Signed-off-by:
Ming Lei <ming.lei@canonical.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Ming Lei authored
The number of bio->bi_phys_segments is always obtained during bio splitting, so it is natural to setup it just after bio splitting, then we can avoid to compute nr_segment again during merge. Reviewed-by:
Jeff Moyer <jmoyer@redhat.com> Signed-off-by:
Ming Lei <ming.lei@canonical.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Jeff Moyer authored
Request queues with merging disabled will not flush the plug list after BLK_MAX_REQUEST_COUNT requests have been queued, since the code relies on blk_attempt_plug_merge to compute the request_count. Fix this by computing the number of queued requests even for nomerge queues. Signed-off-by:
Jeff Moyer <jmoyer@redhat.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
This commits adds a driver API and ioctls for controlling Persistent Reservations s/genericly/generically/ at the block layer. Persistent Reservations are supported by SCSI and NVMe and allow controlling who gets access to a device in a shared storage setup. Note that we add a pr_ops structure to struct block_device_operations instead of adding the members directly to avoid bloating all instances of devices that will never support Persistent Reservations. Signed-off-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Christoph Hellwig authored
Split out helpers for all non-trivial ioctls to make this function simpler, and also start passing around a pointer version of the argument, as that's what most ioctl handlers actually need. Signed-off-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Dan Williams authored
The libnvidmm-btt and nvme drivers use blk_integrity to reserve space for per-sector metadata, but sometimes without protection checksums. This property is generically useful, so teach the block core to internally specify a nop profile if one is not provided at registration time. Cc: Keith Busch <keith.busch@intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Suggested-by:
Christoph Hellwig <hch@lst.de> [hch: kill the local nvme nop profile as well] Acked-by:
Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by:
Dan Williams <dan.j.williams@intel.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Dan Williams authored
Since they lack requests to pin the request_queue active, synchronous bio-based drivers may have in-flight integrity work from bio_integrity_endio() that is not flushed by blk_freeze_queue(). Flush that work to prevent races to free the queue and the final usage of the blk_integrity profile. This is temporary unless/until bio-based drivers start to generically take a q_usage_counter reference while a bio is in-flight. Cc: Martin K. Petersen <martin.petersen@oracle.com> [martin: fix the CONFIG_BLK_DEV_INTEGRITY=n case] Tested-by:
Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by:
Dan Williams <dan.j.williams@intel.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Dan Williams authored
A trace like the following proceeds a crash in bio_integrity_process() when it goes to use an already freed blk_integrity profile. BUG: unable to handle kernel paging request at ffff8800d31b10d8 IP: [<ffff8800d31b10d8>] 0xffff8800d31b10d8 PGD 2f65067 PUD 21fffd067 PMD 80000000d30001e3 Oops: 0011 [#1] SMP Dumping ftrace buffer: --------------------------------- ndctl-2222 2.... 44526245us : disk_release: pmem1s systemd--2223 4.... 44573945us : bio_integrity_endio: pmem1s <...>-409 4.... 44574005us : bio_integrity_process: pmem1s --------------------------------- [..] Call Trace: [<ffffffff8144e0f9>] ? bio_integrity_process+0x159/0x2d0 [<ffffffff8144e4f6>] bio_integrity_verify_fn+0x36/0x60 [<ffffffff810bd2dc>] process_one_work+0x1cc/0x4e0 Given that a request_queue is pinned while i/o is in flight and that a gendisk is allowed to have a shorter lifetime, move blk_integrity to request_queue to satisfy requests arriving after the gendisk has been torn down. Cc: Christoph Hellwig <hch@lst.de> Cc: Martin K. Petersen <martin.petersen@oracle.com> [martin: fix the CONFIG_BLK_DEV_INTEGRITY=n case] Tested-by:
Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by:
Dan Williams <dan.j.williams@intel.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Dan Williams authored
Allow pmem, and other synchronous/bio-based block drivers, to fallback on a per-cpu reference count managed by the core for tracking queue live/dead state. The existing per-cpu reference count for the blk_mq case is promoted to be used in all block i/o scenarios. This involves initializing it by default, waiting for it to drop to zero at exit, and holding a live reference over the invocation of q->make_request_fn() in generic_make_request(). The blk_mq code continues to take its own reference per blk_mq request and retains the ability to freeze the queue, but the check that the queue is frozen is moved to generic_make_request(). This fixes crash signatures like the following: BUG: unable to handle kernel paging request at ffff880140000000 [..] Call Trace: [<ffffffff8145e8bf>] ? copy_user_handle_tail+0x5f/0x70 [<ffffffffa004e1e0>] pmem_do_bvec.isra.11+0x70/0xf0 [nd_pmem] [<ffffffffa004e331>] pmem_make_request+0xd1/0x200 [nd_pmem] [<ffffffff811c3162>] ? mempool_alloc+0x72/0x1a0 [<ffffffff8141f8b6>] generic_make_request+0xd6/0x110 [<ffffffff8141f966>] submit_bio+0x76/0x170 [<ffffffff81286dff>] submit_bh_wbc+0x12f/0x160 [<ffffffff81286e62>] submit_bh+0x12/0x20 [<ffffffff813395bd>] jbd2_write_superblock+0x8d/0x170 [<ffffffff8133974d>] jbd2_mark_journal_empty+0x5d/0x90 [<ffffffff813399cb>] jbd2_journal_destroy+0x24b/0x270 [<ffffffff810bc4ca>] ? put_pwq_unlocked+0x2a/0x30 [<ffffffff810bc6f5>] ? destroy_workqueue+0x225/0x250 [<ffffffff81303494>] ext4_put_super+0x64/0x360 [<ffffffff8124ab1a>] generic_shutdown_super+0x6a/0xf0 Cc: Jens Axboe <axboe@kernel.dk> Cc: Keith Busch <keith.busch@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Suggested-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Christoph Hellwig <hch@lst.de> Tested-by:
Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by:
Dan Williams <dan.j.williams@intel.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Martin K. Petersen authored
Up until now the_integrity profile has been dynamically allocated and attached to struct gendisk after the disk has been made active. This causes problems because NVMe devices need to register the profile prior to the partition table being read due to a mandatory metadata buffer requirement. In addition, DM goes through hoops to deal with preallocating, but not initializing integrity profiles. Since the integrity profile is small (4 bytes + a pointer), Christoph suggested moving it to struct gendisk proper. This requires several changes: - Moving the blk_integrity definition to genhd.h. - Inlining blk_integrity in struct gendisk. - Removing the dynamic allocation code. - Adding helper functions which allow gendisk to set up and tear down the integrity sysfs dir when a disk is added/deleted. - Adding a blk_integrity_revalidate() callback for updating the stable pages bdi setting. - The calls that depend on whether a device has an integrity profile or not now key off of the bi->profile pointer. - Simplifying the integrity support routines in DM (Mike Snitzer). Signed-off-by:
Martin K. Petersen <martin.petersen@oracle.com> Reported-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Sagi Grimberg <sagig@mellanox.com> Signed-off-by:
Mike Snitzer <snitzer@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by:
Dan Williams <dan.j.williams@intel.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Martin K. Petersen authored
The size of the data interval was not exported in the sysfs integrity directory. Export it so that userland apps can tell whether the interval is different from the device's logical block size. Signed-off-by:
Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by:
Sagi Grimberg <sagig@mellanox.com> Signed-off-by:
Dan Williams <dan.j.williams@intel.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Martin K. Petersen authored
The per-device properties in the blk_integrity structure were previously unsigned short. However, most of the values fit inside a char. The only exception is the data interval size and we can work around that by storing it as a power of two. This cuts the size of the dynamic portion of blk_integrity in half. Signed-off-by:
Martin K. Petersen <martin.petersen@oracle.com> Reported-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Sagi Grimberg <sagig@mellanox.com> Signed-off-by:
Dan Williams <dan.j.williams@intel.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Martin K. Petersen authored
We previously made a complete copy of a device's data integrity profile even though several of the fields inside the blk_integrity struct are pointers to fixed template entries in t10-pi.c. Split the static and per-device portions so that we can reference the template directly. Signed-off-by:
Martin K. Petersen <martin.petersen@oracle.com> Reported-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Sagi Grimberg <sagig@mellanox.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by:
Dan Williams <dan.j.williams@intel.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
Martin K. Petersen authored
The integrity kobject purely exists to support the integrity subdirectory in sysfs and doesn't really have anything to do with the blk_integrity data structure. Move the kobject to struct gendisk where it belongs. Signed-off-by:
Martin K. Petersen <martin.petersen@oracle.com> Reported-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Sagi Grimberg <sagig@mellanox.com> Signed-off-by:
Dan Williams <dan.j.williams@intel.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-
- Oct 15, 2015
-
-
Tejun Heo authored
bdi's are initialized in two steps, bdi_init() and bdi_register(), but destroyed in a single step by bdi_destroy() which, for a bdi embedded in a request_queue, is called during blk_cleanup_queue() which makes the queue invisible and starts the draining of remaining usages. A request_queue's user can access the congestion state of the embedded bdi as long as it holds a reference to the queue. As such, it may access the congested state of a queue which finished blk_cleanup_queue() but hasn't reached blk_release_queue() yet. Because the congested state was embedded in backing_dev_info which in turn is embedded in request_queue, accessing the congested state after bdi_destroy() was called was fine. The bdi was destroyed but the memory region for the congested state remained accessible till the queue got released. a13f35e8 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback") changed the situation. Now, the root congested state which is expected to be pinned while request_queue remains accessible is separately reference counted and the base ref is put during bdi_destroy(). This means that the root congested state may go away prematurely while the queue is between bdi_dstroy() and blk_cleanup_queue(), which was detected by Andrey's KASAN tests. The root cause of this problem is that bdi doesn't distinguish the two steps of destruction, unregistration and release, and now the root congested state actually requires a separate release step. To fix the issue, this patch separates out bdi_unregister() and bdi_exit() from bdi_destroy(). bdi_unregister() is called from blk_cleanup_queue() and bdi_exit() from blk_release_queue(). bdi_destroy() is now just a simple wrapper calling the two steps back-to-back. While at it, the prototype of bdi_destroy() is moved right below bdi_setup_and_register() so that the counterpart operations are located together. Signed-off-by:
Tejun Heo <tj@kernel.org> Fixes: a13f35e8 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback") Cc: stable@vger.kernel.org # v4.2+ Reported-and-tested-by:
Andrey Konovalov <andreyknvl@google.com> Link: http://lkml.kernel.org/g/CAAeHK+zUJ74Zn17=rOyxacHU18SgCfC6bsYW=6kCY5GXJBwGfQ@mail.gmail.com Reviewed-by:
Jan Kara <jack@suse.com> Reviewed-by:
Jeff Moyer <jmoyer@redhat.com> Signed-off-by:
Jens Axboe <axboe@fb.com>
-