Skip to content
  1. Feb 05, 2010
  2. Feb 02, 2010
    • Vivek Goyal's avatar
      cfq-iosched: Do not idle on async queues · 1efe8fe1
      Vivek Goyal authored
      
      
      Few weeks back, Shaohua Li had posted similar patch. I am reposting it
      with more test results.
      
      This patch does two things.
      
      - Do not idle on async queues.
      
      - It also changes the write queue depth CFQ drives (cfq_may_dispatch()).
        Currently, we seem to driving queue depth of 1 always for WRITES. This is
        true even if there is only one write queue in the system and all the logic
        of infinite queue depth in case of single busy queue as well as slowly
        increasing queue depth based on last delayed sync request does not seem to
        be kicking in at all.
      
      This patch will allow deeper WRITE queue depths (subjected to the other
      WRITE queue depth contstraints like cfq_quantum and last delayed sync
      request).
      
      Shaohua Li had reported getting more out of his SSD. For me, I have got
      one Lun exported from an HP EVA and when pure buffered writes are on, I
      can get more out of the system. Following are test results of pure
      buffered writes (with end_fsync=1) with vanilla and patched kernel. These
      results are average of 3 sets of run with increasing number of threads.
      
      AVERAGE[bufwfs][vanilla]
      -------
      job       Set NR  ReadBW(KB/s)   MaxClat(us)    WriteBW(KB/s)  MaxClat(us)
      ---       --- --  ------------   -----------    -------------  -----------
      bufwfs    3   1   0              0              95349          474141
      bufwfs    3   2   0              0              100282         806926
      bufwfs    3   4   0              0              109989         2.7301e+06
      bufwfs    3   8   0              0              116642         3762231
      bufwfs    3   16  0              0              118230         6902970
      
      AVERAGE[bufwfs] [patched kernel]
      -------
      bufwfs    3   1   0              0              270722         404352
      bufwfs    3   2   0              0              206770         1.06552e+06
      bufwfs    3   4   0              0              195277         1.62283e+06
      bufwfs    3   8   0              0              260960         2.62979e+06
      bufwfs    3   16  0              0              299260         1.70731e+06
      
      I also ran buffered writes along with some sequential reads and some
      buffered reads going on in the system on a SATA disk because the potential
      risk could be that we should not be driving queue depth higher in presence
      of sync IO going to keep the max clat low.
      
      With some random and sequential reads going on in the system on one SATA
      disk I did not see any significant increase in max clat. So it looks like
      other WRITE queue depth control logic is doing its job. Here are the
      results.
      
      AVERAGE[brr, bsr, bufw together] [vanilla]
      -------
      job       Set NR  ReadBW(KB/s)   MaxClat(us)    WriteBW(KB/s)  MaxClat(us)
      ---       --- --  ------------   -----------    -------------  -----------
      brr       3   1   850            546345         0              0
      bsr       3   1   14650          729543         0              0
      bufw      3   1   0              0              23908          8274517
      
      brr       3   2   981.333        579395         0              0
      bsr       3   2   14149.7        1175689        0              0
      bufw      3   2   0              0              21921          1.28108e+07
      
      brr       3   4   898.333        1.75527e+06    0              0
      bsr       3   4   12230.7        1.40072e+06    0              0
      bufw      3   4   0              0              19722.3        2.4901e+07
      
      brr       3   8   900            3160594        0              0
      bsr       3   8   9282.33        1.91314e+06    0              0
      bufw      3   8   0              0              18789.3        23890622
      
      AVERAGE[brr, bsr, bufw mixed] [patched kernel]
      -------
      job       Set NR  ReadBW(KB/s)   MaxClat(us)    WriteBW(KB/s)  MaxClat(us)
      ---       --- --  ------------   -----------    -------------  -----------
      brr       3   1   837            417973         0              0
      bsr       3   1   14357.7        591275         0              0
      bufw      3   1   0              0              24869.7        8910662
      
      brr       3   2   1038.33        543434         0              0
      bsr       3   2   13351.3        1205858        0              0
      bufw      3   2   0              0              18626.3        13280370
      
      brr       3   4   913            1.86861e+06    0              0
      bsr       3   4   12652.3        1430974        0              0
      bufw      3   4   0              0              15343.3        2.81305e+07
      
      brr       3   8   890            2.92695e+06    0              0
      bsr       3   8   9635.33        1.90244e+06    0              0
      bufw      3   8   0              0              17200.3        24424392
      
      So looks like it might make sense to include this patch.
      
      Thanks
      Vivek
      
      Signed-off-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      1efe8fe1
  3. Feb 01, 2010
    • Gui Jianfeng's avatar
      blk-cgroup: Fix potential deadlock in blk-cgroup · bcf4dd43
      Gui Jianfeng authored
      
      
      I triggered a lockdep warning as following.
      
      =======================================================
      [ INFO: possible circular locking dependency detected ]
      2.6.33-rc2 #1
      -------------------------------------------------------
      test_io_control/7357 is trying to acquire lock:
       (blkio_list_lock){+.+...}, at: [<c053a990>] blkiocg_weight_write+0x82/0x9e
      
      but task is already holding lock:
       (&(&blkcg->lock)->rlock){......}, at: [<c053a949>] blkiocg_weight_write+0x3b/0x9e
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #2 (&(&blkcg->lock)->rlock){......}:
             [<c04583b7>] validate_chain+0x8bc/0xb9c
             [<c0458dba>] __lock_acquire+0x723/0x789
             [<c0458eb0>] lock_acquire+0x90/0xa7
             [<c0692b0a>] _raw_spin_lock_irqsave+0x27/0x5a
             [<c053a4e1>] blkiocg_add_blkio_group+0x1a/0x6d
             [<c053cac7>] cfq_get_queue+0x225/0x3de
             [<c053eec2>] cfq_set_request+0x217/0x42d
             [<c052c8a6>] elv_set_request+0x17/0x26
             [<c0532a0f>] get_request+0x203/0x2c5
             [<c0532ae9>] get_request_wait+0x18/0x10e
             [<c0533470>] __make_request+0x2ba/0x375
             [<c0531985>] generic_make_request+0x28d/0x30f
             [<c0532da7>] submit_bio+0x8a/0x8f
             [<c04d827a>] submit_bh+0xf0/0x10f
             [<c04d91d2>] ll_rw_block+0xc0/0xf9
             [<f86e9705>] ext3_find_entry+0x319/0x544 [ext3]
             [<f86eae58>] ext3_lookup+0x2c/0xb9 [ext3]
             [<c04c3e1b>] do_lookup+0xd3/0x172
             [<c04c56c8>] link_path_walk+0x5fb/0x95c
             [<c04c5a65>] path_walk+0x3c/0x81
             [<c04c5b63>] do_path_lookup+0x21/0x8a
             [<c04c66cc>] do_filp_open+0xf0/0x978
             [<c04c0c7e>] open_exec+0x1b/0xb7
             [<c04c1436>] do_execve+0xbb/0x266
             [<c04081a9>] sys_execve+0x24/0x4a
             [<c04028a2>] ptregs_execve+0x12/0x18
      
      -> #1 (&(&q->__queue_lock)->rlock){..-.-.}:
             [<c04583b7>] validate_chain+0x8bc/0xb9c
             [<c0458dba>] __lock_acquire+0x723/0x789
             [<c0458eb0>] lock_acquire+0x90/0xa7
             [<c0692b0a>] _raw_spin_lock_irqsave+0x27/0x5a
             [<c053dd2a>] cfq_unlink_blkio_group+0x17/0x41
             [<c053a6eb>] blkiocg_destroy+0x72/0xc7
             [<c0467df0>] cgroup_diput+0x4a/0xb2
             [<c04ca473>] dentry_iput+0x93/0xb7
             [<c04ca4b3>] d_kill+0x1c/0x36
             [<c04cb5c5>] dput+0xf5/0xfe
             [<c04c6084>] do_rmdir+0x95/0xbe
             [<c04c60ec>] sys_rmdir+0x10/0x12
             [<c04027cc>] sysenter_do_call+0x12/0x32
      
      -> #0 (blkio_list_lock){+.+...}:
             [<c0458117>] validate_chain+0x61c/0xb9c
             [<c0458dba>] __lock_acquire+0x723/0x789
             [<c0458eb0>] lock_acquire+0x90/0xa7
             [<c06929fd>] _raw_spin_lock+0x1e/0x4e
             [<c053a990>] blkiocg_weight_write+0x82/0x9e
             [<c0467f1e>] cgroup_file_write+0xc6/0x1c0
             [<c04bd2f3>] vfs_write+0x8c/0x116
             [<c04bd7c6>] sys_write+0x3b/0x60
             [<c04027cc>] sysenter_do_call+0x12/0x32
      
      other info that might help us debug this:
      
      1 lock held by test_io_control/7357:
       #0:  (&(&blkcg->lock)->rlock){......}, at: [<c053a949>] blkiocg_weight_write+0x3b/0x9e
      stack backtrace:
      Pid: 7357, comm: test_io_control Not tainted 2.6.33-rc2 #1
      Call Trace:
       [<c045754f>] print_circular_bug+0x91/0x9d
       [<c0458117>] validate_chain+0x61c/0xb9c
       [<c0458dba>] __lock_acquire+0x723/0x789
       [<c0458eb0>] lock_acquire+0x90/0xa7
       [<c053a990>] ? blkiocg_weight_write+0x82/0x9e
       [<c06929fd>] _raw_spin_lock+0x1e/0x4e
       [<c053a990>] ? blkiocg_weight_write+0x82/0x9e
       [<c053a990>] blkiocg_weight_write+0x82/0x9e
       [<c0467f1e>] cgroup_file_write+0xc6/0x1c0
       [<c0454df5>] ? trace_hardirqs_off+0xb/0xd
       [<c044d93a>] ? cpu_clock+0x2e/0x44
       [<c050e6ec>] ? security_file_permission+0xf/0x11
       [<c04bcdda>] ? rw_verify_area+0x8a/0xad
       [<c0467e58>] ? cgroup_file_write+0x0/0x1c0
       [<c04bd2f3>] vfs_write+0x8c/0x116
       [<c04bd7c6>] sys_write+0x3b/0x60
       [<c04027cc>] sysenter_do_call+0x12/0x32
      
      To prevent deadlock, we should take locks as following sequence:
      
      blkio_list_lock -> queue_lock ->  blkcg_lock.
      
      The following patch should fix this bug.
      
      Signed-off-by: default avatarGui Jianfeng <guijianfeng@cn.fujitsu.com>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      bcf4dd43
  4. Jan 11, 2010
  5. Dec 29, 2009
  6. Dec 28, 2009
  7. Dec 21, 2009
  8. Dec 18, 2009
    • Vivek Goyal's avatar
      cfq-iosched: Remove prio_change logic for workload selection · 65b32a57
      Vivek Goyal authored
      
      
      o CFQ now internally divides cfq queues in therr workload categories. sync-idle,
        sync-noidle and async. Which workload to run depends primarily on rb_key
        offset across three service trees. Which is a combination of mulitiple things
        including what time queue got queued on the service tree.
      
        There is one exception though. That is if we switched the prio class, say
        we served some RT tasks and again started serving BE class, then with-in
        BE class we always started with sync-noidle workload irrespective of rb_key
        offset in service trees.
      
        This can provide better latencies for sync-noidle workload in the presence
        of RT tasks.
      
      o This patch gets rid of that exception and which workload to run with-in
        class always depends on lowest rb_key across service trees. The reason
        being that now we have multiple BE class groups and if we always switch
        to sync-noidle workload with-in group, we can potentially starve a sync-idle
        workload with-in group. Same is true for async workload which will be in
        root group. Also the workload-switching with-in group will become very
        unpredictable as it now depends whether some RT workload was running in
        the system or not.
      
      Signed-off-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Reviewed-by: default avatarGui Jianfeng <guijianfeng@cn.fujitsu.com>
      Acked-by: default avatarCorrado Zoccolo <czoccolo@gmail.com>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      65b32a57
    • Vivek Goyal's avatar
      cfq-iosched: Get rid of nr_groups · fb104db4
      Vivek Goyal authored
      
      
      o Currently code does not seem to be using cfqd->nr_groups. Get rid of it.
      
      Signed-off-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Reviewed-by: default avatarGui Jianfeng <guijianfeng@cn.fujitsu.com>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      fb104db4
    • Vivek Goyal's avatar
      cfq-iosched: Remove the check for same cfq group from allow_merge · 1db32c40
      Vivek Goyal authored
      
      
      o allow_merge() already checks if submitting task is pointing to same cfqq
        as rq has been queued in. If everything is fine, we should not be having
        a task in one cgroup and having a pointer to cfqq in other cgroup.
      
        Well I guess in some situations it can happen and that is, when a random
        IO queue has been moved into root cgroup for group_isolation=0. In
        this case, tasks's cgroup/group is different from where actually cfqq is,
        but this is intentional and in this case merging should be allowed.
      
        The second situation is where due to close cooperator patches, multiple
        processes can be sharing a cfqq. If everything implemented right, we should
        not end up in a situation where tasks from different processes in different
        groups are sharing the same cfqq as we allow merging of cooperating queues
        only if they are in same group.
      
      Signed-off-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Reviewed-by: default avatarGui Jianfeng <guijianfeng@cn.fujitsu.com>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      1db32c40
  9. Dec 16, 2009
    • Jens Axboe's avatar
      block: temporarily disable discard granularity · b568be62
      Jens Axboe authored
      
      
      Commit 86b37281 adds a check for
      misaligned stacking offsets, but it's buggy since the defaults are 0.
      Hence all dm devices that pass in a non-zero starting offset will
      be marked as misaligned amd dm will complain.
      
      A real fix is coming, in the mean time disable the discard granularity
      check so that users don't worry about dm reporting about misaligned
      devices.
      
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      b568be62
  10. Dec 15, 2009
  11. Dec 10, 2009
    • Vivek Goyal's avatar
      Fix a CFQ crash in "for-2.6.33" branch of block tree · 82bbbf28
      Vivek Goyal authored
      
      
      I think my previous patch introduced a bug which can lead to CFQ hitting
      BUG_ON().
      
      The offending commit in for-2.6.33 branch is.
      
      commit 7667aa06
      Author: Vivek Goyal <vgoyal@redhat.com>
      Date:   Tue Dec 8 17:52:58 2009 -0500
      
          cfq-iosched: Take care of corner cases of group losing share due to deletion
      
      While doing some stress testing on my box, I enountered following.
      
      login: [ 3165.148841] BUG: scheduling while
      atomic: swapper/0/0x10000100
      [ 3165.149821] Modules linked in: cfq_iosched dm_multipath qla2xxx igb
      scsi_transport_fc dm_snapshot [last unloaded: scsi_wait_scan]
      [ 3165.149821] Pid: 0, comm: swapper Not tainted
      2.6.32-block-for-33-merged-new #3
      [ 3165.149821] Call Trace:
      [ 3165.149821]  <IRQ>  [<ffffffff8103fab8>] __schedule_bug+0x5c/0x60
      [ 3165.149821]  [<ffffffff8103afd7>] ? __wake_up+0x44/0x4d
      [ 3165.149821]  [<ffffffff8153a979>] schedule+0xe3/0x7bc
      [ 3165.149821]  [<ffffffff8103a796>] ? cpumask_next+0x1d/0x1f
      [ 3165.149821]  [<ffffffffa000b21d>] ? cfq_dispatch_requests+0x6ba/0x93e
      [cfq_iosched]
      [ 3165.149821]  [<ffffffff810422d8>] __cond_resched+0x2a/0x35
      [ 3165.149821]  [<ffffffffa000b21d>] ? cfq_dispatch_requests+0x6ba/0x93e
      [cfq_iosched]
      [ 3165.149821]  [<ffffffff8153b1ee>] _cond_resched+0x2c/0x37
      [ 3165.149821]  [<ffffffff8100e2db>] is_valid_bugaddr+0x16/0x2f
      [ 3165.149821]  [<ffffffff811e4161>] report_bug+0x18/0xac
      [ 3165.149821]  [<ffffffff8100f1fc>] die+0x39/0x63
      [ 3165.149821]  [<ffffffff8153cde1>] do_trap+0x11a/0x129
      [ 3165.149821]  [<ffffffff8100d470>] do_invalid_op+0x96/0x9f
      [ 3165.149821]  [<ffffffffa000b21d>] ? cfq_dispatch_requests+0x6ba/0x93e
      [cfq_iosched]
      [ 3165.149821]  [<ffffffff81034b4d>] ? enqueue_task+0x5c/0x67
      [ 3165.149821]  [<ffffffff8103ae83>] ? task_rq_unlock+0x11/0x13
      [ 3165.149821]  [<ffffffff81041aae>] ? try_to_wake_up+0x292/0x2a4
      [ 3165.149821]  [<ffffffff8100c935>] invalid_op+0x15/0x20
      [ 3165.149821]  [<ffffffffa000b21d>] ? cfq_dispatch_requests+0x6ba/0x93e
      [cfq_iosched]
      [ 3165.149821]  [<ffffffff810df5a6>] ? virt_to_head_page+0xe/0x2f
      [ 3165.149821]  [<ffffffff811d8c2a>] blk_peek_request+0x191/0x1a7
      [ 3165.149821]  [<ffffffff811e5b8d>] ? kobject_get+0x1a/0x21
      [ 3165.149821]  [<ffffffff812c8d4c>] scsi_request_fn+0x82/0x3df
      [ 3165.149821]  [<ffffffff8110b2de>] ? bio_fs_destructor+0x15/0x17
      [ 3165.149821]  [<ffffffff810df5a6>] ? virt_to_head_page+0xe/0x2f
      [ 3165.149821]  [<ffffffff811d931f>] __blk_run_queue+0x42/0x71
      [ 3165.149821]  [<ffffffff811d9403>] blk_run_queue+0x26/0x3a
      [ 3165.149821]  [<ffffffff812c8761>] scsi_run_queue+0x2de/0x375
      [ 3165.149821]  [<ffffffff812b60ac>] ? put_device+0x17/0x19
      [ 3165.149821]  [<ffffffff812c92d7>] scsi_next_command+0x3b/0x4b
      [ 3165.149821]  [<ffffffff812c9b9f>] scsi_io_completion+0x1c9/0x3f5
      [ 3165.149821]  [<ffffffff812c3c36>] scsi_finish_command+0xb5/0xbe
      
      I think I have hit following BUG_ON() in cfq_dispatch_request().
      
      BUG_ON(RB_EMPTY_ROOT(&cfqq->sort_list));
      
      Please find attached the patch to fix it. I have done some stress testing
      with it and have not seen it happening again.
      
      o We should wait on a queue even after slice expiry only if it is empty. If
        queue is not empty then continue to expire it.
      
      o If we decide to keep the queue then make cfqq=NULL. Otherwise select_queue()
        will return a valid cfqq and cfq_dispatch_request() can hit following
        BUG_ON().
      
        BUG_ON(RB_EMPTY_ROOT(&cfqq->sort_list))
      
      Reviewed-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      82bbbf28
    • Gui Jianfeng's avatar
      cfq: Remove wait_request flag when idle time is being deleted · 554554f6
      Gui Jianfeng authored
      
      
      Remove wait_request flag when idle time is being deleted, otherwise
      it'll hit this path every time when a request is enqueued.
      
      Signed-off-by: default avatarGui Jianfeng <guijianfeng@cn.fujitsu.com>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      554554f6
  12. Dec 09, 2009
  13. Dec 07, 2009
  14. Dec 06, 2009
  15. Dec 04, 2009
  16. Dec 03, 2009
    • Jens Axboe's avatar
      2f5ea477
    • Vivek Goyal's avatar
      blkio: Wait on sync-noidle queue even if rq_noidle = 1 · c04645e5
      Vivek Goyal authored
      
      
      o rq_noidle() is supposed to tell cfq that do not expect a request after this
        one, hence don't idle. But this does not seem to work very well. For example
        for direct random readers, rq_noidle = 1 but there is next request coming
        after this. Not idling, leads to a group not getting its share even if
        group_isolation=1.
      
      o The right solution for this issue is to scan the higher layers and set
        right flag (WRITE_SYNC or WRITE_ODIRECT). For the time being, this single
        line fix helps. This should not have any significant impact when we are
        not using cgroups. I will later figure out IO paths in higher layer and
        fix it.
      
      Signed-off-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      c04645e5
    • Vivek Goyal's avatar
      blkio: Implement group_isolation tunable · ae30c286
      Vivek Goyal authored
      
      
      o If a group is running only a random reader, then it will not have enough
        traffic to keep disk busy and we will reduce overall throughput. This
        should result in better latencies for random reader though. If we don't
        idle on random reader service tree, then this random reader will experience
        large latencies if there are other groups present in system with sequential
        readers running in these.
      
      o One solution suggested by corrado is that by default keep the random readers
        or sync-noidle workload in root group so that during one dispatch round
        we idle only once on sync-noidle tree. This means that all the sync-idle
        workload queues will be in their respective group and we will see service
        differentiation in those but not on sync-noidle workload.
      
      o Provide a tunable group_isolation. If set, this will make sure that even
        sync-noidle queues go in their respective group and we wait on these. This
        provides stronger isolation between groups but at the expense of throughput
        if group does not have enough traffic to keep the disk busy.
      
      o By default group_isolation = 0
      
      Signed-off-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      ae30c286
    • Vivek Goyal's avatar
      blkio: Determine async workload length based on total number of queues · f26bd1f0
      Vivek Goyal authored
      
      
      o Async queues are not per group. Instead these are system wide and maintained
        in root group. Hence their workload slice length should be calculated
        based on total number of queues in the system and not just queues in the
        root group.
      
      o As root group's default weight is 1000, make sure to charge async queue
        more in terms of vtime so that it does not get more time on disk because
        root group has higher weight.
      
      Signed-off-by: default avatarVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: default avatarJens Axboe <jens.axboe@oracle.com>
      f26bd1f0
Loading