Skip to content
  1. Aug 04, 2021
    • Steven Rostedt (VMware)'s avatar
      tracing / histogram: Give calculation hist_fields a size · 2c05caa7
      Steven Rostedt (VMware) authored
      When working on my user space applications, I found a bug in the synthetic
      event code where the automated synthetic event field was not matching the
      event field calculation it was attached to. Looking deeper into it, it was
      because the calculation hist_field was not given a size.
      
      The synthetic event fields are matched to their hist_fields either by
      having the field have an identical string type, or if that does not match,
      then the size and signed values are used to match the fields.
      
      The problem arose when I tried to match a calculation where the fields
      were "unsigned int". My tool created a synthetic event of type "u32". But
      it failed to match. The string was:
      
        diff=field1-field2:onmatch(event).trace(synth,$diff)
      
      Adding debugging into the kernel, I found that the size of "diff" was 0.
      And since it was given "unsigned int" as a type, the histogram fallback
      code used size and signed. The signed matched, but the size of u32 (4) did
      not match zero, and the event failed to be created.
      
      This can be worse if the field you want to match is not one of the
      acceptable fields for a synthetic event. As event fields can have any type
      that is supported in Linux, this can cause an issue. For example, if a
      type is an enum. Then there's no way to use that with any calculations.
      
      Have the calculation field simply take on the size of what it is
      calculating.
      
      Link: https://lkml.kernel.org/r/20210730171951.59c7743f@oasis.local.home
      
      
      
      Cc: Tom Zanussi <zanussi@kernel.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@vger.kernel.org
      Fixes: 100719dc ("tracing: Add simple expression support to hist triggers")
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      2c05caa7
  2. Jul 30, 2021
  3. Jul 23, 2021
    • Colin Ian King's avatar
      ftrace: Remove redundant initialization of variable ret · 3b1a8f45
      Colin Ian King authored
      The variable ret is being initialized with a value that is never
      read, it is being updated later on. The assignment is redundant and
      can be removed.
      
      Link: https://lkml.kernel.org/r/20210721120915.122278-1-colin.king@canonical.com
      
      
      
      Addresses-Coverity: ("Unused value")
      Signed-off-by: default avatarColin Ian King <colin.king@canonical.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      3b1a8f45
    • Nicolas Saenz Julienne's avatar
      ftrace: Avoid synchronize_rcu_tasks_rude() call when not necessary · 68e83498
      Nicolas Saenz Julienne authored
      synchronize_rcu_tasks_rude() triggers IPIs and forces rescheduling on
      all CPUs. It is a costly operation and, when targeting nohz_full CPUs,
      very disrupting (hence the name). So avoid calling it when 'old_hash'
      doesn't need to be freed.
      
      Link: https://lkml.kernel.org/r/20210721114726.1545103-1-nsaenzju@redhat.com
      
      
      
      Signed-off-by: default avatarNicolas Saenz Julienne <nsaenzju@redhat.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      68e83498
    • Steven Rostedt (VMware)'s avatar
      tracing: Clean up alloc_synth_event() · 9528c195
      Steven Rostedt (VMware) authored
      alloc_synth_event() currently has the following code to initialize the
      event fields and dynamic_fields:
      
      	for (i = 0, j = 0; i < n_fields; i++) {
      		event->fields[i] = fields[i];
      
      		if (fields[i]->is_dynamic) {
      			event->dynamic_fields[j] = fields[i];
      			event->dynamic_fields[j]->field_pos = i;
      			event->dynamic_fields[j++] = fields[i];
      			event->n_dynamic_fields++;
      		}
      	}
      
      1) It would make more sense to have all fields keep track of their
         field_pos.
      
      2) event->dynmaic_fields[j] is assigned twice for no reason.
      
      3) We can move updating event->n_dynamic_fields outside the loop, and just
         assign it to j.
      
      This combination makes the code much cleaner.
      
      Link: https://lkml.kernel.org/r/20210721195341.29bb0f77@oasis.local.home
      
      
      
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      9528c195
    • Steven Rostedt (VMware)'s avatar
      tracing/histogram: Rename "cpu" to "common_cpu" · 1e3bac71
      Steven Rostedt (VMware) authored
      Currently the histogram logic allows the user to write "cpu" in as an
      event field, and it will record the CPU that the event happened on.
      
      The problem with this is that there's a lot of events that have "cpu"
      as a real field, and using "cpu" as the CPU it ran on, makes it
      impossible to run histograms on the "cpu" field of events.
      
      For example, if I want to have a histogram on the count of the
      workqueue_queue_work event on its cpu field, running:
      
       ># echo 'hist:keys=cpu' > events/workqueue/workqueue_queue_work/trigger
      
      Gives a misleading and wrong result.
      
      Change the command to "common_cpu" as no event should have "common_*"
      fields as that's a reserved name for fields used by all events. And
      this makes sense here as common_cpu would be a field used by all events.
      
      Now we can even do:
      
       ># echo 'hist:keys=common_cpu,cpu if cpu < 100' > events/workqueue/workqueue_queue_work/trigger
       ># cat events/workqueue/workqueue_queue_work/hist
       # event histogram
       #
       # trigger info: hist:keys=common_cpu,cpu:vals=hitcount:sort=hitcount:size=2048 if cpu < 100 [active]
       #
      
       { common_cpu:          0, cpu:          2 } hitcount:          1
       { common_cpu:          0, cpu:          4 } hitcount:          1
       { common_cpu:          7, cpu:          7 } hitcount:          1
       { common_cpu:          0, cpu:          7 } hitcount:          1
       { common_cpu:          0, cpu:          1 } hitcount:          1
       { common_cpu:          0, cpu:          6 } hitcount:          2
       { common_cpu:          0, cpu:          5 } hitcount:          2
       { common_cpu:          1, cpu:          1 } hitcount:          4
       { common_cpu:          6, cpu:          6 } hitcount:          4
       { common_cpu:          5, cpu:          5 } hitcount:         14
       { common_cpu:          4, cpu:          4 } hitcount:         26
       { common_cpu:          0, cpu:          0 } hitcount:         39
       { common_cpu:          2, cpu:          2 } hitcount:        184
      
      Now for backward compatibility, I added a trick. If "cpu" is used, and
      the field is not found, it will fall back to "common_cpu" and work as
      it did before. This way, it will still work for old programs that use
      "cpu" to get the actual CPU, but if the event has a "cpu" as a field, it
      will get that event's "cpu" field, which is probably what it wants
      anyway.
      
      I updated the tracefs/README to include documentation about both the
      common_timestamp and the common_cpu. This way, if that text is present in
      the README, then an application can know that common_cpu is supported over
      just plain "cpu".
      
      Link: https://lkml.kernel.org/r/20210721110053.26b4f641@oasis.local.home
      
      
      
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@vger.kernel.org
      Fixes: 8b7622bf ("tracing: Add cpu field for hist triggers")
      Reviewed-by: default avatarTom Zanussi <zanussi@kernel.org>
      Reviewed-by: default avatarMasami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      1e3bac71
    • Steven Rostedt (VMware)'s avatar
      tracing: Synthetic event field_pos is an index not a boolean · 3b13911a
      Steven Rostedt (VMware) authored
      Performing the following:
      
       ># echo 'wakeup_lat s32 pid; u64 delta; char wake_comm[]' > synthetic_events
       ># echo 'hist:keys=pid:__arg__1=common_timestamp.usecs' > events/sched/sched_waking/trigger
       ># echo 'hist:keys=next_pid:pid=next_pid,delta=common_timestamp.usecs-$__arg__1:onmatch(sched.sched_waking).trace(wakeup_lat,$pid,$delta,prev_comm)'\
            > events/sched/sched_switch/trigger
       ># echo 1 > events/synthetic/enable
      
      Crashed the kernel:
      
       BUG: kernel NULL pointer dereference, address: 000000000000001b
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 0 P4D 0
       Oops: 0000 [#1] PREEMPT SMP
       CPU: 7 PID: 0 Comm: swapper/7 Not tainted 5.13.0-rc5-test+ #104
       Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
       RIP: 0010:strlen+0x0/0x20
       Code: f6 82 80 2b 0b bc 20 74 11 0f b6 50 01 48 83 c0 01 f6 82 80 2b 0b bc
        20 75 ef c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 <80> 3f 00 74 10
        48 89 f8 48 83 c0 01 80 38 9 f8 c3 31
       RSP: 0018:ffffaa75000d79d0 EFLAGS: 00010046
       RAX: 0000000000000002 RBX: ffff9cdb55575270 RCX: 0000000000000000
       RDX: ffff9cdb58c7a320 RSI: ffffaa75000d7b40 RDI: 000000000000001b
       RBP: ffffaa75000d7b40 R08: ffff9cdb40a4f010 R09: ffffaa75000d7ab8
       R10: ffff9cdb4398c700 R11: 0000000000000008 R12: ffff9cdb58c7a320
       R13: ffff9cdb55575270 R14: ffff9cdb58c7a000 R15: 0000000000000018
       FS:  0000000000000000(0000) GS:ffff9cdb5aa00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 000000000000001b CR3: 00000000c0612006 CR4: 00000000001706e0
       Call Trace:
        trace_event_raw_event_synth+0x90/0x1d0
        action_trace+0x5b/0x70
        event_hist_trigger+0x4bd/0x4e0
        ? cpumask_next_and+0x20/0x30
        ? update_sd_lb_stats.constprop.0+0xf6/0x840
        ? __lock_acquire.constprop.0+0x125/0x550
        ? find_held_lock+0x32/0x90
        ? sched_clock_cpu+0xe/0xd0
        ? lock_release+0x155/0x440
        ? update_load_avg+0x8c/0x6f0
        ? enqueue_entity+0x18a/0x920
        ? __rb_reserve_next+0xe5/0x460
        ? ring_buffer_lock_reserve+0x12a/0x3f0
        event_triggers_call+0x52/0xe0
        trace_event_buffer_commit+0x1ae/0x240
        trace_event_raw_event_sched_switch+0x114/0x170
        __traceiter_sched_switch+0x39/0x50
        __schedule+0x431/0xb00
        schedule_idle+0x28/0x40
        do_idle+0x198/0x2e0
        cpu_startup_entry+0x19/0x20
        secondary_startup_64_no_verify+0xc2/0xcb
      
      The reason is that the dynamic events array keeps track of the field
      position of the fields array, via the field_pos variable in the
      synth_field structure. Unfortunately, that field is a boolean for some
      reason, which means any field_pos greater than 1 will be a bug (in this
      case it was 2).
      
      Link: https://lkml.kernel.org/r/20210721191008.638bce34@oasis.local.home
      
      
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@vger.kernel.org
      Fixes: bd82631d ("tracing: Add support for dynamic strings to synthetic events")
      Reviewed-by: default avatarTom Zanussi <zanussi@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      3b13911a
  4. Jul 22, 2021
    • Haoran Luo's avatar
      tracing: Fix bug in rb_per_cpu_empty() that might cause deadloop. · 67f0d6d9
      Haoran Luo authored
      The "rb_per_cpu_empty()" misinterpret the condition (as not-empty) when
      "head_page" and "commit_page" of "struct ring_buffer_per_cpu" points to
      the same buffer page, whose "buffer_data_page" is empty and "read" field
      is non-zero.
      
      An error scenario could be constructed as followed (kernel perspective):
      
      1. All pages in the buffer has been accessed by reader(s) so that all of
      them will have non-zero "read" field.
      
      2. Read and clear all buffer pages so that "rb_num_of_entries()" will
      return 0 rendering there's no more data to read. It is also required
      that the "read_page", "commit_page" and "tail_page" points to the same
      page, while "head_page" is the next page of them.
      
      3. Invoke "ring_buffer_lock_reserve()" with large enough "length"
      so that it shot pass the end of current tail buffer page. Now the
      "head_page", "commit_page" and "tail_page" points to the same page.
      
      4. Discard current event with "ring_buffer_discard_commit()", so that
      "head_page", "commit_page" and "tail_page" points to a page whose buffer
      data page is now empty.
      
      When the error scenario has been constructed, "tracing_read_pipe" will
      be trapped inside a deadloop: "trace_empty()" returns 0 since
      "rb_per_cpu_empty()" returns 0 when it hits the CPU containing such
      constructed ring buffer. Then "trace_find_next_entry_inc()" always
      return NULL since "rb_num_of_entries()" reports there's no more entry
      to read. Finally "trace_seq_to_user()" returns "-EBUSY" spanking
      "tracing_read_pipe" back to the start of the "waitagain" loop.
      
      I've also written a proof-of-concept script to construct the scenario
      and trigger the bug automatically, you can use it to trace and validate
      my reasoning above:
      
        https://github.com/aegistudio/RingBufferDetonator.git
      
      Tests has been carried out on linux kernel 5.14-rc2
      (2734d6c1), my fixed version
      of kernel (for testing whether my update fixes the bug) and
      some older kernels (for range of affected kernels). Test result is
      also attached to the proof-of-concept repository.
      
      Link: https://lore.kernel.org/linux-trace-devel/YPaNxsIlb2yjSi5Y@aegistudio/
      Link: https://lore.kernel.org/linux-trace-devel/YPgrN85WL9VyrZ55@aegistudio
      
      
      
      Cc: stable@vger.kernel.org
      Fixes: bf41a158 ("ring-buffer: make reentrant")
      Suggested-by: default avatarLinus Torvalds <torvalds@linuxfoundation.org>
      Signed-off-by: default avatarHaoran Luo <www@aegistudio.net>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      67f0d6d9
  5. Jul 15, 2021
    • Steven Rostedt (VMware)'s avatar
      tracing: Do not reference char * as a string in histograms · 704adfb5
      Steven Rostedt (VMware) authored
      The histogram logic was allowing events with char * pointers to be used as
      normal strings. But it was easy to crash the kernel with:
      
       # echo 'hist:keys=filename' > events/syscalls/sys_enter_openat/trigger
      
      And open some files, and boom!
      
       BUG: unable to handle page fault for address: 00007f2ced0c3280
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 1173fa067 P4D 1173fa067 PUD 1171b6067 PMD 1171dd067 PTE 0
       Oops: 0000 [#1] PREEMPT SMP
       CPU: 6 PID: 1810 Comm: cat Not tainted 5.13.0-rc5-test+ #61
       Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01
      v03.03 07/14/2016
       RIP: 0010:strlen+0x0/0x20
       Code: f6 82 80 2a 0b a9 20 74 11 0f b6 50 01 48 83 c0 01 f6 82 80 2a 0b
      a9 20 75 ef c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 <80> 3f 00 74
      10 48 89 f8 48 83 c0 01 80 38 00 75 f7 48 29 f8 c3
      
       RSP: 0018:ffffbdbf81567b50 EFLAGS: 00010246
       RAX: 0000000000000003 RBX: ffff93815cdb3800 RCX: ffff9382401a22d0
       RDX: 0000000000000100 RSI: 0000000000000000 RDI: 00007f2ced0c3280
       RBP: 0000000000000100 R08: ffff9382409ff074 R09: ffffbdbf81567c98
       R10: ffff9382409ff074 R11: 0000000000000000 R12: ffff9382409ff074
       R13: 0000000000000001 R14: ffff93815a744f00 R15: 00007f2ced0c3280
       FS:  00007f2ced0f8580(0000) GS:ffff93825a800000(0000)
      knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f2ced0c3280 CR3: 0000000107069005 CR4: 00000000001706e0
       Call Trace:
        event_hist_trigger+0x463/0x5f0
        ? find_held_lock+0x32/0x90
        ? sched_clock_cpu+0xe/0xd0
        ? lock_release+0x155/0x440
        ? kernel_init_free_pages+0x6d/0x90
        ? preempt_count_sub+0x9b/0xd0
        ? kernel_init_free_pages+0x6d/0x90
        ? get_page_from_freelist+0x12c4/0x1680
        ? __rb_reserve_next+0xe5/0x460
        ? ring_buffer_lock_reserve+0x12a/0x3f0
        event_triggers_call+0x52/0xe0
        ftrace_syscall_enter+0x264/0x2c0
        syscall_trace_enter.constprop.0+0x1ee/0x210
        do_syscall_64+0x1c/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Where it triggered a fault on strlen(key) where key was the filename.
      
      The reason is that filename is a char * to user space, and the histogram
      code just blindly dereferenced it, with obvious bad results.
      
      I originally tried to use strncpy_from_user/kernel_nofault() but found
      that there's other places that its dereferenced and not worth the effort.
      
      Just do not allow "char *" to act like strings.
      
      Link: https://lkml.kernel.org/r/20210715000206.025df9d2@rorschach.local.home
      
      
      
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
      Cc: stable@vger.kernel.org
      Acked-by: default avatarNamhyung Kim <namhyung@kernel.org>
      Acked-by: default avatarTom Zanussi <zanussi@kernel.org>
      Fixes: 79e577cb ("tracing: Support string type key properly")
      Fixes: 5967bd5c ("tracing: Let filter_assign_type() detect FILTER_PTR_STRING")
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      704adfb5
  6. Jul 08, 2021
  7. Jul 07, 2021
    • Steven Rostedt (VMware)'s avatar
      tracing/histograms: Fix parsing of "sym-offset" modifier · 26c56373
      Steven Rostedt (VMware) authored
      With the addition of simple mathematical operations (plus and minus), the
      parsing of the "sym-offset" modifier broke, as it took the '-' part of the
      "sym-offset" as a minus, and tried to break it up into a mathematical
      operation of "field.sym - offset", in which case it failed to parse
      (unless the event had a field called "offset").
      
      Both .sym and .sym-offset modifiers should not be entered into
      mathematical calculations anyway. If ".sym-offset" is found in the
      modifier, then simply make it not an operation that can be calculated on.
      
      Link: https://lkml.kernel.org/r/20210707110821.188ae255@oasis.local.home
      
      
      
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: 100719dc ("tracing: Add simple expression support to hist triggers")
      Reviewed-by: default avatarTom Zanussi <zanussi@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      26c56373
  8. Jul 01, 2021
  9. Jun 30, 2021
    • Paul Burton's avatar
      tracing: Simplify & fix saved_tgids logic · b81b3e95
      Paul Burton authored
      The tgid_map array records a mapping from pid to tgid, where the index
      of an entry within the array is the pid & the value stored at that index
      is the tgid.
      
      The saved_tgids_next() function iterates over pointers into the tgid_map
      array & dereferences the pointers which results in the tgid, but then it
      passes that dereferenced value to trace_find_tgid() which treats it as a
      pid & does a further lookup within the tgid_map array. It seems likely
      that the intent here was to skip over entries in tgid_map for which the
      recorded tgid is zero, but instead we end up skipping over entries for
      which the thread group leader hasn't yet had its own tgid recorded in
      tgid_map.
      
      A minimal fix would be to remove the call to trace_find_tgid, turning:
      
        if (trace_find_tgid(*ptr))
      
      into:
      
        if (*ptr)
      
      ..but it seems like this logic can be much simpler if we simply let
      seq_read() iterate over the whole tgid_map array & filter out empty
      entries by returning SEQ_SKIP from saved_tgids_show(). Here we take that
      approach, removing the incorrect logic here entirely.
      
      Link: https://lkml.kernel.org/r/20210630003406.4013668-1-paulburton@google.com
      
      
      
      Fixes: d914ba37 ("tracing: Add support for recording tgid of tasks")
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarPaul Burton <paulburton@google.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      b81b3e95
    • Austin Kim's avatar
      tracing: Change variable type as bool for clean-up · bfbf8d15
      Austin Kim authored
      The wakeup_rt wakeup_dl, tracing_dl is only set to 0, 1.
      So changing type of wakeup_rt wakeup_dl, tracing_dl as bool
      makes relevant routine be more readable.
      
      Link: https://lkml.kernel.org/r/20210629140548.GA1627@raspberrypi
      
      
      
      Signed-off-by: default avatarAustin Kim <austin.kim@lge.com>
      [ Removed unneeded initialization of static bool tracing_dl ]
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      bfbf8d15
  10. Jun 29, 2021
  11. Jun 28, 2021
  12. Jun 25, 2021
    • Daniel Bristot de Oliveira's avatar
      trace/osnoise: Support hotplug operations · c8895e27
      Daniel Bristot de Oliveira authored
      Enable and disable osnoise/timerlat thread during on CPU hotplug online
      and offline operations respectivelly.
      
      Link: https://lore.kernel.org/linux-doc/20210621134636.5b332226@oasis.local.home/
      Link: https://lkml.kernel.org/r/39f98590b3caeb3c32f09526214058efe0e9272a.1624372313.git.bristot@redhat.com
      
      
      
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Kate Carcia <kcarcia@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
      Cc: Clark Willaims <williams@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Suggested-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: default avatarDaniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      c8895e27
    • Daniel Bristot de Oliveira's avatar
      trace/hwlat: Support hotplug operations · ba998f7d
      Daniel Bristot de Oliveira authored
      Enable and disable hwlat thread during cpu hotplug online
      and offline operations, respectivelly.
      
      Link: https://lore.kernel.org/linux-doc/20210621134636.5b332226@oasis.local.home/
      Link: https://lkml.kernel.org/r/52012d25ea35491a0f8088b947864d8df8e25157.1624372313.git.bristot@redhat.com
      
      
      
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Kate Carcia <kcarcia@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
      Cc: Clark Willaims <williams@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Suggested-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: default avatarDaniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      ba998f7d
    • Daniel Bristot de Oliveira's avatar
      trace/hwlat: Protect kdata->kthread with get/put_online_cpus · 039a602d
      Daniel Bristot de Oliveira authored
      In preparation to the hotplug support, protect kdata->kthread
      with get/put_online_cpus() to avoid concurrency with hotplug
      operations.
      
      Link: https://lore.kernel.org/linux-doc/20210621134636.5b332226@oasis.local.home/
      Link: https://lkml.kernel.org/r/8bdb2a56f46abfd301d6fffbf43448380c09a6f5.1624372313.git.bristot@redhat.com
      
      
      
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Kate Carcia <kcarcia@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
      Cc: Clark Willaims <williams@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Suggested-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: default avatarDaniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      039a602d
    • Daniel Bristot de Oliveira's avatar
      trace: Add timerlat tracer · a955d7ea
      Daniel Bristot de Oliveira authored
      The timerlat tracer aims to help the preemptive kernel developers to
      found souces of wakeup latencies of real-time threads. Like cyclictest,
      the tracer sets a periodic timer that wakes up a thread. The thread then
      computes a *wakeup latency* value as the difference between the *current
      time* and the *absolute time* that the timer was set to expire. The main
      goal of timerlat is tracing in such a way to help kernel developers.
      
      Usage
      
      Write the ASCII text "timerlat" into the current_tracer file of the
      tracing system (generally mounted at /sys/kernel/tracing).
      
      For example:
      
              [root@f32 ~]# cd /sys/kernel/tracing/
              [root@f32 tracing]# echo timerlat > current_tracer
      
      It is possible to follow the trace by reading the trace trace file:
      
        [root@f32 tracing]# cat trace
        # tracer: timerlat
        #
        #                              _-----=> irqs-off
        #                             / _----=> need-resched
        #                            | / _---=> hardirq/softirq
        #                            || / _--=> preempt-depth
        #                            || /
        #                            ||||             ACTIVATION
        #         TASK-PID      CPU# ||||   TIMESTAMP    ID            CONTEXT                LATENCY
        #            | |         |   ||||      |         |                  |                       |
                <idle>-0       [000] d.h1    54.029328: #1     context    irq timer_latency       932 ns
                 <...>-867     [000] ....    54.029339: #1     context thread timer_latency     11700 ns
                <idle>-0       [001] dNh1    54.029346: #1     context    irq timer_latency      2833 ns
                 <...>-868     [001] ....    54.029353: #1     context thread timer_latency      9820 ns
                <idle>-0       [000] d.h1    54.030328: #2     context    irq timer_latency       769 ns
                 <...>-867     [000] ....    54.030330: #2     context thread timer_latency      3070 ns
                <idle>-0       [001] d.h1    54.030344: #2     context    irq timer_latency       935 ns
                 <...>-868     [001] ....    54.030347: #2     context thread timer_latency      4351 ns
      
      The tracer creates a per-cpu kernel thread with real-time priority that
      prints two lines at every activation. The first is the *timer latency*
      observed at the *hardirq* context before the activation of the thread.
      The second is the *timer latency* observed by the thread, which is the
      same level that cyclictest reports. The ACTIVATION ID field
      serves to relate the *irq* execution to its respective *thread* execution.
      
      The irq/thread splitting is important to clarify at which context
      the unexpected high value is coming from. The *irq* context can be
      delayed by hardware related actions, such as SMIs, NMIs, IRQs
      or by a thread masking interrupts. Once the timer happens, the delay
      can also be influenced by blocking caused by threads. For example, by
      postponing the scheduler execution via preempt_disable(),  by the
      scheduler execution, or by masking interrupts. Threads can
      also be delayed by the interference from other threads and IRQs.
      
      The timerlat can also take advantage of the osnoise: traceevents.
      For example:
      
              [root@f32 ~]# cd /sys/kernel/tracing/
              [root@f32 tracing]# echo timerlat > current_tracer
              [root@f32 tracing]# echo osnoise > set_event
              [root@f32 tracing]# echo 25 > osnoise/stop_tracing_total_us
              [root@f32 tracing]# tail -10 trace
                   cc1-87882   [005] d..h...   548.771078: #402268 context    irq timer_latency      1585 ns
                   cc1-87882   [005] dNLh1..   548.771082: irq_noise: local_timer:236 start 548.771077442 duration 4597 ns
                   cc1-87882   [005] dNLh2..   548.771083: irq_noise: reschedule:253 start 548.771083017 duration 56 ns
                   cc1-87882   [005] dNLh2..   548.771086: irq_noise: call_function_single:251 start 548.771083811 duration 2048 ns
                   cc1-87882   [005] dNLh2..   548.771088: irq_noise: call_function_single:251 start 548.771086814 duration 1495 ns
                   cc1-87882   [005] dNLh2..   548.771091: irq_noise: call_function_single:251 start 548.771089194 duration 1558 ns
                   cc1-87882   [005] dNLh2..   548.771094: irq_noise: call_function_single:251 start 548.771091719 duration 1932 ns
                   cc1-87882   [005] dNLh2..   548.771096: irq_noise: call_function_single:251 start 548.771094696 duration 1050 ns
                   cc1-87882   [005] d...3..   548.771101: thread_noise:      cc1:87882 start 548.771078243 duration 10909 ns
            timerlat/5-1035    [005] .......   548.771103: #402268 context thread timer_latency     25960 ns
      
      For further information see: Documentation/trace/timerlat-tracer.rst
      
      Link: https://lkml.kernel.org/r/71f18efc013e1194bcaea1e54db957de2b19ba62.1624372313.git.bristot@redhat.com
      
      
      
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Kate Carcia <kcarcia@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
      Cc: Clark Willaims <williams@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarDaniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      a955d7ea
    • Daniel Bristot de Oliveira's avatar
      trace: Add osnoise tracer · bce29ac9
      Daniel Bristot de Oliveira authored
      In the context of high-performance computing (HPC), the Operating System
      Noise (*osnoise*) refers to the interference experienced by an application
      due to activities inside the operating system. In the context of Linux,
      NMIs, IRQs, SoftIRQs, and any other system thread can cause noise to the
      system. Moreover, hardware-related jobs can also cause noise, for example,
      via SMIs.
      
      The osnoise tracer leverages the hwlat_detector by running a similar
      loop with preemption, SoftIRQs and IRQs enabled, thus allowing all
      the sources of *osnoise* during its execution. Using the same approach
      of hwlat, osnoise takes note of the entry and exit point of any
      source of interferences, increasing a per-cpu interference counter. The
      osnoise tracer also saves an interference counter for each source of
      interference. The interference counter for NMI, IRQs, SoftIRQs, and
      threads is increased anytime the tool observes these interferences' entry
      events. When a noise happens without any interference from the operating
      system level, the hardware noise counter increases, pointing to a
      hardware-related noise. In this way, osnoise can account for any
      source of interference. At the end of the period, the osnoise tracer
      prints the sum of all noise, the max single noise, the percentage of CPU
      available for the thread, and the counters for the noise sources.
      
      Usage
      
      Write the ASCII text "osnoise" into the current_tracer file of the
      tracing system (generally mounted at /sys/kernel/tracing).
      
      For example::
      
              [root@f32 ~]# cd /sys/kernel/tracing/
              [root@f32 tracing]# echo osnoise > current_tracer
      
      It is possible to follow the trace by reading the trace trace file::
      
              [root@f32 tracing]# cat trace
              # tracer: osnoise
              #
              #                                _-----=> irqs-off
              #                               / _----=> need-resched
              #                              | / _---=> hardirq/softirq
              #                              || / _--=> preempt-depth                            MAX
              #                              || /                                             SINGLE     Interference counters:
              #                              ||||               RUNTIME      NOISE   % OF CPU  NOISE    +-----------------------------+
              #           TASK-PID      CPU# ||||   TIMESTAMP    IN US       IN US  AVAILABLE  IN US     HW    NMI    IRQ   SIRQ THREAD
              #              | |         |   ||||      |           |             |    |            |      |      |      |      |      |
                         <...>-859     [000] ....    81.637220: 1000000        190  99.98100       9     18      0   1007     18      1
                         <...>-860     [001] ....    81.638154: 1000000        656  99.93440      74     23      0   1006     16      3
                         <...>-861     [002] ....    81.638193: 1000000       5675  99.43250     202      6      0   1013     25     21
                         <...>-862     [003] ....    81.638242: 1000000        125  99.98750      45      1      0   1011     23      0
                         <...>-863     [004] ....    81.638260: 1000000       1721  99.82790     168      7      0   1002     49     41
                         <...>-864     [005] ....    81.638286: 1000000        263  99.97370      57      6      0   1006     26      2
                         <...>-865     [006] ....    81.638302: 1000000        109  99.98910      21      3      0   1006     18      1
                         <...>-866     [007] ....    81.638326: 1000000       7816  99.21840     107      8      0   1016     39     19
      
      In addition to the regular trace fields (from TASK-PID to TIMESTAMP), the
      tracer prints a message at the end of each period for each CPU that is
      running an osnoise/CPU thread. The osnoise specific fields report:
      
       - The RUNTIME IN USE reports the amount of time in microseconds that
         the osnoise thread kept looping reading the time.
       - The NOISE IN US reports the sum of noise in microseconds observed
         by the osnoise tracer during the associated runtime.
       - The % OF CPU AVAILABLE reports the percentage of CPU available for
         the osnoise thread during the runtime window.
       - The MAX SINGLE NOISE IN US reports the maximum single noise observed
         during the runtime window.
       - The Interference counters display how many each of the respective
         interference happened during the runtime window.
      
      Note that the example above shows a high number of HW noise samples.
      The reason being is that this sample was taken on a virtual machine,
      and the host interference is detected as a hardware interference.
      
      Tracer options
      
      The tracer has a set of options inside the osnoise directory, they are:
      
       - osnoise/cpus: CPUs at which a osnoise thread will execute.
       - osnoise/period_us: the period of the osnoise thread.
       - osnoise/runtime_us: how long an osnoise thread will look for noise.
       - osnoise/stop_tracing_us: stop the system tracing if a single noise
         higher than the configured value happens. Writing 0 disables this
         option.
       - osnoise/stop_tracing_total_us: stop the system tracing if total noise
         higher than the configured value happens. Writing 0 disables this
         option.
       - tracing_threshold: the minimum delta between two time() reads to be
         considered as noise, in us. When set to 0, the default value will
         be used, which is currently 5 us.
      
      Additional Tracing
      
      In addition to the tracer, a set of tracepoints were added to
      facilitate the identification of the osnoise source.
      
       - osnoise:sample_threshold: printed anytime a noise is higher than
         the configurable tolerance_ns.
       - osnoise:nmi_noise: noise from NMI, including the duration.
       - osnoise:irq_noise: noise from an IRQ, including the duration.
       - osnoise:softirq_noise: noise from a SoftIRQ, including the
         duration.
       - osnoise:thread_noise: noise from a thread, including the duration.
      
      Note that all the values are *net values*. For example, if while osnoise
      is running, another thread preempts the osnoise thread, it will start a
      thread_noise duration at the start. Then, an IRQ takes place, preempting
      the thread_noise, starting a irq_noise. When the IRQ ends its execution,
      it will compute its duration, and this duration will be subtracted from
      the thread_noise, in such a way as to avoid the double accounting of the
      IRQ execution. This logic is valid for all sources of noise.
      
      Here is one example of the usage of these tracepoints::
      
             osnoise/8-961     [008] d.h.  5789.857532: irq_noise: local_timer:236 start 5789.857529929 duration 1845 ns
             osnoise/8-961     [008] dNh.  5789.858408: irq_noise: local_timer:236 start 5789.858404871 duration 2848 ns
           migration/8-54      [008] d...  5789.858413: thread_noise: migration/8:54 start 5789.858409300 duration 3068 ns
             osnoise/8-961     [008] ....  5789.858413: sample_threshold: start 5789.858404555 duration 8723 ns interferences 2
      
      In this example, a noise sample of 8 microseconds was reported in the last
      line, pointing to two interferences. Looking backward in the trace, the
      two previous entries were about the migration thread running after a
      timer IRQ execution. The first event is not part of the noise because
      it took place one millisecond before.
      
      It is worth noticing that the sum of the duration reported in the
      tracepoints is smaller than eight us reported in the sample_threshold.
      The reason roots in the overhead of the entry and exit code that happens
      before and after any interference execution. This justifies the dual
      approach: measuring thread and tracing.
      
      Link: https://lkml.kernel.org/r/e649467042d60e7b62714c9c6751a56299d15119.1624372313.git.bristot@redhat.com
      
      
      
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Kate Carcia <kcarcia@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
      Cc: Clark Willaims <williams@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarDaniel Bristot de Oliveira <bristot@redhat.com>
      [
        Made the following functions static:
         trace_irqentry_callback()
         trace_irqexit_callback()
         trace_intel_irqentry_callback()
         trace_intel_irqexit_callback()
      
        Added to include/trace.h:
         osnoise_arch_register()
         osnoise_arch_unregister()
      
        Fixed define logic for LATENCY_FS_NOTIFY
      
      Reported-by: default avatarkernel test robot <lkp@intel.com>
      ]
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      bce29ac9
    • Steven Rostedt (VMware)'s avatar
      tracing: Add LATENCY_FS_NOTIFY to define if latency_fsnotify() is defined · 6880c987
      Steven Rostedt (VMware) authored
      
      
      With the coming addition of the osnoise tracer, the configs needed to
      include the latency_fsnotify() has become more complex, and to keep the
      declaration in the header file the same as in the C file, just have the
      logic needed to define it in one place, and that defines LATENCY_FS_NOTIFY
      which will be used in the C code.
      
      Reported-by: default avatarkernel test robot <lkp@intel.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      6880c987
    • Daniel Bristot de Oliveira's avatar
      trace/hwlat: Remove printk from sampling loop · aa892f8c
      Daniel Bristot de Oliveira authored
      hwlat has some time operation checks on the sample loop, and it is
      currently using pr_err (printk) to report them. The problem is that
      this can lead the system to an unresponsible state due to an overflow of
      printk messages. This problem can be mitigated by writing the error
      message to the trace buffer.
      
      Remove the printk messages from the sampling loop, switching the to
      messages in the trace buffer.
      
      No functional change.
      
      Link: https://lkml.kernel.org/r/9d77c34869748aa105e965c769d24642914eea3a.1624372313.git.bristot@redhat.com
      
      
      
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Kate Carcia <kcarcia@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
      Cc: Clark Willaims <williams@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarDaniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      aa892f8c
    • Daniel Bristot de Oliveira's avatar
      trace/hwlat: Use trace_min_max_param for width and window params · f27a1c9e
      Daniel Bristot de Oliveira authored
      Use the trace_min_max_param to reduce code duplication.
      
      No functional change.
      
      Link: https://lkml.kernel.org/r/b91accd5a7c6c14ea02d3379aae974ba22b47dd6.1624372313.git.bristot@redhat.com
      
      
      
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Kate Carcia <kcarcia@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
      Cc: Clark Willaims <williams@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarDaniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      f27a1c9e
    • Daniel Bristot de Oliveira's avatar
      trace: Add a generic function to read/write u64 values from tracefs · bc87cf0a
      Daniel Bristot de Oliveira authored
      The hwlat detector and (in preparation for) the osnoise/timerlat tracers
      have a set of u64 parameters that the user can read/write via tracefs.
      For instance, we have hwlat_detector's window and width.
      
      To reduce the code duplication, hwlat's window and width share the same
      read function. However, they do not share the write functions because
      they do different parameter checks. For instance, the width needs to
      be smaller than the window, while the window needs to be larger
      than the window. The same pattern repeats on osnoise/timerlat, and
      a large portion of the code was devoted to the write function.
      
      Despite having different checks, the write functions have the same
      structure:
      
         read a user-space buffer
         take the lock that protects the value
         check for minimum and maximum acceptable values
            save the value
         release the lock
         return success or error
      
      To reduce the code duplication also in the write functions, this patch
      provides a generic read and write implementation for u64 values that
      need to be within some minimum and/or maximum parameters, while
      (potentially) being protected by a lock.
      
      To use this interface, the structure trace_min_max_param needs to be
      filled:
      
       struct trace_min_max_param {
               struct mutex    *lock;
               u64             *val;
               u64             *min;
               u64             *max;
       };
      
      The desired value is stored on the variable pointed by *val. If *min
      points to a minimum acceptable value, it will be checked during the
      write operation. Likewise, if *max points to a maximum allowable value,
      it will be checked during the write operation. Finally, if *lock points
      to a mutex, it will be taken at the beginning of the operation and
      released at the end.
      
      The definition of a trace_min_max_param needs to passed as the
      (private) *data for tracefs_create_file(), and the trace_min_max_fops
      (added by this patch) as the *fops file_operations.
      
      Link: https://lkml.kernel.org/r/3e35760a7c8b5c55f16ae5ad5fc54a0e71cbe647.1624372313.git.bristot@redhat.com
      
      
      
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Kate Carcia <kcarcia@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
      Cc: Clark Willaims <williams@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarDaniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      bc87cf0a
    • Daniel Bristot de Oliveira's avatar
      trace/hwlat: Implement the per-cpu mode · f46b1652
      Daniel Bristot de Oliveira authored
      
      
      Implements the per-cpu mode in which a sampling thread is created for
      each cpu in the "cpus" (and tracing_mask).
      
      The per-cpu mode has the potention to speed up the hwlat detection by
      running on multiple CPUs at the same time, at the cost of higher cpu
      usage with irqs disabled. Use with care.
      
      [
        Changed get_cpu_data() to static.
      Reported-by: default avatarkernel test robot <lkp@intel.com>
      ]
      
      Link: https://lkml.kernel.org/r/ec06d0ab340e8460d293772faba19ad8a5c371aa.1624372313.git.bristot@redhat.com
      
      
      
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Kate Carcia <kcarcia@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
      Cc: Clark Willaims <williams@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarDaniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      f46b1652
  13. Jun 24, 2021
    • Daniel Bristot de Oliveira's avatar
      trace/hwlat: Switch disable_migrate to mode none · 7bb7d802
      Daniel Bristot de Oliveira authored
      When in the round-robin mode, if the tracer detects a change in the
      hwlatd thread affinity by an external tool, e.g., taskset, the
      round-robin logic is disabled. The disable_migrate variable currently
      tracks this.
      
      With the addition of the "mode" config and the mode "none," the
      disable_migrate logic is equivalent to switch to the "none" mode.
      
      Hence, instead of using a hidden variable to track this behavior,
      switch the mode to none, informing the user about this change.
      
      Link: https://lkml.kernel.org/r/a679af672458d6b1f62252605905c5214030f247.1624372313.git.bristot@redhat.com
      
      
      
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Kate Carcia <kcarcia@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
      Cc: Clark Willaims <williams@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarDaniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      7bb7d802
    • Daniel Bristot de Oliveira's avatar
      trace/hwlat: Implement the mode config option · 8fa826b7
      Daniel Bristot de Oliveira authored
      Provides the "mode" config to the hardware latency detector. hwlatd has
      two different operation modes. The default mode is the "round-robin" one,
      in which a single hwlatd thread runs, migrating among the allowed CPUs in a
      "round-robin" fashion. This is the current behavior.
      
      The "none" sets the allowed cpumask for a single hwlatd thread at the
      startup, but skips the round-robin, letting the scheduler handle the
      migration.
      
      In preparation to the per-cpu mode.
      
      Link: https://lkml.kernel.org/r/f3b1271262aa030c680e26615c1b9b2d71e55e92.1624372313.git.bristot@redhat.com
      
      
      
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Kate Carcia <kcarcia@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
      Cc: Clark Willaims <williams@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarDaniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      8fa826b7
    • Daniel Bristot de Oliveira's avatar
      trace/hwlat: Fix Clark's email · bb1b24cf
      Daniel Bristot de Oliveira authored
      Clark's email is williams@redhat.com.
      
      No functional change.
      
      Link: https://lkml.kernel.org/r/6fa4b49e17ab8a1ff19c335ab7cde38d8afb0e29.1624372313.git.bristot@redhat.com
      
      
      
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Kate Carcia <kcarcia@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
      Cc: Clark Willaims <williams@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarDaniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      bb1b24cf
  14. Jun 18, 2021
    • Steven Rostedt (VMware)'s avatar
      tracing: Do no increment trace_clock_global() by one · 89529d8b
      Steven Rostedt (VMware) authored
      
      
      The trace_clock_global() tries to make sure the events between CPUs is
      somewhat in order. A global value is used and updated by the latest read
      of a clock. If one CPU is ahead by a little, and is read by another CPU, a
      lock is taken, and if the timestamp of the other CPU is behind, it will
      simply use the other CPUs timestamp.
      
      The lock is also only taken with a "trylock" due to tracing, and strange
      recursions can happen. The lock is not taken at all in NMI context.
      
      In the case where the lock is not able to be taken, the non synced
      timestamp is returned. But it will not be less than the saved global
      timestamp.
      
      The problem arises because when the time goes "backwards" the time
      returned is the saved timestamp plus 1. If the lock is not taken, and the
      plus one to the timestamp is returned, there's a small race that can cause
      the time to go backwards!
      
      	CPU0				CPU1
      	----				----
      				trace_clock_global() {
      				    ts = clock() [ 1000 ]
      				    trylock(clock_lock) [ success ]
      				    global_ts = ts; [ 1000 ]
      
      				    <interrupted by NMI>
       trace_clock_global() {
          ts = clock() [ 999 ]
          if (ts < global_ts)
      	ts = global_ts + 1 [ 1001 ]
      
          trylock(clock_lock) [ fail ]
      
          return ts [ 1001]
       }
      				    unlock(clock_lock);
      				    return ts; [ 1000 ]
      				}
      
       trace_clock_global() {
          ts = clock() [ 1000 ]
          if (ts < global_ts) [ false 1000 == 1000 ]
      
          trylock(clock_lock) [ success ]
          global_ts = ts; [ 1000 ]
          unlock(clock_lock)
      
          return ts; [ 1000 ]
       }
      
      The above case shows to reads of trace_clock_global() on the same CPU, but
      the second read returns one less than the first read. That is, time when
      backwards, and this is not what is allowed by trace_clock_global().
      
      This was triggered by heavy tracing and the ring buffer checker that tests
      for the clock going backwards:
      
       Ring buffer clock went backwards: 20613921464 -> 20613921463
       ------------[ cut here ]------------
       WARNING: CPU: 2 PID: 0 at kernel/trace/ring_buffer.c:3412 check_buffer+0x1b9/0x1c0
       Modules linked in:
       [..]
       [CPU: 2]TIME DOES NOT MATCH expected:20620711698 actual:20620711697 delta:6790234 before:20613921463 after:20613921463
         [20613915818] PAGE TIME STAMP
         [20613915818] delta:0
         [20613915819] delta:1
         [20613916035] delta:216
         [20613916465] delta:430
         [20613916575] delta:110
         [20613916749] delta:174
         [20613917248] delta:499
         [20613917333] delta:85
         [20613917775] delta:442
         [20613917921] delta:146
         [20613918321] delta:400
         [20613918568] delta:247
         [20613918768] delta:200
         [20613919306] delta:538
         [20613919353] delta:47
         [20613919980] delta:627
         [20613920296] delta:316
         [20613920571] delta:275
         [20613920862] delta:291
         [20613921152] delta:290
         [20613921464] delta:312
         [20613921464] delta:0 TIME EXTEND
         [20613921464] delta:0
      
      This happened more than once, and always for an off by one result. It also
      started happening after commit aafe104a was added.
      
      Cc: stable@vger.kernel.org
      Fixes: aafe104a ("tracing: Restructure trace_clock_global() to never block")
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      89529d8b
    • Steven Rostedt (VMware)'s avatar
      tracing: Do not stop recording comms if the trace file is being read · 4fdd595e
      Steven Rostedt (VMware) authored
      
      
      A while ago, when the "trace" file was opened, tracing was stopped, and
      code was added to stop recording the comms to saved_cmdlines, for mapping
      of the pids to the task name.
      
      Code has been added that only records the comm if a trace event occurred,
      and there's no reason to not trace it if the trace file is opened.
      
      Cc: stable@vger.kernel.org
      Fixes: 7ffbd48d ("tracing: Cache comms only after an event occurred")
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      4fdd595e
    • Steven Rostedt (VMware)'s avatar
      tracing: Do not stop recording cmdlines when tracing is off · 85550c83
      Steven Rostedt (VMware) authored
      
      
      The saved_cmdlines is used to map pids to the task name, such that the
      output of the tracing does not just show pids, but also gives a human
      readable name for the task.
      
      If the name is not mapped, the output looks like this:
      
          <...>-1316          [005] ...2   132.044039: ...
      
      Instead of this:
      
          gnome-shell-1316    [005] ...2   132.044039: ...
      
      The names are updated when tracing is running, but are skipped if tracing
      is stopped. Unfortunately, this stops the recording of the names if the
      top level tracer is stopped, and not if there's other tracers active.
      
      The recording of a name only happens when a new event is written into a
      ring buffer, so there is no need to test if tracing is on or not. If
      tracing is off, then no event is written and no need to test if tracing is
      off or not.
      
      Remove the check, as it hides the names of tasks for events in the
      instance buffers.
      
      Cc: stable@vger.kernel.org
      Fixes: 7ffbd48d ("tracing: Cache comms only after an event occurred")
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      85550c83
Loading