  1. Jun 30, 2021
  2. Jun 29, 2021
  3. Jun 28, 2021
  4. Jun 25, 2021
    • trace/osnoise: Support hotplug operations · c8895e27
      Daniel Bristot de Oliveira authored
      Enable and disable the osnoise/timerlat thread during CPU hotplug online
      and offline operations, respectively.
      
      Link: https://lore.kernel.org/linux-doc/20210621134636.5b332226@oasis.local.home/
      Link: https://lkml.kernel.org/r/39f98590b3caeb3c32f09526214058efe0e9272a.1624372313.git.bristot@redhat.com
      
      
      
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Kate Carcia <kcarcia@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
      Cc: Clark Willaims <williams@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • trace/hwlat: Support hotplug operations · ba998f7d
      Daniel Bristot de Oliveira authored
      Enable and disable the hwlat thread during CPU hotplug online
      and offline operations, respectively.
      
      Link: https://lore.kernel.org/linux-doc/20210621134636.5b332226@oasis.local.home/
      Link: https://lkml.kernel.org/r/52012d25ea35491a0f8088b947864d8df8e25157.1624372313.git.bristot@redhat.com
      
      
      
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Kate Carcia <kcarcia@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
      Cc: Clark Willaims <williams@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • trace/hwlat: Protect kdata->kthread with get/put_online_cpus · 039a602d
      Daniel Bristot de Oliveira authored
      In preparation for hotplug support, protect kdata->kthread
      with get/put_online_cpus() to avoid concurrency with hotplug
      operations.
      
      Link: https://lore.kernel.org/linux-doc/20210621134636.5b332226@oasis.local.home/
      Link: https://lkml.kernel.org/r/8bdb2a56f46abfd301d6fffbf43448380c09a6f5.1624372313.git.bristot@redhat.com
      
      
      
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Kate Carcia <kcarcia@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
      Cc: Clark Willaims <williams@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • trace: Add timerlat tracer · a955d7ea
      Daniel Bristot de Oliveira authored
      The timerlat tracer aims to help preemptive kernel developers
      find sources of wakeup latencies of real-time threads. Like cyclictest,
      the tracer sets a periodic timer that wakes up a thread. The thread then
      computes a *wakeup latency* value as the difference between the *current
      time* and the *absolute time* that the timer was set to expire. The main
      goal of timerlat is to trace in a way that helps kernel developers.
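
      The latency computation described above can be sketched in user-space C.
      This is only an illustration, not the tracer's code: the helper names
      (ts_to_ns, wakeup_latency_ns, next_expiry) are hypothetical, and the
      timestamps are plain struct timespec values.

```c
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000LL

/* Convert a timespec to nanoseconds. */
int64_t ts_to_ns(struct timespec ts)
{
	return (int64_t)ts.tv_sec * NSEC_PER_SEC + ts.tv_nsec;
}

/*
 * Wakeup latency: the current time minus the absolute time at which
 * the timer was set to expire.
 */
int64_t wakeup_latency_ns(struct timespec now, struct timespec expected)
{
	return ts_to_ns(now) - ts_to_ns(expected);
}

/* Advance an absolute expiration time by one period. */
struct timespec next_expiry(struct timespec expected, int64_t period_ns)
{
	int64_t ns = ts_to_ns(expected) + period_ns;
	struct timespec next = {
		.tv_sec = ns / NSEC_PER_SEC,
		.tv_nsec = ns % NSEC_PER_SEC,
	};

	return next;
}
```

      In a cyclictest-style user-space loop, the "now" value would typically
      be read with clock_gettime() right after waking from an absolute-time
      sleep on the expected expiration, and next_expiry() would then arm the
      following activation.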
      
      Usage
      
      Write the ASCII text "timerlat" into the current_tracer file of the
      tracing system (generally mounted at /sys/kernel/tracing).
      
      For example:
      
              [root@f32 ~]# cd /sys/kernel/tracing/
              [root@f32 tracing]# echo timerlat > current_tracer
      
      It is possible to follow the trace by reading the trace file:
      
        [root@f32 tracing]# cat trace
        # tracer: timerlat
        #
        #                              _-----=> irqs-off
        #                             / _----=> need-resched
        #                            | / _---=> hardirq/softirq
        #                            || / _--=> preempt-depth
        #                            || /
        #                            ||||             ACTIVATION
        #         TASK-PID      CPU# ||||   TIMESTAMP    ID            CONTEXT                LATENCY
        #            | |         |   ||||      |         |                  |                       |
                <idle>-0       [000] d.h1    54.029328: #1     context    irq timer_latency       932 ns
                 <...>-867     [000] ....    54.029339: #1     context thread timer_latency     11700 ns
                <idle>-0       [001] dNh1    54.029346: #1     context    irq timer_latency      2833 ns
                 <...>-868     [001] ....    54.029353: #1     context thread timer_latency      9820 ns
                <idle>-0       [000] d.h1    54.030328: #2     context    irq timer_latency       769 ns
                 <...>-867     [000] ....    54.030330: #2     context thread timer_latency      3070 ns
                <idle>-0       [001] d.h1    54.030344: #2     context    irq timer_latency       935 ns
                 <...>-868     [001] ....    54.030347: #2     context thread timer_latency      4351 ns
      
      The tracer creates a per-cpu kernel thread with real-time priority that
      prints two lines at every activation. The first is the *timer latency*
      observed at the *hardirq* context before the activation of the thread.
      The second is the *timer latency* observed by the thread, which is the
      same level that cyclictest reports. The ACTIVATION ID field
      serves to relate the *irq* execution to its respective *thread* execution.
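
      As a rough illustration of how the ACTIVATION ID relates the two lines,
      the following user-space C sketch parses lines in the format of the
      sample output above and computes how much latency was added between the
      irq and the thread of the same activation. The struct and function names
      are hypothetical, and the sscanf() format assumes task names without
      spaces, as in the sample.

```c
#include <stdio.h>
#include <string.h>

struct timerlat_line {
	int cpu;
	unsigned long long id;          /* ACTIVATION ID */
	int is_thread;                  /* 0 = irq context, 1 = thread context */
	unsigned long long latency_ns;
};

/* Parse one timerlat trace line; returns 0 on success, -1 if it does not match. */
int parse_timerlat_line(const char *line, struct timerlat_line *out)
{
	char ctx[16];

	/* Fields: TASK-PID [CPU] flags TIMESTAMP: #ID context CTX timer_latency N ns */
	if (sscanf(line, "%*s [%d] %*s %*s #%llu context %15s timer_latency %llu ns",
		   &out->cpu, &out->id, ctx, &out->latency_ns) != 4)
		return -1;

	if (strcmp(ctx, "irq") == 0)
		out->is_thread = 0;
	else if (strcmp(ctx, "thread") == 0)
		out->is_thread = 1;
	else
		return -1;

	return 0;
}

/*
 * Latency added between the irq and the thread of one activation,
 * or -1 if the two lines do not belong to the same CPU and activation.
 */
long long thread_minus_irq_ns(const struct timerlat_line *irq,
			      const struct timerlat_line *thread)
{
	if (irq->is_thread || !thread->is_thread ||
	    irq->cpu != thread->cpu || irq->id != thread->id)
		return -1;

	return (long long)thread->latency_ns - (long long)irq->latency_ns;
}
```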
      
      The irq/thread split is important to clarify in which context
      the unexpectedly high value originates. The *irq* context can be
      delayed by hardware-related actions, such as SMIs, NMIs, and IRQs,
      or by a thread masking interrupts. Once the timer fires, the delay
      can also be influenced by blocking caused by threads, for example, by
      postponing the scheduler execution via preempt_disable(), by the
      scheduler execution, or by masking interrupts. Threads can
      also be delayed by interference from other threads and IRQs.
      
      The timerlat tracer can also take advantage of the osnoise: trace events.
      For example:
      
              [root@f32 ~]# cd /sys/kernel/tracing/
              [root@f32 tracing]# echo timerlat > current_tracer
              [root@f32 tracing]# echo osnoise > set_event
              [root@f32 tracing]# echo 25 > osnoise/stop_tracing_total_us
              [root@f32 tracing]# tail -10 trace
                   cc1-87882   [005] d..h...   548.771078: #402268 context    irq timer_latency      1585 ns
                   cc1-87882   [005] dNLh1..   548.771082: irq_noise: local_timer:236 start 548.771077442 duration 4597 ns
                   cc1-87882   [005] dNLh2..   548.771083: irq_noise: reschedule:253 start 548.771083017 duration 56 ns
                   cc1-87882   [005] dNLh2..   548.771086: irq_noise: call_function_single:251 start 548.771083811 duration 2048 ns
                   cc1-87882   [005] dNLh2..   548.771088: irq_noise: call_function_single:251 start 548.771086814 duration 1495 ns
                   cc1-87882   [005] dNLh2..   548.771091: irq_noise: call_function_single:251 start 548.771089194 duration 1558 ns
                   cc1-87882   [005] dNLh2..   548.771094: irq_noise: call_function_single:251 start 548.771091719 duration 1932 ns
                   cc1-87882   [005] dNLh2..   548.771096: irq_noise: call_function_single:251 start 548.771094696 duration 1050 ns
                   cc1-87882   [005] d...3..   548.771101: thread_noise:      cc1:87882 start 548.771078243 duration 10909 ns
            timerlat/5-1035    [005] .......   548.771103: #402268 context thread timer_latency     25960 ns
      
      For further information see: Documentation/trace/timerlat-tracer.rst
      
      Link: https://lkml.kernel.org/r/71f18efc013e1194bcaea1e54db957de2b19ba62.1624372313.git.bristot@redhat.com
      
      
      
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Kate Carcia <kcarcia@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
      Cc: Clark Willaims <williams@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • trace: Add osnoise tracer · bce29ac9
      Daniel Bristot de Oliveira authored
      In the context of high-performance computing (HPC), the Operating System
      Noise (*osnoise*) refers to the interference experienced by an application
      due to activities inside the operating system. In the context of Linux,
      NMIs, IRQs, SoftIRQs, and any other system thread can cause noise to the
      system. Moreover, hardware-related jobs can also cause noise, for example,
      via SMIs.
      
      The osnoise tracer leverages the hwlat_detector by running a similar
      loop with preemption, SoftIRQs and IRQs enabled, thus allowing all
      the sources of *osnoise* during its execution. Using the same approach
      of hwlat, osnoise takes note of the entry and exit point of any
      source of interferences, increasing a per-cpu interference counter. The
      osnoise tracer also saves an interference counter for each source of
      interference. The interference counter for NMI, IRQs, SoftIRQs, and
      threads is increased anytime the tool observes these interferences' entry
      events. When a noise happens without any interference from the operating
      system level, the hardware noise counter increases, pointing to a
      hardware-related noise. In this way, osnoise can account for any
      source of interference. At the end of the period, the osnoise tracer
      prints the sum of all noise, the max single noise, the percentage of CPU
      available for the thread, and the counters for the noise sources.
      
      Usage
      
      Write the ASCII text "osnoise" into the current_tracer file of the
      tracing system (generally mounted at /sys/kernel/tracing).
      
      For example::
      
              [root@f32 ~]# cd /sys/kernel/tracing/
              [root@f32 tracing]# echo osnoise > current_tracer
      
      It is possible to follow the trace by reading the trace file::
      
              [root@f32 tracing]# cat trace
              # tracer: osnoise
              #
              #                                _-----=> irqs-off
              #                               / _----=> need-resched
              #                              | / _---=> hardirq/softirq
              #                              || / _--=> preempt-depth                            MAX
              #                              || /                                             SINGLE     Interference counters:
              #                              ||||               RUNTIME      NOISE   % OF CPU  NOISE    +-----------------------------+
              #           TASK-PID      CPU# ||||   TIMESTAMP    IN US       IN US  AVAILABLE  IN US     HW    NMI    IRQ   SIRQ THREAD
              #              | |         |   ||||      |           |             |    |            |      |      |      |      |      |
                         <...>-859     [000] ....    81.637220: 1000000        190  99.98100       9     18      0   1007     18      1
                         <...>-860     [001] ....    81.638154: 1000000        656  99.93440      74     23      0   1006     16      3
                         <...>-861     [002] ....    81.638193: 1000000       5675  99.43250     202      6      0   1013     25     21
                         <...>-862     [003] ....    81.638242: 1000000        125  99.98750      45      1      0   1011     23      0
                         <...>-863     [004] ....    81.638260: 1000000       1721  99.82790     168      7      0   1002     49     41
                         <...>-864     [005] ....    81.638286: 1000000        263  99.97370      57      6      0   1006     26      2
                         <...>-865     [006] ....    81.638302: 1000000        109  99.98910      21      3      0   1006     18      1
                         <...>-866     [007] ....    81.638326: 1000000       7816  99.21840     107      8      0   1016     39     19
      
      In addition to the regular trace fields (from TASK-PID to TIMESTAMP), the
      tracer prints a message at the end of each period for each CPU that is
      running an osnoise/CPU thread. The osnoise specific fields report:
      
       - The RUNTIME IN US reports the amount of time in microseconds that
         the osnoise thread kept looping reading the time.
       - The NOISE IN US reports the sum of noise in microseconds observed
         by the osnoise tracer during the associated runtime.
       - The % OF CPU AVAILABLE reports the percentage of CPU available for
         the osnoise thread during the runtime window.
       - The MAX SINGLE NOISE IN US reports the maximum single noise observed
         during the runtime window.
       - The Interference counters display how many times each of the
         respective interferences happened during the runtime window.
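
      The relation between the RUNTIME, NOISE, and % OF CPU AVAILABLE columns
      can be sketched as below. The formula is inferred from the sample rows
      above (for instance, 190 us of noise in a 1000000 us runtime gives
      99.98100), not taken from the tracer source, and the function name is
      hypothetical.

```c
#include <stdint.h>

/*
 * Percentage of CPU left for the sampling thread during the runtime
 * window: (runtime - noise) / runtime * 100.
 */
double cpu_available_pct(uint64_t runtime_us, uint64_t noise_us)
{
	/* Guard against a zero runtime or inconsistent inputs. */
	if (runtime_us == 0 || noise_us > runtime_us)
		return 0.0;

	return 100.0 * (double)(runtime_us - noise_us) / (double)runtime_us;
}
```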
      
      Note that the example above shows a high number of HW noise samples.
      The reason is that this sample was taken on a virtual machine,
      and the host interference is detected as hardware interference.
      
      Tracer options
      
      The tracer has a set of options inside the osnoise directory:
      
       - osnoise/cpus: CPUs on which an osnoise thread will execute.
       - osnoise/period_us: the period of the osnoise thread.
       - osnoise/runtime_us: how long an osnoise thread will look for noise.
       - osnoise/stop_tracing_us: stop the system tracing if a single noise
         higher than the configured value happens. Writing 0 disables this
         option.
       - osnoise/stop_tracing_total_us: stop the system tracing if total noise
         higher than the configured value happens. Writing 0 disables this
         option.
       - tracing_threshold: the minimum delta between two time() reads to be
         considered as noise, in us. When set to 0, the default value will
         be used, which is currently 5 us.
      
      Additional Tracing
      
      In addition to the tracer, a set of tracepoints were added to
      facilitate the identification of the osnoise source.
      
       - osnoise:sample_threshold: printed anytime a noise is higher than
         the configurable tolerance_ns.
       - osnoise:nmi_noise: noise from NMI, including the duration.
       - osnoise:irq_noise: noise from an IRQ, including the duration.
       - osnoise:softirq_noise: noise from a SoftIRQ, including the
         duration.
       - osnoise:thread_noise: noise from a thread, including the duration.
      
      Note that all the values are *net values*. For example, if another
      thread preempts the osnoise thread while it is running, a thread_noise
      duration starts at that point. If an IRQ then takes place, preempting
      that thread, an irq_noise duration starts. When the IRQ ends its
      execution, it computes its duration, and this duration is subtracted
      from the thread_noise, so as to avoid double accounting of the
      IRQ execution. This logic is valid for all sources of noise.
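
      One way to picture the net-value accounting is a small stack of
      interferences in progress, where the time of a nested interference is
      charged to the one it preempted. This is a user-space sketch under that
      assumption; the types and function names are hypothetical and the
      tracer itself may implement the subtraction differently.

```c
#include <stdint.h>

#define MAX_NESTING 8

/* One interference in progress (thread, IRQ, ...). */
struct noise_frame {
	uint64_t start_ns;   /* when this interference began */
	uint64_t nested_ns;  /* time taken by interferences that preempted it */
};

struct noise_stack {
	struct noise_frame frames[MAX_NESTING];
	int depth;
};

/* Record the entry of an interference; returns 0, or -1 if nesting is too deep. */
int noise_enter(struct noise_stack *s, uint64_t now_ns)
{
	if (s->depth >= MAX_NESTING)
		return -1;

	s->frames[s->depth].start_ns = now_ns;
	s->frames[s->depth].nested_ns = 0;
	s->depth++;
	return 0;
}

/*
 * Record the exit of the innermost interference and return its net
 * duration. Its gross duration is charged to the enclosing interference,
 * so the same time is never reported twice.
 */
uint64_t noise_exit(struct noise_stack *s, uint64_t now_ns)
{
	struct noise_frame *f;
	uint64_t gross, net;

	if (s->depth == 0)
		return 0;

	f = &s->frames[--s->depth];
	gross = now_ns - f->start_ns;
	net = gross - f->nested_ns;

	if (s->depth > 0)
		s->frames[s->depth - 1].nested_ns += gross;

	return net;
}
```

      With a thread entering at 1000 ns, an IRQ running from 3000 ns to
      5000 ns, and the thread leaving at 9000 ns, the IRQ reports 2000 ns and
      the thread reports 6000 ns rather than 8000 ns.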
      
      Here is one example of the usage of these tracepoints::
      
             osnoise/8-961     [008] d.h.  5789.857532: irq_noise: local_timer:236 start 5789.857529929 duration 1845 ns
             osnoise/8-961     [008] dNh.  5789.858408: irq_noise: local_timer:236 start 5789.858404871 duration 2848 ns
           migration/8-54      [008] d...  5789.858413: thread_noise: migration/8:54 start 5789.858409300 duration 3068 ns
             osnoise/8-961     [008] ....  5789.858413: sample_threshold: start 5789.858404555 duration 8723 ns interferences 2
      
      In this example, a noise sample of 8 microseconds was reported in the last
      line, pointing to two interferences. Looking backward in the trace, the
      two previous entries were about the migration thread running after a
      timer IRQ execution. The first event is not part of the noise because
      it took place one millisecond before.
      
      It is worth noting that the sum of the durations reported in the
      tracepoints is smaller than the eight us reported in the sample_threshold.
      The reason lies in the overhead of the entry and exit code that runs
      before and after any interference execution. This justifies the dual
      approach: measuring thread and tracing.
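
      The arithmetic behind this observation can be sketched as follows. The
      sketch assumes that an event belongs to a sample when its start time
      falls inside the sample window, which is an interpretation of the
      example rather than the tracer's own rule; the names are hypothetical
      and timestamps are expressed in nanoseconds.

```c
#include <stddef.h>
#include <stdint.h>

struct noise_event {
	uint64_t start_ns;
	uint64_t duration_ns;
};

/*
 * Sum the durations of the events that start inside the sample window
 * [sample_start_ns, sample_start_ns + sample_duration_ns) and count them.
 */
uint64_t traced_noise_in_sample(const struct noise_event *ev, size_t n,
				uint64_t sample_start_ns,
				uint64_t sample_duration_ns,
				unsigned int *interferences)
{
	uint64_t end_ns = sample_start_ns + sample_duration_ns;
	uint64_t sum = 0;

	*interferences = 0;
	for (size_t i = 0; i < n; i++) {
		if (ev[i].start_ns >= sample_start_ns && ev[i].start_ns < end_ns) {
			sum += ev[i].duration_ns;
			(*interferences)++;
		}
	}
	return sum;
}
```

      Applied to the trace above, the two events inside the window add up to
      2848 + 3068 = 5916 ns, which leaves 8723 - 5916 = 2807 ns attributable
      to entry and exit overhead.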
      
      Link: https://lkml.kernel.org/r/e649467042d60e7b62714c9c6751a56299d15119.1624372313.git.bristot@redhat.com
      
      
      
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Kate Carcia <kcarcia@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
      Cc: Clark Willaims <williams@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      [
        Made the following functions static:
         trace_irqentry_callback()
         trace_irqexit_callback()
         trace_intel_irqentry_callback()
         trace_intel_irqexit_callback()
      
        Added to include/trace.h:
         osnoise_arch_register()
         osnoise_arch_unregister()
      
        Fixed define logic for LATENCY_FS_NOTIFY
      
      Reported-by: kernel test robot <lkp@intel.com>
      ]
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • tracing: Add LATENCY_FS_NOTIFY to define if latency_fsnotify() is defined · 6880c987
      Steven Rostedt (VMware) authored
      
      
      With the coming addition of the osnoise tracer, the configs needed to
      include latency_fsnotify() have become more complex. To keep the
      declaration in the header file the same as in the C file, have the
      logic needed to define it in one place, which defines LATENCY_FS_NOTIFY
      for use in the C code.
      
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • trace/hwlat: Remove printk from sampling loop · aa892f8c
      Daniel Bristot de Oliveira authored
      hwlat has some time operation checks in the sample loop, and it
      currently uses pr_err (printk) to report them. The problem is that
      this can lead the system to an unresponsive state due to an overflow of
      printk messages. This problem can be mitigated by writing the error
      message to the trace buffer.
      
      Remove the printk messages from the sampling loop, switching them to
      messages in the trace buffer.
      
      No functional change.
      
      Link: https://lkml.kernel.org/r/9d77c34869748aa105e965c769d24642914eea3a.1624372313.git.bristot@redhat.com
      
      
      
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Kate Carcia <kcarcia@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
      Cc: Clark Willaims <williams@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • trace/hwlat: Use trace_min_max_param for width and window params · f27a1c9e
      Daniel Bristot de Oliveira authored
      Use the trace_min_max_param to reduce code duplication.
      
      No functional change.
      
      Link: https://lkml.kernel.org/r/b91accd5a7c6c14ea02d3379aae974ba22b47dd6.1624372313.git.bristot@redhat.com
      
      
      
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Kate Carcia <kcarcia@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
      Cc: Clark Willaims <williams@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • trace: Add a generic function to read/write u64 values from tracefs · bc87cf0a
      Daniel Bristot de Oliveira authored
      The hwlat detector and (in preparation for) the osnoise/timerlat tracers
      have a set of u64 parameters that the user can read/write via tracefs.
      For instance, we have hwlat_detector's window and width.
      
      To reduce the code duplication, hwlat's window and width share the same
      read function. However, they do not share the write functions because
      they do different parameter checks. For instance, the width needs to
      be smaller than the window, while the window needs to be larger
      than the width. The same pattern repeats on osnoise/timerlat, and
      a large portion of the code was devoted to the write function.
      
      Despite having different checks, the write functions have the same
      structure:
      
         read a user-space buffer
         take the lock that protects the value
         check for minimum and maximum acceptable values
            save the value
         release the lock
         return success or error
      
      To reduce the code duplication also in the write functions, this patch
      provides a generic read and write implementation for u64 values that
      need to be within some minimum and/or maximum parameters, while
      (potentially) being protected by a lock.
      
      To use this interface, the structure trace_min_max_param needs to be
      filled:
      
       struct trace_min_max_param {
               struct mutex    *lock;
               u64             *val;
               u64             *min;
               u64             *max;
       };
      
      The desired value is stored in the variable pointed to by *val. If *min
      points to a minimum acceptable value, it will be checked during the
      write operation. Likewise, if *max points to a maximum allowable value,
      it will be checked during the write operation. Finally, if *lock points
      to a mutex, it will be taken at the beginning of the operation and
      released at the end.
      
      The definition of a trace_min_max_param needs to be passed as the
      (private) *data for tracefs_create_file(), and the trace_min_max_fops
      (added by this patch) as the *fops file_operations.
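
      The write path described above can be modeled in user space as follows.
      This is a sketch, not the kernel implementation: demo_mutex is a
      single-threaded stand-in for struct mutex, demo_min_max_param mirrors
      the structure above with u64 spelled as uint64_t, and
      demo_min_max_write() is a hypothetical name.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for the kernel's struct mutex (assumption: single-threaded demo). */
struct demo_mutex { int locked; };
void demo_mutex_lock(struct demo_mutex *m)   { m->locked = 1; }
void demo_mutex_unlock(struct demo_mutex *m) { m->locked = 0; }

/* Same fields as trace_min_max_param, using user-space types. */
struct demo_min_max_param {
	struct demo_mutex *lock;
	uint64_t *val;
	uint64_t *min;
	uint64_t *max;
};

/* Parse buf as an unsigned decimal and store it if it passes the checks. */
int demo_min_max_write(struct demo_min_max_param *p, const char *buf)
{
	char *end;
	unsigned long long v;
	int err = 0;

	/* Read the user-supplied buffer. */
	errno = 0;
	v = strtoull(buf, &end, 10);
	if (errno || end == buf)
		return -EINVAL;

	/* Take the lock that protects the value, if there is one. */
	if (p->lock)
		demo_mutex_lock(p->lock);

	/* Check the optional minimum and maximum acceptable values. */
	if (p->min && v < *p->min)
		err = -EINVAL;
	if (p->max && v > *p->max)
		err = -EINVAL;

	/* Save the value only when every check passed. */
	if (!err)
		*p->val = v;

	if (p->lock)
		demo_mutex_unlock(p->lock);

	return err;
}
```

      For hwlat's width, for example, val would point to the width and max to
      the window, so that a width larger than the window is rejected while
      the stored value stays unchanged.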
      
      Link: https://lkml.kernel.org/r/3e35760a7c8b5c55f16ae5ad5fc54a0e71cbe647.1624372313.git.bristot@redhat.com
      
      
      
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Kate Carcia <kcarcia@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
      Cc: Clark Willaims <williams@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • trace/hwlat: Implement the per-cpu mode · f46b1652
      Daniel Bristot de Oliveira authored
      
      
      Implement the per-cpu mode, in which a sampling thread is created for
      each CPU in the "cpus" (and tracing_mask).
      
      The per-cpu mode has the potential to speed up the hwlat detection by
      running on multiple CPUs at the same time, at the cost of higher CPU
      usage with irqs disabled. Use with care.
      
      [
        Changed get_cpu_data() to static.
      Reported-by: kernel test robot <lkp@intel.com>
      ]
      
      Link: https://lkml.kernel.org/r/ec06d0ab340e8460d293772faba19ad8a5c371aa.1624372313.git.bristot@redhat.com
      
      
      
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Kate Carcia <kcarcia@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
      Cc: Clark Willaims <williams@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
  5. Jun 24, 2021
    • trace/hwlat: Switch disable_migrate to mode none · 7bb7d802
      Daniel Bristot de Oliveira authored
      When in the round-robin mode, if the tracer detects a change in the
      hwlatd thread affinity by an external tool, e.g., taskset, the
      round-robin logic is disabled. The disable_migrate variable currently
      tracks this.
      
      With the addition of the "mode" config and the mode "none," the
      disable_migrate logic is equivalent to switching to the "none" mode.
      
      Hence, instead of using a hidden variable to track this behavior,
      switch the mode to none, informing the user about this change.
      
      Link: https://lkml.kernel.org/r/a679af672458d6b1f62252605905c5214030f247.1624372313.git.bristot@redhat.com
      
      
      
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Kate Carcia <kcarcia@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
      Cc: Clark Willaims <williams@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • trace/hwlat: Implement the mode config option · 8fa826b7
      Daniel Bristot de Oliveira authored
      Provides the "mode" config to the hardware latency detector. hwlatd has
      two different operation modes. The default mode is the "round-robin" one,
      in which a single hwlatd thread runs, migrating among the allowed CPUs in a
      "round-robin" fashion. This is the current behavior.
      
      The "none" mode sets the allowed cpumask for a single hwlatd thread at
      startup, but skips the round-robin, letting the scheduler handle the
      migration.
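
      The "round-robin" selection can be pictured with the sketch below,
      which picks the next allowed CPU and wraps around at the end of the
      mask. It is an illustration only: it models the cpumask as a 64-bit
      integer (bit i set means CPU i is allowed) and the function name is
      hypothetical.

```c
#include <stdint.h>

/*
 * Return the next CPU after `current` that is set in `allowed`,
 * wrapping around to the lowest allowed CPU. `current` is expected
 * to be in the range 0..63. Returns -1 if the mask is empty.
 */
int next_allowed_cpu(uint64_t allowed, int current)
{
	for (int step = 1; step <= 64; step++) {
		int cpu = (current + step) % 64;

		if (allowed & (UINT64_C(1) << cpu))
			return cpu;
	}
	return -1;
}
```

      With CPUs 1, 2, and 4 allowed, a thread on CPU 1 would move to CPU 2,
      then to CPU 4, and then back to CPU 1. In the "none" mode no such step
      is taken and the scheduler decides where the thread runs.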
      
      In preparation for the per-cpu mode.
      
      Link: https://lkml.kernel.org/r/f3b1271262aa030c680e26615c1b9b2d71e55e92.1624372313.git.bristot@redhat.com
      
      
      
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Kate Carcia <kcarcia@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
      Cc: Clark Williams <williams@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      8fa826b7
    • trace/hwlat: Fix Clark's email · bb1b24cf
      Daniel Bristot de Oliveira authored
      Clark's email is williams@redhat.com.
      
      No functional change.
      
      Link: https://lkml.kernel.org/r/6fa4b49e17ab8a1ff19c335ab7cde38d8afb0e29.1624372313.git.bristot@redhat.com
      
      
      
      Cc: Phil Auld <pauld@redhat.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Kate Carcia <kcarcia@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alexandre Chartre <alexandre.chartre@oracle.com>
      Cc: Clark Williams <williams@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      bb1b24cf
  6. Jun 17, 2021
  7. Jun 10, 2021
  8. Jun 08, 2021
    • tracing: Correct the length check which causes memory corruption · 3e08a9f9
      Liangyan authored
      We've suffered severe kernel crashes due to memory corruption in
      our production environment, for example:
      
      Call Trace:
      [1640542.554277] general protection fault: 0000 [#1] SMP PTI
      [1640542.554856] CPU: 17 PID: 26996 Comm: python Kdump: loaded Tainted:G
      [1640542.556629] RIP: 0010:kmem_cache_alloc+0x90/0x190
      [1640542.559074] RSP: 0018:ffffb16faa597df8 EFLAGS: 00010286
      [1640542.559587] RAX: 0000000000000000 RBX: 0000000000400200 RCX:
      0000000006e931bf
      [1640542.560323] RDX: 0000000006e931be RSI: 0000000000400200 RDI:
      ffff9a45ff004300
      [1640542.560996] RBP: 0000000000400200 R08: 0000000000023420 R09:
      0000000000000000
      [1640542.561670] R10: 0000000000000000 R11: 0000000000000000 R12:
      ffffffff9a20608d
      [1640542.562366] R13: ffff9a45ff004300 R14: ffff9a45ff004300 R15:
      696c662f65636976
      [1640542.563128] FS:  00007f45d7c6f740(0000) GS:ffff9a45ff840000(0000)
      knlGS:0000000000000000
      [1640542.563937] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [1640542.564557] CR2: 00007f45d71311a0 CR3: 000000189d63e004 CR4:
      00000000003606e0
      [1640542.565279] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
      0000000000000000
      [1640542.566069] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
      0000000000000400
      [1640542.566742] Call Trace:
      [1640542.567009]  anon_vma_clone+0x5d/0x170
      [1640542.567417]  __split_vma+0x91/0x1a0
      [1640542.567777]  do_munmap+0x2c6/0x320
      [1640542.568128]  vm_munmap+0x54/0x70
      [1640542.569990]  __x64_sys_munmap+0x22/0x30
      [1640542.572005]  do_syscall_64+0x5b/0x1b0
      [1640542.573724]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [1640542.575642] RIP: 0033:0x7f45d6e61e27
      
      James Wang has reliably reproduced it on the latest 4.19 LTS.
      After some debugging, we finally proved that it is caused by an ftrace
      buffer out-of-bounds access, using a debug tool as follows:
      [   86.775200] BUG: Out-of-bounds write at addr 0xffff88aefe8b7000
      [   86.780806]  no_context+0xdf/0x3c0
      [   86.784327]  __do_page_fault+0x252/0x470
      [   86.788367]  do_page_fault+0x32/0x140
      [   86.792145]  page_fault+0x1e/0x30
      [   86.795576]  strncpy_from_unsafe+0x66/0xb0
      [   86.799789]  fetch_memory_string+0x25/0x40
      [   86.804002]  fetch_deref_string+0x51/0x60
      [   86.808134]  kprobe_trace_func+0x32d/0x3a0
      [   86.812347]  kprobe_dispatcher+0x45/0x50
      [   86.816385]  kprobe_ftrace_handler+0x90/0xf0
      [   86.820779]  ftrace_ops_assist_func+0xa1/0x140
      [   86.825340]  0xffffffffc00750bf
      [   86.828603]  do_sys_open+0x5/0x1f0
      [   86.832124]  do_syscall_64+0x5b/0x1b0
      [   86.835900]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Commit b220c049 ("tracing: Check length before giving out
      the filter buffer") added a length check to guard against the trace data
      overflow introduced in 0fc1b09f. However, that fix does not prevent the
      overflow entirely: the length check must also take sizeof
      entry->array[0] into account, because array[0] is filled with the
      length of the trace data and occupies additional space, so ignoring
      it still risks an overflow.
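
The difference between the two checks can be sketched in user space as follows. The names here (`struct sketch_event`, `SKETCH_PAGE_SIZE`, `fits_old()`, `fits_new()`) are hypothetical and only mirror the shape of the check described above; they are not the kernel's definitions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for this sketch */

/* Hypothetical stand-in for an event header followed by its payload. */
struct sketch_event {
	uint32_t header;
	uint32_t array[];	/* array[0] holds the length of the data */
};

/* Earlier check: only the header is subtracted from the page size. */
bool fits_old(size_t len)
{
	return len < SKETCH_PAGE_SIZE - sizeof(struct sketch_event);
}

/* Corrected check: the slot used by array[0] is reserved as well. */
bool fits_new(size_t len)
{
	return len < SKETCH_PAGE_SIZE - sizeof(struct sketch_event)
		     - sizeof(uint32_t);
}
```

With these assumed sizes, a length of 4090 bytes passes the earlier check but is rejected by the corrected one, which is the window in which the length word could push the data past the end of the page.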
      
      Link: https://lkml.kernel.org/r/20210607125734.1770447-1-liangyan.peng@linux.alibaba.com
      
      
      
      Cc: stable@vger.kernel.org
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Xunlei Pang <xlpang@linux.alibaba.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Fixes: b220c049 ("tracing: Check length before giving out the filter buffer")
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Reviewed-by: yinbinbin <yinbinbin@alibabacloud.com>
      Reviewed-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
      Tested-by: James Wang <jnwang@linux.alibaba.com>
      Signed-off-by: Liangyan <liangyan.peng@linux.alibaba.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      3e08a9f9
    • ftrace: Do not blindly read the ip address in ftrace_bug() · 6c14133d
      Steven Rostedt (VMware) authored
      It was reported that a bug on arm64 caused a bad ip address to be used for
      updating into a nop in ftrace_init(), but the error path (rightfully)
      returned -EINVAL and not -EFAULT, as the bug caused more than one error to
      occur. But because -EINVAL was returned, the ftrace_bug() tried to report
      what was at the location of the ip address, and read it directly. This
      caused the machine to panic, as the ip was not pointing to a valid memory
      address.
      
      Instead, read the ip address with copy_from_kernel_nofault() to safely
      access the memory, and if it faults, report that the address faulted,
      otherwise report what was in that location.
      
      Link: https://lore.kernel.org/lkml/20210607032329.28671-1-mark-pk.tsai@mediatek.com/
      
      
      
      Cc: stable@vger.kernel.org
      Fixes: 05736a42 ("ftrace: warn on failure to disable mcount callers")
      Reported-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
      Tested-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      6c14133d
  9. Jun 02, 2021
    • bpf, lockdown, audit: Fix buggy SELinux lockdown permission checks · ff40e510
      Daniel Borkmann authored
      Commit 59438b46 ("security,lockdown,selinux: implement SELinux lockdown")
      added an implementation of the locked_down LSM hook to SELinux, with the aim
      to restrict which domains are allowed to perform operations that would breach
      lockdown. This is indirectly also getting audit subsystem involved to report
      events. The latter is problematic, as reported by Ondrej and Serhei, since it
      can bring down the whole system via audit:
      
        1) The audit events that are triggered due to calls to security_locked_down()
           can OOM kill a machine, see below details [0].
      
        2) It also seems to be causing a deadlock via avc_has_perm()/slow_avc_audit()
           when trying to wake up kauditd, for example, when using trace_sched_switch()
           tracepoint, see details in [1]. Triggering this was not via some hypothetical
           corner case, but with existing tools like runqlat & runqslower from bcc, for
           example, which make use of this tracepoint. Rough call sequence goes like:
      
           rq_lock(rq) -> -------------------------+
             trace_sched_switch() ->               |
               bpf_prog_xyz() ->                   +-> deadlock
                 selinux_lockdown() ->             |
                   audit_log_end() ->              |
                     wake_up_interruptible() ->    |
                       try_to_wake_up() ->         |
                         rq_lock(rq) --------------+
      
      What's worse is that the intention of 59438b46 to further restrict lockdown
      settings for specific applications in respect to the global lockdown policy is
      completely broken for BPF. The SELinux policy rule for the current lockdown check
      looks something like this:
      
        allow <who> <who> : lockdown { <reason> };
      
      However, this doesn't match with the 'current' task where the security_locked_down()
      is executed, example: httpd does a syscall. There is a tracing program attached
      to the syscall which triggers a BPF program to run, which ends up doing a
      bpf_probe_read_kernel{,_str}() helper call. The selinux_lockdown() hook does
      the permission check against 'current', that is, httpd in this example. httpd
      has literally zero relation to this tracing program, and it would be nonsensical
      having to write an SELinux policy rule against httpd to let the tracing helper
      pass. The policy in this case needs to be against the entity that is installing
      the BPF program. For example, if bpftrace would generate a histogram of syscall
      counts by user space application:
      
        bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
      
      bpftrace would then go and generate a BPF program from this internally. One way
      of doing it [for the sake of the example] could be to call bpf_get_current_task()
      helper and then access current->comm via one of bpf_probe_read_kernel{,_str}()
      helpers. So the program itself has nothing to do with httpd or any other random
      app doing a syscall here. The BPF program _explicitly initiated_ the lockdown
      check. The allow/deny policy belongs in the context of bpftrace: meaning, you
      want to grant bpftrace access to use these helpers, but other tracers on the
      system like my_random_tracer _not_.
      
      Therefore fix all three issues at the same time by taking a completely different
      approach for the security_locked_down() hook, that is, move the check into the
      program verification phase where we actually retrieve the BPF func proto. This
      also reliably gets the task (current) that is trying to install the BPF tracing
      program, e.g. bpftrace/bcc/perf/systemtap/etc, and it also fixes the OOM since
      we're moving this out of the BPF helper's fast-path which can be called several
      millions of times per second.
      
      The check is then also in line with other security_locked_down() hooks in the
      system where the enforcement is performed at open/load time, for example,
      open_kcore() for /proc/kcore access or module_sig_check() for module signatures
      just to pick a few random ones. What's out of scope in the fix as well as in
      other security_locked_down() hook locations /outside/ of BPF subsystem is that
      if the lockdown policy changes on the fly there is no retrospective action.
      This requires a different discussion, potentially complex infrastructure, and
      it's also not clear whether this can be solved generically. Either way, it is
      out of scope for a suitable stable fix which this one is targeting. Note that
      the breakage is specifically on 59438b46 where it started to rely on 'current'
      as UAPI behavior, and _not_ earlier infrastructure such as 9d1f8be5 ("bpf:
      Restrict bpf when kernel lockdown is in confidentiality mode").
      
      [0] https://bugzilla.redhat.com/show_bug.cgi?id=1955585, Jakub Hrozek says:
      
        I starting seeing this with F-34. When I run a container that is traced with
        BPF to record the syscalls it is doing, auditd is flooded with messages like:
      
        type=AVC msg=audit(1619784520.593:282387): avc:  denied  { confidentiality }
          for pid=476 comm="auditd" lockdown_reason="use of bpf to read kernel RAM"
            scontext=system_u:system_r:auditd_t:s0 tcontext=system_u:system_r:auditd_t:s0
              tclass=lockdown permissive=0
      
        This seems to be leading to auditd running out of space in the backlog buffer
        and eventually OOMs the machine.
      
        [...]
        auditd running at 99% CPU presumably processing all the messages, eventually I get:
        Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded
        Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded
        Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152579 > audit_backlog_limit=64
        Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152626 > audit_backlog_limit=64
        Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152694 > audit_backlog_limit=64
        Apr 30 12:20:42 fedora kernel: audit: audit_lost=6878426 audit_rate_limit=0 audit_backlog_limit=64
        Apr 30 12:20:45 fedora kernel: oci-seccomp-bpf invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-1000
        Apr 30 12:20:45 fedora kernel: CPU: 0 PID: 13284 Comm: oci-seccomp-bpf Not tainted 5.11.12-300.fc34.x86_64 #1
        Apr 30 12:20:45 fedora kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
        [...]
      
      [1] https://lore.kernel.org/linux-audit/CANYvDQN7H5tVp47fbYcRasv4XF07eUbsDwT_eDCHXJUj43J7jQ@mail.gmail.com/,
          Serhei Makarov says:
      
        Upstream kernel 5.11.0-rc7 and later was found to deadlock during a
        bpf_probe_read_compat() call within a sched_switch tracepoint. The problem
        is reproducible with the reg_alloc3 testcase from SystemTap's BPF backend
        testsuite on x86_64 as well as the runqlat, runqslower tools from bcc on
        ppc64le. Example stack trace:
      
        [...]
        [  730.868702] stack backtrace:
        [  730.869590] CPU: 1 PID: 701 Comm: in:imjournal Not tainted, 5.12.0-0.rc2.20210309git144c79ef3353.166.fc35.x86_64 #1
        [  730.871605] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        [  730.873278] Call Trace:
        [  730.873770]  dump_stack+0x7f/0xa1
        [  730.874433]  check_noncircular+0xdf/0x100
        [  730.875232]  __lock_acquire+0x1202/0x1e10
        [  730.876031]  ? __lock_acquire+0xfc0/0x1e10
        [  730.876844]  lock_acquire+0xc2/0x3a0
        [  730.877551]  ? __wake_up_common_lock+0x52/0x90
        [  730.878434]  ? lock_acquire+0xc2/0x3a0
        [  730.879186]  ? lock_is_held_type+0xa7/0x120
        [  730.880044]  ? skb_queue_tail+0x1b/0x50
        [  730.880800]  _raw_spin_lock_irqsave+0x4d/0x90
        [  730.881656]  ? __wake_up_common_lock+0x52/0x90
        [  730.882532]  __wake_up_common_lock+0x52/0x90
        [  730.883375]  audit_log_end+0x5b/0x100
        [  730.884104]  slow_avc_audit+0x69/0x90
        [  730.884836]  avc_has_perm+0x8b/0xb0
        [  730.885532]  selinux_lockdown+0xa5/0xd0
        [  730.886297]  security_locked_down+0x20/0x40
        [  730.887133]  bpf_probe_read_compat+0x66/0xd0
        [  730.887983]  bpf_prog_250599c5469ac7b5+0x10f/0x820
        [  730.888917]  trace_call_bpf+0xe9/0x240
        [  730.889672]  perf_trace_run_bpf_submit+0x4d/0xc0
        [  730.890579]  perf_trace_sched_switch+0x142/0x180
        [  730.891485]  ? __schedule+0x6d8/0xb20
        [  730.892209]  __schedule+0x6d8/0xb20
        [  730.892899]  schedule+0x5b/0xc0
        [  730.893522]  exit_to_user_mode_prepare+0x11d/0x240
        [  730.894457]  syscall_exit_to_user_mode+0x27/0x70
        [  730.895361]  entry_SYSCALL_64_after_hwframe+0x44/0xae
        [...]
      
      Fixes: 59438b46 ("security,lockdown,selinux: implement SELinux lockdown")
      Reported-by: Ondrej Mosnacek <omosnace@redhat.com>
      Reported-by: Jakub Hrozek <jhrozek@redhat.com>
      Reported-by: Serhei Makarov <smakarov@redhat.com>
      Reported-by: Jiri Olsa <jolsa@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Tested-by: Jiri Olsa <jolsa@redhat.com>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: James Morris <jamorris@linux.microsoft.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Frank Eigler <fche@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: https://lore.kernel.org/bpf/01135120-8bf7-df2e-cff0-1d73f1f841c3@iogearbox.net
      ff40e510
  10. May 13, 2021
  11. May 05, 2021
    • ftrace: Handle commands when closing set_ftrace_filter file · 8c9af478
      Steven Rostedt (VMware) authored
      
      
       # echo switch_mm:traceoff > /sys/kernel/tracing/set_ftrace_filter
      
      will cause switch_mm to stop tracing by the traceoff command.
      
       # echo -n switch_mm:traceoff > /sys/kernel/tracing/set_ftrace_filter
      
      does nothing.
      
      The reason is that the parsing in the write function only processes
      commands if it finished parsing (there is white space written after the
      command). That's to handle:
      
       write(fd, "switch_mm:", 10);
       write(fd, "traceoff", 8);
      
      cases, where the command is broken over multiple writes.
      
      The problem is if the file descriptor is closed, then the write call is
      not processed, and the command needs to be processed in the release code.
      The release code can handle matching of functions, but does not handle
      commands.
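
The behavior described above can be modeled with a small user-space sketch. Everything here is hypothetical (`struct sketch_parser`, `sketch_write()`, `sketch_release()`); it only illustrates a parser that keeps a pending token across writes and must also process it, including commands, when the file is closed.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

struct sketch_parser {
	char buf[64];		/* pending token, not yet terminated */
	size_t len;
	int commands_run;	/* tokens of the form "func:command" */
	int filters_set;	/* plain function names */
};

/* Handle one complete token: a command if it contains ':', else a filter. */
static void process_token(struct sketch_parser *p)
{
	if (p->len == 0)
		return;
	p->buf[p->len] = '\0';
	if (strchr(p->buf, ':'))
		p->commands_run++;
	else
		p->filters_set++;
	p->len = 0;
}

/* Write path: a token is only complete once white space follows it. */
void sketch_write(struct sketch_parser *p, const char *data, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (isspace((unsigned char)data[i]))
			process_token(p);
		else if (p->len < sizeof(p->buf) - 1)
			p->buf[p->len++] = data[i];
	}
}

/* Release path: whatever is still pending must be processed here too. */
void sketch_release(struct sketch_parser *p)
{
	process_token(p);
}
```

In this model, writing "switch_mm:" and then "traceoff" without a trailing newline leaves the token pending, and only the release path gets a chance to run the command.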
      
      Cc: stable@vger.kernel.org
      Fixes: eda1e328 ("tracing: handle broken names in ftrace filter")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      8c9af478
  12. Apr 30, 2021
    • tracing: Restructure trace_clock_global() to never block · aafe104a
      Steven Rostedt (VMware) authored
      It was reported that a fix to the ring buffer recursion detection would
      cause a hung machine when performing suspend / resume testing. The
      following backtrace was extracted from debugging that case:
      
      Call Trace:
       trace_clock_global+0x91/0xa0
       __rb_reserve_next+0x237/0x460
       ring_buffer_lock_reserve+0x12a/0x3f0
       trace_buffer_lock_reserve+0x10/0x50
       __trace_graph_return+0x1f/0x80
       trace_graph_return+0xb7/0xf0
       ? trace_clock_global+0x91/0xa0
       ftrace_return_to_handler+0x8b/0xf0
       ? pv_hash+0xa0/0xa0
       return_to_handler+0x15/0x30
       ? ftrace_graph_caller+0xa0/0xa0
       ? trace_clock_global+0x91/0xa0
       ? __rb_reserve_next+0x237/0x460
       ? ring_buffer_lock_reserve+0x12a/0x3f0
       ? trace_event_buffer_lock_reserve+0x3c/0x120
       ? trace_event_buffer_reserve+0x6b/0xc0
       ? trace_event_raw_event_device_pm_callback_start+0x125/0x2d0
       ? dpm_run_callback+0x3b/0xc0
       ? pm_ops_is_empty+0x50/0x50
       ? platform_get_irq_byname_optional+0x90/0x90
       ? trace_device_pm_callback_start+0x82/0xd0
       ? dpm_run_callback+0x49/0xc0
      
      With the following RIP:
      
      RIP: 0010:native_queued_spin_lock_slowpath+0x69/0x200
      
      Since the fix to the recursion detection would allow a single recursion to
      happen while tracing, this led to trace_clock_global() taking a spin
      lock and then trying to take it again:
      
      ring_buffer_lock_reserve() {
        trace_clock_global() {
          arch_spin_lock() {
            queued_spin_lock_slowpath() {
              /* lock taken */
              (something else gets traced by function graph tracer)
                ring_buffer_lock_reserve() {
                  trace_clock_global() {
                    arch_spin_lock() {
                      queued_spin_lock_slowpath() {
                      /* DEAD LOCK! */
      
      Tracing should *never* block, as it can lead to strange lockups like the
      above.
      
      Restructure the trace_clock_global() code so that, instead of always
      taking a lock to update the recorded "prev_time", it simply uses the
      recorded value. When two events on two different CPUs call this at the
      same time, it really doesn't matter which one goes first. Use a trylock
      to grab the lock for updating prev_time, and if that fails, simply try
      again the next time. If the lock could not be taken, something else is
      already updating it.
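
The trylock approach can be sketched in user space with C11 atomics. This is an illustration under assumptions, not the kernel function: `sketch_clock_global()` is a hypothetical name, the `now` parameter stands in for a per-CPU clock read, and an `atomic_flag` plays the role of the spin lock.

```c
#include <stdatomic.h>
#include <stdint.h>

static atomic_flag sketch_lock = ATOMIC_FLAG_INIT;
static _Atomic uint64_t sketch_prev_time;

/*
 * Return a timestamp that is never older than the last published one.
 * The lock is only tried once, so this function cannot spin or block.
 */
uint64_t sketch_clock_global(uint64_t now)
{
	uint64_t prev = atomic_load(&sketch_prev_time);

	/* Use the recorded value as a floor, without taking the lock. */
	if ((int64_t)(now - prev) < 0)
		now = prev;

	/* Try once; on failure someone else is already updating it. */
	if (!atomic_flag_test_and_set_explicit(&sketch_lock,
					       memory_order_acquire)) {
		/* Re-read in case it changed before the lock was taken. */
		prev = atomic_load(&sketch_prev_time);
		if ((int64_t)(now - prev) < 0)
			now = prev;
		atomic_store(&sketch_prev_time, now);
		atomic_flag_clear_explicit(&sketch_lock,
					   memory_order_release);
	}
	return now;
}
```

Because a failed attempt just skips the update, a recursive call made while the lock is held returns immediately instead of deadlocking.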
      
      Link: https://lkml.kernel.org/r/20210430121758.650b6e8a@gandalf.local.home
      
      
      
      Cc: stable@vger.kernel.org
      Tested-by: Konstantin Kharlamov <hi-angel@yandex.ru>
      Tested-by: Todd Brandt <todd.e.brandt@linux.intel.com>
      Fixes: b02414c8 ("ring-buffer: Fix recursion protection transitions between interrupt context") # started showing the problem
      Fixes: 14131f2f ("tracing: implement trace_clock_*() APIs") # where the bug happened
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212761
      
      
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      aafe104a
  13. Apr 28, 2021
    • tracing: Map all PIDs to command lines · 785e3c0a
      Steven Rostedt (VMware) authored
      The default max PID is set by PID_MAX_DEFAULT, and the tracing
      infrastructure uses this number to map PIDs to the comm names of the
      tasks, so that the output of the trace can show names for the PIDs
      recorded in the ring buffer. This mapping is also exported to user space
      via the "saved_cmdlines" file in the tracefs directory.

      But currently the mapping expects the PIDs to be less than
      PID_MAX_DEFAULT, which is the default maximum and not the real maximum.
      Recently, systemd started increasing the maximum value of a PID on the
      system, and when a traced task has a PID higher than PID_MAX_DEFAULT, its
      comm is not recorded. This leads to the entire trace having "<...>" as
      the comm name, which is pretty useless.
      
      Instead, keep the mapping array at the size of PID_MAX_DEFAULT, but rather
      than using the PID directly as the index to the comm, use the PID masked
      with (PID_MAX_DEFAULT - 1) as the index, and find the full PID from the
      map_cmdline_to_pid array (that already exists).
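
A user-space sketch of the masked lookup follows. All names (`sketch_save_comm()`, `sketch_find_comm()`, the `SKETCH_*` constants and the table layout) are hypothetical and much simpler than the tracing code; the point is only that the masked PID selects a slot and the stored full PID confirms the match.

```c
#include <string.h>

#define SKETCH_PID_MAX_DEFAULT	0x8000	/* assumed default PID limit */
#define SKETCH_SLOTS		128	/* number of saved comms */
#define SKETCH_COMM_LEN		16

static unsigned int pid_to_slot[SKETCH_PID_MAX_DEFAULT]; /* slot + 1, 0 = none */
static int slot_to_pid[SKETCH_SLOTS];
static char saved_comm[SKETCH_SLOTS][SKETCH_COMM_LEN];
static unsigned int next_slot;

/* Record `comm` for `pid`, indexing by the masked PID. */
void sketch_save_comm(int pid, const char *comm)
{
	unsigned int idx = (unsigned int)pid & (SKETCH_PID_MAX_DEFAULT - 1);
	unsigned int slot = pid_to_slot[idx];

	if (slot == 0) {
		slot = next_slot + 1;
		next_slot = (next_slot + 1) % SKETCH_SLOTS;
		pid_to_slot[idx] = slot;
	}
	slot_to_pid[slot - 1] = pid;	/* remember the full PID */
	strncpy(saved_comm[slot - 1], comm, SKETCH_COMM_LEN - 1);
	saved_comm[slot - 1][SKETCH_COMM_LEN - 1] = '\0';
}

/* Look up the comm; the full PID stored in the slot must match. */
const char *sketch_find_comm(int pid)
{
	unsigned int idx = (unsigned int)pid & (SKETCH_PID_MAX_DEFAULT - 1);
	unsigned int slot = pid_to_slot[idx];

	if (slot == 0 || slot_to_pid[slot - 1] != pid)
		return "<...>";
	return saved_comm[slot - 1];
}
```

Two PIDs that share the same masked value compete for one slot in this sketch, so the most recently saved one wins and the other reads back as "<...>".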
      
      This bug goes back to the beginning of ftrace, but hasn't been an issue
      until user space started increasing the maximum value of PIDs.
      
      Link: https://lkml.kernel.org/r/20210427113207.3c601884@gandalf.local.home
      
      
      
      Cc: stable@vger.kernel.org
      Fixes: bc0c38d1 ("ftrace: latency tracer infrastructure")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      785e3c0a
  14. Apr 27, 2021
    • bpf: Implement formatted output helpers with bstr_printf · 48cac3f4
      Florent Revest authored
      
      
      BPF has three formatted output helpers: bpf_trace_printk, bpf_seq_printf
      and bpf_snprintf. Their signatures specify that all arguments are
      provided from the BPF world as u64s (in an array or as registers). All
      of these helpers are currently implemented by calling functions such as
      snprintf() whose signatures take a variable number of arguments, then
      placed in a va_list by the compiler to call vsnprintf().
      
      "d9c9e4db bpf: Factorize bpf_trace_printk and bpf_seq_printf" introduced
      a bpf_printf_prepare function that fills an array of u64 sanitized
      arguments with an array of "modifiers" which indicate what the "real"
      size of each argument should be (given by the format specifier). The
      BPF_CAST_FMT_ARG macro consumes these arrays and casts each argument to
      its real size. However, the C promotion rules implicitly cast them all
      back to u64s. Therefore, the arguments given to snprintf are u64s and
      the va_list constructed by the compiler will use 64 bits for each
      argument. On 64 bit machines, this happens to work well because 32 bit
      arguments in va_lists need to occupy 64 bits anyway, but on 32 bit
      architectures this breaks the layout of the va_list expected by the
      called function and mangles values.
      
      In "88a5c690 bpf: fix bpf_trace_printk on 32 bit archs", this problem
      had been solved for bpf_trace_printk only with a "horrid workaround"
      that emitted multiple calls to trace_printk where each call had
      different argument types and generated different va_list layouts. One of
      the calls would be dynamically chosen at runtime. This was ok with the 3
      arguments that bpf_trace_printk takes but bpf_seq_printf and
      bpf_snprintf accept up to 12 arguments. Because this approach scales
      code exponentially, it is not a viable option anymore.
      
      Because the promotion rules are part of the language and because the
      construction of a va_list is an arch-specific ABI, it's best to just
      avoid variadic arguments and va_lists altogether. Thankfully the
      kernel's snprintf() has an alternative in the form of bstr_printf() that
      accepts arguments in a "binary buffer representation". These binary
      buffers are currently created by vbin_printf and used in the tracing
      subsystem to split the cost of printing into two parts: a fast one that
      only dereferences and remembers values, and a slower one, called later,
      that does the pretty-printing.
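
The idea of a binary argument buffer can be illustrated with a user-space sketch. This is not the format used by vbin_printf() or bstr_printf(): `sketch_pack_arg()` and `sketch_unpack_arg()` are hypothetical helpers, only 4- and 8-byte arguments are handled, and no alignment padding is inserted. It only shows each argument being stored at its real size instead of being widened to 64 bits.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Append `size` bytes (4 or 8) of `arg` to `buf` at offset `*off`.
 * Returns 0 on success, -1 on an unsupported size or a full buffer.
 */
int sketch_pack_arg(unsigned char *buf, size_t cap, size_t *off,
		    uint64_t arg, size_t size)
{
	uint32_t narrow = (uint32_t)arg;

	if ((size != 4 && size != 8) || size > cap || *off > cap - size)
		return -1;
	if (size == 4)
		memcpy(buf + *off, &narrow, 4);
	else
		memcpy(buf + *off, &arg, 8);
	*off += size;
	return 0;
}

/*
 * Read back one argument of `size` bytes (4, otherwise treated as 8).
 * The caller must pass the same sizes, in order, that were packed.
 */
uint64_t sketch_unpack_arg(const unsigned char *buf, size_t *off, size_t size)
{
	uint32_t narrow = 0;
	uint64_t wide = 0;

	if (size == 4) {
		memcpy(&narrow, buf + *off, 4);
		wide = narrow;
	} else {
		memcpy(&wide, buf + *off, 8);
	}
	*off += size;
	return wide;
}
```

Since the consumer walks the buffer using the sizes implied by the format string, the layout no longer depends on how a given architecture builds a va_list.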
      
      This patch refactors bpf_printf_prepare to construct binary buffers of
      arguments consumable by bstr_printf() instead of arrays of arguments and
      modifiers. This gets rid of BPF_CAST_FMT_ARG and greatly simplifies the
      bpf_printf_prepare usage but there are a few gotchas that change how
      bpf_printf_prepare needs to do things.
      
      Currently, bpf_printf_prepare uses a per cpu temporary buffer as a
      generic storage for strings and IP addresses. With this refactoring, the
      temporary buffer now holds all the arguments in a structured binary
      format.
      
      To comply with the format expected by bstr_printf, certain format
      specifiers also need to be pre-formatted: %pB and %pi6/%pi4/%pI4/%pI6.
      Because vsnprintf subroutines for these specifiers are hard to expose,
      we pre-format these arguments with calls to snprintf().
      
      Reported-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Signed-off-by: Florent Revest <revest@chromium.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210427174313.860948-3-revest@chromium.org
      48cac3f4
    • bpf: Lock bpf_trace_printk's tmp buf before it is written to · 38d26d89
      Florent Revest authored
      
      
      bpf_trace_printk uses a shared static buffer to hold strings before they
      are printed. A recent refactoring moved the locking of that buffer after
      it gets filled by mistake.
      
      Fixes: d9c9e4db ("bpf: Factorize bpf_trace_printk and bpf_seq_printf")
      Reported-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Signed-off-by: Florent Revest <revest@chromium.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210427112958.773132-1-revest@chromium.org
      38d26d89