  1. Apr 02, 2015
    • watchdog: Add watchdog enable/disable all functions · b3738d29
      Stephane Eranian authored
      
      
      This patch adds two new functions to enable/disable
      the watchdog across all CPUs.
      
      This will be used by the HT PMU bug workaround code to
      disable/enable the NMI watchdog across quirk enablement.
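
      A minimal sketch (not part of the patch) of how the workaround might
      bracket its quirk with the new pair; the function names follow the
      patch title and are an assumption here:

      	/* names assumed from the description above */
      	void watchdog_nmi_disable_all(void);
      	void watchdog_nmi_enable_all(void);

      	static void apply_ht_bug_workaround(void)
      	{
      		watchdog_nmi_disable_all();	/* quiesce the NMI watchdog on all CPUs */
      		/* ... reprogram the affected PMU state ... */
      		watchdog_nmi_enable_all();	/* restore watchdog operation */
      	}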
      
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: bp@alien8.de
      Cc: jolsa@redhat.com
      Cc: kan.liang@intel.com
      Cc: maria.n.dimakopoulou@gmail.com
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1416251225-17721-12-git-send-email-eranian@google.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf: Add ITRACE_START record to indicate that tracing has started · ec0d7729
      Alexander Shishkin authored
      
      
      For counters that generate AUX data that is bound to the context of a
      running task, such as instruction tracing, the decoder needs to know
      exactly which task is running when the event is first scheduled in,
      before the first sched_switch. The decoder needs this because
      instruction flow trace decoding will almost always require the
      program's object code to reconstruct said flow, and for that we need
      at least its pid/tid in the perf stream.
      
      To single out such instruction tracing pmus, this patch introduces an
      ITRACE PMU capability. The reason this is not part of the RECORD_AUX
      record is that not all pmus capable of generating AUX data need it,
      and the opposite is *probably* also true.
      
      While sched_switch covers most cases, there are two problems with it:
      the consumer will need to process events out of order (that is, having
      found a RECORD_AUX, it will have to skip forward to the nearest
      sched_switch to figure out which task it was, then go back to the
      actual trace to decode it), and it completely misses the case when
      tracing is enabled and disabled before sched_switch, for example, via
      PERF_EVENT_IOC_DISABLE.
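
      A sketch of the record as a consumer would see it, assuming the layout
      is a perf_event_header followed by the pid/tid described above:

      	struct itrace_start_event {
      		struct perf_event_header header; /* PERF_RECORD_ITRACE_START */
      		__u32 pid, tid;	/* the task owning the upcoming AUX data */
      	};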
      
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kaixu Xia <kaixu.xia@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: adrian.hunter@intel.com
      Cc: kan.liang@intel.com
      Cc: markus.t.metzger@intel.com
      Cc: mathieu.poirier@linaro.org
      Link: http://lkml.kernel.org/r/1421237903-181015-15-git-send-email-alexander.shishkin@linux.intel.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf: Add wakeup watermark control to the AUX area · 1a594131
      Alexander Shishkin authored
      
      
      When the AUX area gets a certain amount of new data, we want to wake
      up userspace to collect it. This adds a new control to specify how
      much data will cause a wakeup. The value is passed down to pmu drivers
      via the output handle's "wakeup" field, so that the driver can find
      the nearest point where it can generate an interrupt.
      
      We repurpose __reserved_2 in the event attribute for this, even though
      it was never checked to be zero before. aux_watermark will only matter
      for new AUX-aware code, so old code should still be fine.
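
      A hedged userspace sketch: asking for a wakeup per 512 KiB of new AUX
      data through the repurposed field (pmu_type stands in for an
      AUX-capable pmu's id and is illustrative):

      	struct perf_event_attr attr;

      	memset(&attr, 0, sizeof(attr));
      	attr.size = sizeof(attr);
      	attr.type = pmu_type;			/* an AUX-capable pmu */
      	attr.aux_watermark = 512 * 1024;	/* wake consumer per 512 KiB */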
      
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kaixu Xia <kaixu.xia@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: adrian.hunter@intel.com
      Cc: kan.liang@intel.com
      Cc: markus.t.metzger@intel.com
      Cc: mathieu.poirier@linaro.org
      Link: http://lkml.kernel.org/r/1421237903-181015-10-git-send-email-alexander.shishkin@linux.intel.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf: Support overwrite mode for the AUX area · 2023a0d2
      Alexander Shishkin authored
      
      
      This adds support for overwrite mode in the AUX area, which means
      "keep collecting data till you're stopped", turning the AUX area into
      a circular buffer where new data overwrites old data. This mode does
      not depend on the data buffer's overwrite mode, so that sideband data
      that is instrumental for processing AUX data is not lost.
      
      Overwrite mode is enabled by mapping the AUX area read-only. Even
      though aux_tail in the buffer's user page might be user writable, it
      will be ignored in this mode.
      
      A PERF_RECORD_AUX with PERF_AUX_FLAG_OVERWRITE set is written to the
      perf data stream every time an event writes new data to the AUX area.
      The pmu driver might not be able to infer the exact beginning of the
      new data in each snapshot; some drivers will only provide the tail,
      which is aux_offset + aux_size in the AUX record. The consumer has to
      be able to tell the new data from the old, for example, by means of
      time stamps, if such are provided in the trace.
      
      The consumer is also responsible for disabling any events that might
      write to the AUX area (and thus potentially race with the consumer)
      before collecting the data.
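
      A hedged userspace sketch: overwrite mode is selected simply by
      leaving PROT_WRITE out of the AUX mapping (fd is the event's file
      descriptor, aux_offset/aux_size the values written to the user page):

      	void *aux = mmap(NULL, aux_size, PROT_READ, MAP_SHARED,
      			 fd, aux_offset);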
      
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kaixu Xia <kaixu.xia@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: adrian.hunter@intel.com
      Cc: kan.liang@intel.com
      Cc: markus.t.metzger@intel.com
      Cc: mathieu.poirier@linaro.org
      Link: http://lkml.kernel.org/r/1421237903-181015-9-git-send-email-alexander.shishkin@linux.intel.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf: Add API for PMUs to write to the AUX area · fdc26706
      Alexander Shishkin authored
      
      
      For pmus that wish to write data to the ring buffer's AUX area,
      provide perf_aux_output_{begin,end}() calls to initiate/commit data
      writes, similarly to perf_output_{begin,end}(). These use the same
      output handle structure and, similarly to their software counterparts,
      they direct inherited events' output to their parents' ring buffers.
      
      After perf_aux_output_begin() returns successfully, handle->size is
      set to the maximum amount of data that can be written with respect to
      the aux_tail pointer, so that no data the user hasn't seen yet will be
      overwritten; therefore it should always be called before hardware
      writing is enabled. On success, it returns a pointer to the pmu
      driver's private structure allocated for this AUX area by
      pmu::setup_aux. The same pointer can also be retrieved using
      perf_get_aux() while hardware writing is enabled.
      
      The PMU driver should pass the actual amount of data written as a
      parameter to perf_aux_output_end(). All hardware writes should be
      completed and visible before it is called.
      
      Additionally, perf_aux_output_skip() will adjust the output handle and
      aux_head in case some part of the buffer has to be skipped over to
      maintain the hardware's alignment constraints.
      
      Nested writers are forbidden and guards are in place to catch such
      attempts.
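
      A schematic driver-side sketch of one AUX transaction; the argument
      lists are abbreviated, see the real perf_aux_output_{begin,end}
      prototypes:

      	static struct perf_output_handle handle; /* per-cpu in a real driver */

      	static void pmu_trace_start(struct perf_event *event)
      	{
      		void *priv = perf_aux_output_begin(&handle, event);

      		if (!priv)
      			return;	/* no room: consumer hasn't advanced aux_tail */
      		/* let hardware write at most handle->size bytes into the
      		 * area described by priv (the pmu::setup_aux structure) */
      	}

      	static void pmu_trace_stop(unsigned long written)
      	{
      		/* hardware writes must be complete and visible by now */
      		perf_aux_output_end(&handle, written /* , ... */);
      	}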
      
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kaixu Xia <kaixu.xia@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: adrian.hunter@intel.com
      Cc: kan.liang@intel.com
      Cc: markus.t.metzger@intel.com
      Cc: mathieu.poirier@linaro.org
      Link: http://lkml.kernel.org/r/1421237903-181015-8-git-send-email-alexander.shishkin@linux.intel.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf: Add AUX record · 68db7e98
      Alexander Shishkin authored
      
      
      When there's new data in the AUX space, output a record indicating its
      offset and size, plus a set of flags such as PERF_AUX_FLAG_TRUNCATED,
      which means the described data was truncated to fit in the ring
      buffer.
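
      A sketch of the resulting record, assuming the fields appear in the
      order described (offset, then size, then flags):

      	struct aux_event {
      		struct perf_event_header header; /* PERF_RECORD_AUX */
      		__u64 aux_offset;	/* where the new data begins */
      		__u64 aux_size;		/* how much of it there is */
      		__u64 flags;		/* e.g. PERF_AUX_FLAG_TRUNCATED */
      	};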
      
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kaixu Xia <kaixu.xia@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: adrian.hunter@intel.com
      Cc: kan.liang@intel.com
      Cc: markus.t.metzger@intel.com
      Cc: mathieu.poirier@linaro.org
      Link: http://lkml.kernel.org/r/1421237903-181015-7-git-send-email-alexander.shishkin@linux.intel.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf: Add a pmu capability for "exclusive" events · bed5b25a
      Alexander Shishkin authored
      
      
      Usually, pmus that do, for example, instruction tracing, would only ever
      be able to have one event per task per cpu (or per perf_event_context). For
      such pmus it makes sense to disallow creating conflicting events early on,
      so as to provide consistent behavior for the user.
      
      This patch adds a pmu capability that indicates such constraint on event
      creation.
      
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kaixu Xia <kaixu.xia@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: adrian.hunter@intel.com
      Cc: kan.liang@intel.com
      Cc: markus.t.metzger@intel.com
      Cc: mathieu.poirier@linaro.org
      Link: http://lkml.kernel.org/r/1422613866-113186-1-git-send-email-alexander.shishkin@linux.intel.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf: Add a capability for AUX_NO_SG pmus to do software double buffering · 6a279230
      Alexander Shishkin authored
      
      
      For pmus that don't support scatter-gather for AUX data in hardware,
      it might still make sense to implement software double buffering to
      avoid losing data while the user is reading it out. For this purpose,
      add a pmu capability that guarantees multiple high-order chunks for
      the AUX buffer, so that the pmu driver can do switchover tricks.
      
      To make use of this feature, add PERF_PMU_CAP_AUX_SW_DOUBLEBUF to your
      pmu's capability mask. This will make the ring buffer AUX allocation code
      ensure that the biggest high order allocation for the aux buffer pages is
      no bigger than half of the total requested buffer size, thus making sure
      that the buffer has at least two high order allocations.
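
      A hedged sketch of a driver opting in, combining this flag with the
      no-scatter-gather capability from the companion patch:

      	static struct pmu my_pmu = {
      		/* ...callbacks, including setup_aux()... */
      		.capabilities = PERF_PMU_CAP_AUX_NO_SG |
      				PERF_PMU_CAP_AUX_SW_DOUBLEBUF,
      	};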
      
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kaixu Xia <kaixu.xia@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: adrian.hunter@intel.com
      Cc: kan.liang@intel.com
      Cc: markus.t.metzger@intel.com
      Cc: mathieu.poirier@linaro.org
      Link: http://lkml.kernel.org/r/1421237903-181015-5-git-send-email-alexander.shishkin@linux.intel.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf: Support high-order allocations for AUX space · 0a4e38e6
      Alexander Shishkin authored
      
      
      Some pmus (such as BTS or Intel PT without multiple-entry ToPA capability)
      don't support scatter-gather and will prefer larger contiguous areas for
      their output regions.
      
      This patch adds a new pmu capability to request higher order allocations.
      
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kaixu Xia <kaixu.xia@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: adrian.hunter@intel.com
      Cc: kan.liang@intel.com
      Cc: markus.t.metzger@intel.com
      Cc: mathieu.poirier@linaro.org
      Link: http://lkml.kernel.org/r/1421237903-181015-4-git-send-email-alexander.shishkin@linux.intel.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf: Add AUX area to ring buffer for raw data streams · 45bfb2e5
      Peter Zijlstra authored
      
      
      This patch introduces "AUX space" in the perf mmap buffer, intended for
      exporting high bandwidth data streams to userspace, such as instruction
      flow traces.
      
      AUX space is a ring buffer, defined by the aux_{offset,size} fields in
      the user_page structure, and the read/write pointers aux_{head,tail},
      which abide by the same rules as the data_* counterparts of the main
      perf buffer.
      
      In order to allocate/mmap AUX, userspace needs to set aux_offset to an
      offset greater than data_offset + data_size, and aux_size to the
      desired buffer size; both need to be page aligned. The same aux_offset
      and aux_size should then be passed to the mmap() call, and if
      everything adds up, you should have an AUX buffer as a result.
      
      Pages that are mapped into this buffer also count against the user's
      mlock rlimit plus the perf_event_mlock_kb allowance.
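
      A hedged userspace sketch of that two-stage setup (fd is the event's
      file descriptor; error handling elided): map the user page plus data
      area, publish aux_offset/aux_size, then mmap() the AUX area:

      	long ps = sysconf(_SC_PAGESIZE);
      	size_t data_len = (1 + 64) * ps;	/* user page + 64 data pages */
      	struct perf_event_mmap_page *pc;
      	void *aux;

      	pc = mmap(NULL, data_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      	pc->aux_offset = data_len;		/* > data_offset + data_size */
      	pc->aux_size   = 1024 * ps;		/* page aligned by construction */
      	aux = mmap(NULL, pc->aux_size, PROT_READ | PROT_WRITE, MAP_SHARED,
      		   fd, pc->aux_offset);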
      
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kaixu Xia <kaixu.xia@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: adrian.hunter@intel.com
      Cc: kan.liang@intel.com
      Cc: markus.t.metzger@intel.com
      Cc: mathieu.poirier@linaro.org
      Link: http://lkml.kernel.org/r/1421237903-181015-3-git-send-email-alexander.shishkin@linux.intel.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf: Add data_{offset,size} to user_page · e8c6deac
      Alexander Shishkin authored
      
      
      Currently, the actual perf ring buffer is one page into the mmap area,
      following the user page, and userspace follows this convention. This
      patch adds data_{offset,size} fields to user_page that userspace can
      use instead to locate perf data in the mmap area. This is also helpful
      when mapping existing or shared buffers whose size is not known in
      advance.
      
      Right now, it is made to follow the existing convention that
      
      	data_offset == PAGE_SIZE and
      	data_offset + data_size == mmap_size.
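
      A sketch of a consumer honouring the new fields instead of hard-coding
      the one-page offset:

      	struct perf_event_mmap_page *pc = base;	/* the user page is page 0 */
      	void  *data    = (char *)base + pc->data_offset; /* today: PAGE_SIZE */
      	size_t datalen = pc->data_size;	/* today: mmap_size - PAGE_SIZE */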
      
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kaixu Xia <kaixu.xia@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@infradead.org
      Cc: adrian.hunter@intel.com
      Cc: kan.liang@intel.com
      Cc: markus.t.metzger@intel.com
      Cc: mathieu.poirier@linaro.org
      Link: http://lkml.kernel.org/r/1421237903-181015-2-git-send-email-alexander.shishkin@linux.intel.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • bpf: Fix the build on BPF_SYSCALL=y && !CONFIG_TRACING kernels, make it more configurable · e1abf2cc
      Ingo Molnar authored
      
      
      So bpf_tracing.o depends on CONFIG_BPF_SYSCALL - but that's not its only
      dependency, it also depends on the tracing infrastructure and on kprobes,
      without which it will fail to build with:
      
        In file included from kernel/trace/bpf_trace.c:14:0:
        kernel/trace/trace.h: In function ‘trace_test_and_set_recursion’:
        kernel/trace/trace.h:491:28: error: ‘struct task_struct’ has no member named ‘trace_recursion’
          unsigned int val = current->trace_recursion;
        [...]
      
      It took quite some time to trigger this build failure, because right
      now BPF_SYSCALL is very obscure, as it depends on CONFIG_EXPERT. So
      also make BPF_SYSCALL more configurable, not just available under
      CONFIG_EXPERT.
      
      If BPF_SYSCALL, tracing and kprobes are enabled then enable the bpf_tracing
      gateway as well.
      
      We might want to make this an interactive option later on, although
      I'd not complicate it unnecessarily: enabling BPF_SYSCALL is enough of
      an indicator that the user wants BPF support.
      
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • tracing: Allow BPF programs to call bpf_trace_printk() · 9c959c86
      Alexei Starovoitov authored
      
      
      Debugging of BPF programs needs some form of printk from the
      program, so let programs call limited trace_printk() with %d %u
      %x %p modifiers only.
      
      Similar to kernel modules, during program load the verifier checks
      whether the program is calling bpf_trace_printk() and, if so, the
      kernel allocates trace_printk buffers and emits a big 'this is debug
      only' banner.
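
      A hedged sketch in the samples/bpf idiom (the helper is called through
      a pointer cast from its id, and the format string must live on the
      program's stack):

      	static int (*bpf_trace_printk)(const char *fmt, int fmt_size, ...) =
      		(void *) BPF_FUNC_trace_printk;

      	int bpf_prog(struct pt_regs *ctx)
      	{
      		char fmt[] = "hello: %d\n";
      		int v = 42;

      		bpf_trace_printk(fmt, sizeof(fmt), v);
      		return 0;
      	}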
      
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1427312966-8434-6-git-send-email-ast@plumgrid.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • tracing: Allow BPF programs to call bpf_ktime_get_ns() · d9847d31
      Alexei Starovoitov authored
      
      
      bpf_ktime_get_ns() is used by programs to compute the time delta
      between events or as a timestamp.
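
      A hedged sketch in the same idiom as above; the map plumbing for
      stashing the entry timestamp is elided:

      	static __u64 (*bpf_ktime_get_ns)(void) =
      		(void *) BPF_FUNC_ktime_get_ns;

      	int bpf_prog(struct pt_regs *ctx)
      	{
      		__u64 now = bpf_ktime_get_ns();	/* monotonic nanoseconds */

      		/* store 'now' in a map on entry; subtract on return */
      		return 0;
      	}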
      
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1427312966-8434-5-git-send-email-ast@plumgrid.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • tracing, perf: Implement BPF programs attached to kprobes · 2541517c
      Alexei Starovoitov authored
      
      
      BPF programs, attached to kprobes, provide a safe way to execute
      user-defined BPF byte-code programs without being able to crash or
      hang the kernel in any way. The BPF engine makes sure that such
      programs have a finite execution time and that they cannot break
      out of their sandbox.
      
      The user interface is to attach to a kprobe via the perf syscall:
      
      	struct perf_event_attr attr = {
      		.type	= PERF_TYPE_TRACEPOINT,
      		.config	= event_id,
      		...
      	};
      
      	event_fd = perf_event_open(&attr,...);
      	ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);
      
      'prog_fd' is a file descriptor associated with a previously loaded BPF
      program.

      'event_id' is the ID of the kprobe created.
      
      Closing 'event_fd':
      
      	close(event_fd);
      
      ... automatically detaches the BPF program from it.
      
      BPF programs can call in-kernel helper functions to:
      
        - lookup/update/delete elements in maps
      
        - probe_read - wrapper around probe_kernel_read() used to access any
          kernel data structures
      
      BPF programs receive 'struct pt_regs *' as an input ('struct pt_regs'
      is architecture dependent) and return 0 to ignore the event or 1 to
      store the kprobe event into the ring buffer.

      Note that kprobes are fundamentally _not_ a stable kernel ABI, so BPF
      programs attached to kprobes must be recompiled for every kernel
      version, and the user must supply the correct LINUX_VERSION_CODE in
      attr.kern_version during the bpf_prog_load() call.
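
      A sketch of the program side implied by the above; the body is a
      placeholder:

      	int bpf_prog(struct pt_regs *ctx)
      	{
      		/* inspect registers, use map helpers or probe_read() here */
      		return 1;	/* store this kprobe event into the ring buffer */
      	}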
      
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1427312966-8434-4-git-send-email-ast@plumgrid.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • tracing: Add kprobe flag · 72cbbc89
      Alexei Starovoitov authored
      
      
      Add a TRACE_EVENT_FL_KPROBE flag to differentiate the kprobe type of
      tracepoints, since BPF programs can only be attached to the kprobe
      type of PERF_TYPE_TRACEPOINT perf events.
      
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1427312966-8434-3-git-send-email-ast@plumgrid.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. Mar 27, 2015
  3. Mar 25, 2015
    • mm: numa: slow PTE scan rate if migration failures occur · 074c2381
      Mel Gorman authored
      Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226
      
      
      
        Across the board the 4.0-rc1 numbers are much slower, and the degradation
        is far worse when using the large memory footprint configs. Perf points
        straight at the cause - this is from 4.0-rc1 on the "-o bhash=101073" config:
      
         -   56.07%    56.07%  [kernel]            [k] default_send_IPI_mask_sequence_phys
            - default_send_IPI_mask_sequence_phys
               - 99.99% physflat_send_IPI_mask
                  - 99.37% native_send_call_func_ipi
                       smp_call_function_many
                     - native_flush_tlb_others
                        - 99.85% flush_tlb_page
                             ptep_clear_flush
                             try_to_unmap_one
                             rmap_walk
                             try_to_unmap
                             migrate_pages
                             migrate_misplaced_page
                           - handle_mm_fault
                              - 99.73% __do_page_fault
                                   trace_do_page_fault
                                   do_async_page_fault
                                 + async_page_fault
                    0.63% native_send_call_func_single_ipi
                       generic_exec_single
                       smp_call_function_single
      
      This is showing excessive migration activity even though excessive
      migrations are meant to get throttled.  Normally, the scan rate is tuned
      on a per-task basis depending on the locality of faults.  However, if
      migrations fail for any reason then the PTE scanner may scan faster if
      the faults continue to be remote.  This means there is higher system CPU
      overhead and fault trapping at exactly the time we know that migrations
      cannot happen.  This patch tracks when migration failures occur and
      slows the PTE scanner.
      
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reported-by: Dave Chinner <david@fromorbit.com>
      Tested-by: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. Mar 23, 2015
    • perf: Remove type specific target pointers · 50f16a8b
      Peter Zijlstra authored
      
      
      The only reason CQM had to use a hard-coded pmu type was so it could use
      cqm_target in hw_perf_event.
      
      Do away with the {tp,bp,cqm}_target pointers and provide a
      non-type-specific one.
      
      This allows us to do away with that silly pmu type as well.
      
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Vince Weaver <vince@deater.net>
      Cc: acme@kernel.org
      Cc: acme@redhat.com
      Cc: hpa@zytor.com
      Cc: jolsa@redhat.com
      Cc: kanaka.d.juvva@intel.com
      Cc: matt.fleming@intel.com
      Cc: tglx@linutronix.de
      Cc: torvalds@linux-foundation.org
      Cc: vikas.shivappa@linux.intel.com
      Link: http://lkml.kernel.org/r/20150305211019.GU21418@twins.programming.kicks-ass.net
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • timers/tick/broadcast-hrtimer: Fix suspicious RCU usage in idle loop · a127d2bc
      Preeti U Murthy authored
      
      
      The hrtimer mode of broadcast queues hrtimers in the idle entry path
      so as to wake up CPUs in deep idle states. The associated call graph
      is:
      
      	cpuidle_idle_call()
      	|____ clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, ....))
      	     |_____tick_broadcast_set_event()
      		   |____clockevents_program_event()
      			|____bc_set_next()
      
      The hrtimer_{start/cancel} functions call into tracing, which uses
      RCU. But it is not legal to call into RCU in cpuidle, because it is
      one of the quiescent states. Hence protect this region with
      RCU_NONIDLE, which informs RCU that the cpu is momentarily non-idle.
      
      As an aside it is helpful to point out that the clock event device that is
      programmed here is not a per-cpu clock device; it is a
      pseudo clock device, used by the broadcast framework alone.
      The per-cpu clock device programming never goes through bc_set_next().
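
      A hedged sketch of the fix's shape (the timer name is illustrative):

      	static int bc_set_next(ktime_t expires, struct clock_event_device *bc)
      	{
      		/* tracing inside hrtimer_start() may now legally use RCU */
      		RCU_NONIDLE(hrtimer_start(&bctimer, expires,
      					  HRTIMER_MODE_ABS_PINNED));
      		return 0;
      	}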
      
      Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: linuxppc-dev@ozlabs.org
      Cc: mpe@ellerman.id.au
      Cc: tglx@linutronix.de
      Link: http://lkml.kernel.org/r/20150318104705.17763.56668.stgit@preeti.in.ibm.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • lockdep: Fix the module unload key range freeing logic · 35a9393c
      Peter Zijlstra authored
      
      
      Module unload calls lockdep_free_key_range(), which removes entries
      from the data structures. Most of the lockdep code OTOH assumes the
      data structures are append only; in specific see the comments in
      add_lock_to_list() and look_up_lock_class().
      
      Clearly this has only worked by accident; make it work properly. The
      actual scenario to make it go boom would involve the memory freed by
      the module unload being re-allocated and re-used for a lock inside of
      an rcu-sched grace period. This is a very unlikely scenario; still,
      better plug the hole.
      
      Use RCU list iteration in all places and amend the comments.
      
      Change lockdep_free_key_range() to issue a sync_sched() between
      removal from the lists and returning -- which results in the memory
      being freed. Further ensure the callers are placed correctly and
      comment the requirements.
      
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrey Tsyvarev <tsyvarev@ispras.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Fix RLIMIT_RTTIME when PI-boosting to RT · 746db944
      Brian Silverman authored
      
      
      When non-realtime tasks get priority-inheritance boosted to a realtime
      scheduling class, RLIMIT_RTTIME starts to apply to them. However, the
      counter used for checking this (the same one used for SCHED_RR
      timeslices) was not getting reset. This meant that tasks running with a
      non-realtime scheduling class which are repeatedly boosted to a realtime
      one, but never block while they are running realtime, eventually hit the
      timeout without ever running for a time over the limit. This patch
      resets the realtime timeslice counter when un-PI-boosting from an RT to
      a non-RT scheduling class.
      
      I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT
      mutex which induces priority boosting and spins while boosted that gets
      killed by a SIGXCPU on non-fixed kernels but doesn't with this patch
      applied. It happens much faster with a CONFIG_PREEMPT_RT kernel, and
      does happen eventually with PREEMPT_VOLUNTARY kernels.
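
      A hedged sketch of the fix's shape, in the un-boost path (placement
      and names are assumptions, not the literal patch):

      	/* e.g. when rt_mutex_setprio() drops the boost: */
      	if (rt_prio(oldprio) && !rt_prio(prio))
      		p->rt.timeout = 0;	/* restart RR/RLIMIT_RTTIME accounting */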
      
      Signed-off-by: Brian Silverman <brian@peloton-tech.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: austin@peloton-tech.com
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1424305436-6716-1-git-send-email-brian@peloton-tech.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf: Fix irq_work 'tail' recursion · d525211f
      Peter Zijlstra authored
      
      
      Vince reported a watchdog lockup like:
      
      	[<ffffffff8115e114>] perf_tp_event+0xc4/0x210
      	[<ffffffff810b4f8a>] perf_trace_lock+0x12a/0x160
      	[<ffffffff810b7f10>] lock_release+0x130/0x260
      	[<ffffffff816c7474>] _raw_spin_unlock_irqrestore+0x24/0x40
      	[<ffffffff8107bb4d>] do_send_sig_info+0x5d/0x80
      	[<ffffffff811f69df>] send_sigio_to_task+0x12f/0x1a0
      	[<ffffffff811f71ce>] send_sigio+0xae/0x100
      	[<ffffffff811f72b7>] kill_fasync+0x97/0xf0
      	[<ffffffff8115d0b4>] perf_event_wakeup+0xd4/0xf0
      	[<ffffffff8115d103>] perf_pending_event+0x33/0x60
      	[<ffffffff8114e3fc>] irq_work_run_list+0x4c/0x80
      	[<ffffffff8114e448>] irq_work_run+0x18/0x40
      	[<ffffffff810196af>] smp_trace_irq_work_interrupt+0x3f/0xc0
      	[<ffffffff816c99bd>] trace_irq_work_interrupt+0x6d/0x80
      
      This is caused by an irq_work generating new irq_work and therefore
      not allowing forward progress.
      
      This happens because processing the perf irq_work triggers another
      perf event (tracepoint stuff) which in turn generates an irq_work ad
      infinitum.
      
      Avoid this by raising the recursion counter in the irq_work -- which
      effectively disables all software events (including tracepoints) from
      actually triggering again.
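
      A hedged sketch of the fix's shape: raise the recursion context around
      the irq_work body so software events cannot re-trigger from inside it:

      	static void perf_pending_event(struct irq_work *entry)
      	{
      		int rctx = perf_swevent_get_recursion_context();

      		/* ...the existing wakeup/disable work... */

      		if (rctx >= 0)
      			perf_swevent_put_recursion_context(rctx);
      	}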
      
      Reported-by: Vince Weaver <vincent.weaver@maine.edu>
      Tested-by: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20150219170311.GH21418@twins.programming.kicks-ass.net
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  5. Mar 17, 2015
    • livepatch: Fix subtle race with coming and going modules · 8cb2c2dc
      Petr Mladek authored
      
      
      There is a notifier that handles live patches for coming and going
      modules. It takes the klp_mutex lock to avoid races with coming and
      going patches, but it does not keep the lock all the time. Therefore
      the following races are possible:
      
        1. The notifier is called sometime in STATE_MODULE_COMING. The module
           is visible via find_module() in this state all the time. This
           means that a new patch can be registered and enabled even before
           the notifier is called. It might create a wrong order of stacked
           patches; see below for an example.

        2. A new patch could still see the module in the GOING state even
           after the notifier has been called. It will try to initialize the
           related object structures, but the module could disappear at any
           time. This would leave a mess in the structures and might even
           cause an invalid memory access.
      
      This patch solves the problem by adding a boolean variable to struct
      module. The value is true after the coming handler and before the
      going handler is called. New patches need to be applied only when the
      value is true, and they need to ignore the module when the value is
      false.
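
      A hedged sketch of where the flag lives; the field name is an
      assumption based on this description:

      	struct module {
      		/* ... */
      		bool klp_alive;	/* true between COMING and GOING handlers */
      	};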
      
      Note that we need to know the state of all modules on the system. The
      races are related to new patches, so we do not know in advance which
      modules will get patched.
      
      Also note that we could not simply ignore going modules. Code from the
      module can be called even in the GOING state, until mod->exit()
      finishes. If we start supporting patches with semantic changes between
      function calls, we need to apply new patches to any still-usable code.
      See below for an example.
      
      Finally, note that the patch solves only the situation when a new
      patch is registered. There are no such problems when a patch is being
      removed. It does not matter who disables the patch first, whether the
      normal disable_patch() or the module notifier; there is nothing to do
      once the patch is disabled.
      
      Alternative solutions:
      ======================
      
      + reject new patches when a patched module is coming or going; this is ugly
      
      + wait with adding the new patch until the module leaves the COMING
        and GOING states; this might be dangerous and complicated; we would
        need to release klp_mutex in the middle of the patch registration to
        avoid a deadlock with the coming and going handlers; also we might
        need a waitqueue for each module, which seems to be even bigger
        overhead than the boolean
      
      + stop modules from entering COMING and GOING states; wait until modules
        leave these states when they are already there; looks complicated; we would
        need to ignore the module that asked to stop the others to avoid a deadlock;
        also it is unclear what to do when two modules asked to stop others and
        both are in COMING state (situation when two new patches are applied)
      
      + always register/enable new patches and fix up the potential mess (registered
        patches order) in klp_module_init(); this is nasty and prone to regressions
        in the future development
      
      + add another MODULE_STATE where the kallsyms are visible but the module is not
        used yet; this looks too complex; the module states are checked on "many"
        locations
      
      Example of patch stacking breakage:
      ===================================
      
      The notifier could _not_ _simply_ ignore already initialized module
      objects. For example, let's have three patches (P1, P2, P3) for
      functions a() and b(), where a() is from vmcore and b() is from a
      module M. Something like:

      	a()	b()
      P1	a1()	b1()
      P2	a2()	b2()
      P3	a3()	b3()
      
      If you load the module M after all patches are registered and enabled,
      the ftrace ops for functions a() and b() have the functions listed in
      this order:

      	ops_a->func_stack -> list(a3,a2,a1)
      	ops_b->func_stack -> list(b3,b2,b1)

      so the pointer to b3() is the first and will be used.
      
      Then you might have the following scenario. Let's start with the state
      where patches P1 and P2 are registered and enabled but the module M is
      not loaded; the ftrace ops for b() do not exist yet. Then we get into
      the following race:
      
      CPU0					CPU1
      
      load_module(M)
      
        complete_formation()
      
        mod->state = MODULE_STATE_COMING;
        mutex_unlock(&module_mutex);
      
      					klp_register_patch(P3);
      					klp_enable_patch(P3);
      
      					# STATE 1
      
        klp_module_notify(M)
          klp_module_notify_coming(P1);
          klp_module_notify_coming(P2);
          klp_module_notify_coming(P3);
      
      					# STATE 2
      
      The ftrace ops for a() and b() then look like:
      
        STATE1:
      
      	ops_a->func_stack -> list(a3,a2,a1);
      	ops_b->func_stack -> list(b3);
      
        STATE2:
      	ops_a->func_stack -> list(a3,a2,a1);
      	ops_b->func_stack -> list(b2,b1,b3);
      
      therefore, b2() is used for the module but a3() is used for vmcore
      because they were the last added.
      
      Example of the race with going modules:
      =======================================
      
      CPU0					CPU1
      
      delete_module()  #SYSCALL
      
         try_stop_module()
           mod->state = MODULE_STATE_GOING;
      
         mutex_unlock(&module_mutex);
      
      					klp_register_patch()
      					klp_enable_patch()
      
      					#save place to switch universe
      
      					b()     # from module that is going
      					  a()   # from core (patched)
      
         mod->exit();
      
      Note that the function b() can be called until we call mod->exit().
      
      If we do not apply a patch against b() because its module is in
      MODULE_STATE_GOING, it will call patched a() with modified semantics
      and things might go wrong.
      
      [jpoimboe@redhat.com: use one boolean instead of two]
      Signed-off-by: Petr Mladek <pmladek@suse.cz>
      Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>