  1. Jan 14, 2017
    • perf/x86/intel: Account interrupts for PEBS errors · 475113d9
      Jiri Olsa authored
      
      
      It's possible to set up PEBS events to get only errors and not
      any data, like on SNB-X (model 45) and IVB-EP (model 62)
      via 2 perf commands running simultaneously:
      
          taskset -c 1 ./perf record -c 4 -e branches:pp -j any -C 10
      
      This leads to a soft lockup, because the error path of
      intel_pmu_drain_pebs_nhm() does not account event->hw.interrupts
      for error PEBS interrupts, so if you're getting ONLY errors
      there is no way to stop the event once it's over the
      max_samples_per_tick limit:
      
        NMI watchdog: BUG: soft lockup - CPU#22 stuck for 22s! [perf_fuzzer:5816]
        ...
        RIP: 0010:[<ffffffff81159232>]  [<ffffffff81159232>] smp_call_function_single+0xe2/0x140
        ...
        Call Trace:
         ? trace_hardirqs_on_caller+0xf5/0x1b0
         ? perf_cgroup_attach+0x70/0x70
         perf_install_in_context+0x199/0x1b0
         ? ctx_resched+0x90/0x90
         SYSC_perf_event_open+0x641/0xf90
         SyS_perf_event_open+0x9/0x10
         do_syscall_64+0x6c/0x1f0
         entry_SYSCALL64_slow_path+0x25/0x25
      
      Add perf_event_account_interrupt() which does the interrupt
      and frequency checks and call it from intel_pmu_drain_pebs_nhm()'s
      error path.
      
      We keep the pending_kill and pending_wakeup logic only in the
      __perf_event_overflow() path, because they make sense only if
      there's any data to deliver.
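      
      Roughly, the shape of the change looks like this (a simplified sketch,
      not the exact diff): the throttling check moves into a helper that the
      PEBS error path can call even when no data is delivered:
      
          /* kernel/events/core.c */
          static int __perf_event_account_interrupt(struct perf_event *event, int throttle)
          {
              struct hw_perf_event *hwc = &event->hw;
              int ret = 0;
              u64 seq = __this_cpu_read(perf_throttled_seq);
          
              if (seq != hwc->interrupts_seq) {
                  hwc->interrupts_seq = seq;
                  hwc->interrupts = 1;
              } else {
                  hwc->interrupts++;
                  if (unlikely(throttle &&
                               hwc->interrupts >= max_samples_per_tick)) {
                      __this_cpu_inc(perf_throttled_count);
                      hwc->interrupts = MAX_INTERRUPTS;
                      perf_log_throttle(event, 0);
                      ret = 1;    /* tell the caller to stop the event */
                  }
              }
              /* frequency adjustment for attr.freq events happens here too */
              return ret;
          }
          
          int perf_event_account_interrupt(struct perf_event *event)
          {
              return __perf_event_account_interrupt(event, 1);
          }
          
          /* arch/x86/events/intel/ds.c, intel_pmu_drain_pebs_nhm() error path */
          if (error[bit]) {
              perf_log_lost_samples(event, error[bit]);
          
              if (perf_event_account_interrupt(event))
                  x86_pmu_stop(event, 0);
          }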
      
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vince@deater.net>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/1482931866-6018-2-git-send-email-jolsa@kernel.org
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      475113d9
    • perf/core: Fix concurrent sys_perf_event_open() vs. 'move_group' race · 321027c1
      Peter Zijlstra authored
      
      
      Di Shen reported a race between two concurrent sys_perf_event_open()
      calls where both try and move the same pre-existing software group
      into a hardware context.
      
      The problem is exactly that described in commit:
      
        f63a8daa ("perf: Fix event->ctx locking")
      
      ... where, while we wait for a ctx->mutex acquisition, the event->ctx
      relation can have changed under us.
      
      That very same commit failed to recognise sys_perf_event_open() as an
      external access vector to the events and thereby didn't apply the
      established locking rules correctly.
      
      So while one sys_perf_event_open() call is stuck waiting on
      mutex_lock_double(), the other (which owns said locks) moves the group
      about. So by the time the former sys_perf_event_open() acquires the
      locks, the context we've acquired is stale (and possibly dead).
      
      Apply the established locking rules as per perf_event_ctx_lock_nested()
      to the mutex_lock_double() for the 'move_group' case. This obviously means
      we need to validate state after we acquire the locks.
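      
      A simplified sketch of that acquire-then-revalidate pattern (not the
      exact diff): take the two ctx mutexes, then re-check that the group
      leader's context is still the one we sampled before blocking, and retry
      if it moved:
      
          again:
              rcu_read_lock();
              gctx = READ_ONCE(group_leader->ctx);
              if (!atomic_inc_not_zero(&gctx->refcount)) {
                  rcu_read_unlock();
                  goto again;
              }
              rcu_read_unlock();
          
              mutex_lock_double(&gctx->mutex, &ctx->mutex);
          
              if (group_leader->ctx != gctx) {
                  /* the group moved while we slept on the mutexes */
                  mutex_unlock(&ctx->mutex);
                  mutex_unlock(&gctx->mutex);
                  put_ctx(gctx);
                  goto again;
              }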
      
      Reported-by: Di Shen (Keen Lab)
      Tested-by: John Dias <joaodias@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Min Chong <mchong@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Fixes: f63a8daa ("perf: Fix event->ctx locking")
      Link: http://lkml.kernel.org/r/20170106131444.GZ3174@twins.programming.kicks-ass.net
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      321027c1
    • perf/core: Fix sys_perf_event_open() vs. hotplug · 63cae12b
      Peter Zijlstra authored
      
      
      There is a problem with installing an event in a task that is 'stuck' on
      an offline CPU.
      
      Blocked tasks are not disassociated from offlined CPUs; after all, a
      blocked task doesn't run and doesn't require a CPU etc. Only on
      wakeup do we amend the situation and place the task on an available
      CPU.
      
      If we hit such a task with perf_install_in_context() we'll loop until
      either that task wakes up or the CPU comes back online; if the task
      waking up depends on the event being installed, we're stuck.
      
      While looking into this issue, I also spotted another problem: if we
      hit a task with perf_install_in_context() that is in the middle of
      being migrated, that is, we observe the old CPU before sending the IPI
      but run the IPI (on the old CPU) while the task is already running on
      the new CPU, things also go sideways.
      
      Rework things to rely on task_curr() -- outside of rq->lock -- which
      is rather tricky. Imagine the following scenario where we're trying to
      install the first event into our task 't':
      
      CPU0            CPU1            CPU2
      
                      (current == t)
      
      t->perf_event_ctxp[] = ctx;
      smp_mb();
      cpu = task_cpu(t);
      
                      switch(t, n);
                                      migrate(t, 2);
                                      switch(p, t);
      
                                      ctx = t->perf_event_ctxp[]; // must not be NULL
      
      smp_function_call(cpu, ..);
      
                      generic_exec_single()
                        func();
                          spin_lock(ctx->lock);
                          if (task_curr(t)) // false
      
                          add_event_to_ctx();
                          spin_unlock(ctx->lock);
      
                                      perf_event_context_sched_in();
                                        spin_lock(ctx->lock);
                                        // sees event
      
      So it's CPU0's store of t->perf_event_ctxp[] that must not go 'missing'.
      Because if CPU2's load of that variable were to observe NULL, it would
      not try to schedule the ctx and we'd have a task running without its
      counter, which would be 'bad'.
      
      As long as we observe !NULL, we'll acquire ctx->lock. If we acquire it
      first and not see the event yet, then CPU0 must observe task_curr()
      and retry. If the install happens first, then we must see the event on
      sched-in and all is well.
      
      I think we can translate the first part (until the 'must not be NULL')
      of the scenario to a litmus test like:
      
        C C-peterz
      
        {
        }
      
        P0(int *x, int *y)
        {
                int r1;
      
                WRITE_ONCE(*x, 1);
                smp_mb();
                r1 = READ_ONCE(*y);
        }
      
        P1(int *y, int *z)
        {
                WRITE_ONCE(*y, 1);
                smp_store_release(z, 1);
        }
      
        P2(int *x, int *z)
        {
                int r1;
                int r2;
      
                r1 = smp_load_acquire(z);
                smp_mb();
                r2 = READ_ONCE(*x);
        }
      
        exists
        (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
      
      Where:
        x is perf_event_ctxp[],
        y is our task's CPU, and
        z is our task being placed on the rq of CPU2.
      
      The P0 smp_mb() is the one added by this patch, ordering the store to
      perf_event_ctxp[] from find_get_context() and the load of task_cpu()
      in task_function_call().
      
      The smp_store_release/smp_load_acquire model the RCpc locking of the
      rq->lock and the smp_mb() of P2 is the context switch switching from
      whatever CPU2 was running to our task 't'.
      
      This litmus test evaluates into:
      
        Test C-peterz Allowed
        States 7
        0:r1=0; 2:r1=0; 2:r2=0;
        0:r1=0; 2:r1=0; 2:r2=1;
        0:r1=0; 2:r1=1; 2:r2=1;
        0:r1=1; 2:r1=0; 2:r2=0;
        0:r1=1; 2:r1=0; 2:r2=1;
        0:r1=1; 2:r1=1; 2:r2=0;
        0:r1=1; 2:r1=1; 2:r2=1;
        No
        Witnesses
        Positive: 0 Negative: 7
        Condition exists (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
        Observation C-peterz Never 0 7
        Hash=e427f41d9146b2a5445101d3e2fcaa34
      
      And the strong and weak model agree.
      
      Reported-by: Mark Rutland <mark.rutland@arm.com>
      Tested-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: jeremy.linton@arm.com
      Link: http://lkml.kernel.org/r/20161209135900.GU3174@twins.programming.kicks-ass.net
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      63cae12b
  2. Jan 12, 2017
  3. Jan 11, 2017
    • nohz: Fix collision between tick and other hrtimers · 24b91e36
      Frederic Weisbecker authored
      
      
      When the tick is stopped and an interrupt occurs afterward, we check on
      that interrupt exit if the next tick needs to be rescheduled. If it
      doesn't need any update, we don't want to do anything.
      
      In order to check if the tick needs an update, we compare it against the
      clockevent device deadline. Now that's a problem because the clockevent
      device is at a lower level than the tick itself if it is implemented
      on top of hrtimer.
      
      Every hrtimer shares this clockevent device. So comparing the next tick
      deadline against the clockevent device deadline is wrong because the
      device may be programmed for another hrtimer whose deadline collides
      with the tick. As a result we may accidentally end up not reprogramming
      the tick.
      
      In a worst case scenario under full dynticks mode, the tick stops firing
      at the 1 Hz rate it is supposed to keep, leaving /proc/stat stalled:
      
            Task in a full dynticks CPU
            ----------------------------
      
            * hrtimer A is queued 2 seconds ahead
            * the tick is stopped, scheduled 1 second ahead
            * tick fires 1 second later
            * on tick exit, nohz schedules the tick 1 second ahead but sees
              the clockevent device is already programmed to that deadline,
              fooled by hrtimer A, the tick isn't rescheduled.
            * hrtimer A is cancelled before its deadline
            * tick never fires again until an interrupt happens...
      
      In order to fix this, store the next tick deadline in the tick_sched
      local structure and reuse that value later to check whether we need to
      reprogram the clock after an interrupt.
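      
      Sketched out (simplified, not the exact diff; ts->next_tick being the
      new field remembering the last programmed deadline):
      
          /* kernel/time/tick-sched.c, when (re)evaluating the stopped tick */
          if (ts->tick_stopped && (expires == ts->next_tick))
              return;    /* the tick itself is already programmed for this */
          
          ts->next_tick = expires;
          hrtimer_start(&ts->sched_timer, expires, HRTIMER_MODE_ABS_PINNED);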
      
      On the other hand, ts->sleep_length still wants to know about the next
      clock event and not just the tick, so we want to improve the related
      comment to avoid confusion.
      
      Reported-by: James Hartsock <hartsjc@redhat.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Link: http://lkml.kernel.org/r/1483539124-5693-1-git-send-email-fweisbec@gmail.com
      
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      24b91e36
    • signal: protect SIGNAL_UNKILLABLE from unintentional clearing. · 2d39b3cd
      Jamie Iles authored
      Since commit 00cd5c37 ("ptrace: permit ptracing of /sbin/init") we
      can now trace init processes.  init is initially protected with
      SIGNAL_UNKILLABLE which will prevent fatal signals such as SIGSTOP, but
      there are a number of paths during tracing where SIGNAL_UNKILLABLE can
      be implicitly cleared.
      
      This can result in init becoming stoppable/killable after tracing.  For
      example, running:
      
        while true; do kill -STOP 1; done &
        strace -p 1
      
      and then stopping strace and the kill loop will result in init being
      left in state TASK_STOPPED.  Sending SIGCONT to init will resume it, but
      init will now respond to future SIGSTOP signals rather than ignoring
      them.
      
      Make sure that when setting SIGNAL_STOP_CONTINUED/SIGNAL_STOP_STOPPED
      we don't clear SIGNAL_UNKILLABLE.
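      
      One way to do that is a small helper that only replaces the stop-related
      bits (a sketch of the idea; SIGNAL_STOP_MASK is assumed to cover the
      stop/continued flags):
      
          /* sketch: update only the stop-related bits, preserving
           * SIGNAL_UNKILLABLE (and anything else) in signal->flags
           */
          #define SIGNAL_STOP_MASK (SIGNAL_CLD_MASK | SIGNAL_STOP_STOPPED | \
                                    SIGNAL_STOP_CONTINUED)
          
          static inline void signal_set_stop_flags(struct signal_struct *sig,
                                                   unsigned int flags)
          {
              WARN_ON(sig->flags & (SIGNAL_GROUP_EXIT | SIGNAL_GROUP_COREDUMP));
              sig->flags = (sig->flags & ~SIGNAL_STOP_MASK) | flags;
          }
          
          /* callers then use signal_set_stop_flags(sig, why | SIGNAL_STOP_CONTINUED);
           * instead of the plain assignment sig->flags = why | SIGNAL_STOP_CONTINUED;
           */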
      
      Link: http://lkml.kernel.org/r/20170104122017.25047-1-jamie.iles@oracle.com
      
      
      Signed-off-by: Jamie Iles <jamie.iles@oracle.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2d39b3cd
    • mm: fix devm_memremap_pages crash, use mem_hotplug_{begin, done} · f931ab47
      Dan Williams authored
      Both arch_add_memory() and arch_remove_memory() expect a single threaded
      context.
      
      For example, arch/x86/mm/init_64.c::kernel_physical_mapping_init() does
      not hold any locks over this check and branch:
      
          if (pgd_val(*pgd)) {
          	pud = (pud_t *)pgd_page_vaddr(*pgd);
          	paddr_last = phys_pud_init(pud, __pa(vaddr),
          				   __pa(vaddr_end),
          				   page_size_mask);
          	continue;
          }
      
          pud = alloc_low_page();
          paddr_last = phys_pud_init(pud, __pa(vaddr), __pa(vaddr_end),
          			   page_size_mask);
      
      The result is that two threads calling devm_memremap_pages()
      simultaneously can end up colliding on pgd initialization.  This leads
      to crash signatures like the following where the loser of the race
      initializes the wrong pgd entry:
      
          BUG: unable to handle kernel paging request at ffff888ebfff0000
          IP: memcpy_erms+0x6/0x10
          PGD 2f8e8fc067 PUD 0 /* <---- Invalid PUD */
          Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
          CPU: 54 PID: 3818 Comm: systemd-udevd Not tainted 4.6.7+ #13
          task: ffff882fac290040 ti: ffff882f887a4000 task.ti: ffff882f887a4000
          RIP: memcpy_erms+0x6/0x10
          [..]
          Call Trace:
            ? pmem_do_bvec+0x205/0x370 [nd_pmem]
            ? blk_queue_enter+0x3a/0x280
            pmem_rw_page+0x38/0x80 [nd_pmem]
            bdev_read_page+0x84/0xb0
      
      Hold the standard memory hotplug mutex over calls to
      arch_{add,remove}_memory().
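      
      In devm_memremap_pages() and its release path that boils down to roughly
      the following (sketch, error handling elided):
      
          mem_hotplug_begin();
          error = arch_add_memory(nid, align_start, align_size, true);
          mem_hotplug_done();
          
          /* ... and on teardown ... */
          
          mem_hotplug_begin();
          arch_remove_memory(align_start, align_size);
          mem_hotplug_done();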
      
      Fixes: 41e94a85 ("add devm_memremap_pages")
      Link: http://lkml.kernel.org/r/148357647831.9498.12606007370121652979.stgit@dwillia2-desk3.amr.corp.intel.com
      
      
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f931ab47
    • bpf: do not use KMALLOC_SHIFT_MAX · 7984c27c
      Michal Hocko authored
      Commit 01b3f521 ("bpf: fix allocation warnings in bpf maps and
      integer overflow") has added checks for the maximum allocatable size.
      It (ab)used KMALLOC_SHIFT_MAX for that purpose.
      
      While this is not incorrect, it is not very clean, because we already
      have KMALLOC_MAX_SIZE for this very reason, so let's change both checks
      to use KMALLOC_MAX_SIZE instead.
      
      The original motivation for using KMALLOC_SHIFT_MAX was to work around
      an incorrect KMALLOC_MAX_SIZE which could lead to allocation warnings
      but it is no longer needed since "slab: make sure that KMALLOC_MAX_SIZE
      will fit into MAX_ORDER".
      
      Link: http://lkml.kernel.org/r/20161220130659.16461-3-mhocko@kernel.org
      
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7984c27c
  4. Jan 10, 2017
    • pid: fix lockdep deadlock warning due to ucount_lock · add7c65c
      Andrei Vagin authored
      
      
      =========================================================
      [ INFO: possible irq lock inversion dependency detected ]
      4.10.0-rc2-00024-g4aecec9-dirty #118 Tainted: G        W
      ---------------------------------------------------------
      swapper/1/0 just changed the state of lock:
       (&(&sighand->siglock)->rlock){-.....}, at: [<ffffffffbd0a1bc6>] __lock_task_sighand+0xb6/0x2c0
      but this lock took another, HARDIRQ-unsafe lock in the past:
       (ucounts_lock){+.+...}
      and interrupts could create inverse lock ordering between them.
      other info that might help us debug this:
      Chain exists of:                 &(&sighand->siglock)->rlock --> &(&tty->ctrl_lock)->rlock --> ucounts_lock
       Possible interrupt unsafe locking scenario:
             CPU0                    CPU1
             ----                    ----
        lock(ucounts_lock);
                                     local_irq_disable();
                                     lock(&(&sighand->siglock)->rlock);
                                     lock(&(&tty->ctrl_lock)->rlock);
        <Interrupt>
          lock(&(&sighand->siglock)->rlock);
      
       *** DEADLOCK ***
      
      This patch removes a dependency between rlock and ucount_lock.
      
      Fixes: f333c700 ("pidns: Add a limit on the number of pid namespaces")
      Cc: stable@vger.kernel.org
      Signed-off-by: Andrei Vagin <avagin@openvz.org>
      Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      add7c65c
  5. Jan 03, 2017
    • audit: Fix sleep in atomic · be29d20f
      Jan Kara authored
      
      
      Audit tree code was happily adding new notification marks while holding
      spinlocks. Since fsnotify_add_mark() acquires group->mark_mutex this can
      lead to sleeping while holding a spinlock, deadlocks due to lock
      inversion, and probably other fun. Fix the problem by acquiring
      group->mark_mutex earlier.
      
      CC: Paul Moore <paul@paul-moore.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
      be29d20f
  6. Dec 27, 2016
  7. Dec 25, 2016
  8. Dec 24, 2016
  9. Dec 23, 2016
    • fsnotify: Remove fsnotify_duplicate_mark() · e3ba7307
      Jan Kara authored
      
      
      There are only two call sites of fsnotify_duplicate_mark(). Those are
      in kernel/audit_tree.c and both are bogus. The vfsmount pointer is unused
      for the audit tree, the inode pointer and group get set in
      fsnotify_add_mark_locked() later anyway, and mask and free_mark are already
      set in alloc_chunk(). In fact, calling fsnotify_duplicate_mark() is
      actively harmful because the following fsnotify_add_mark_locked() will leak
      a group reference by overwriting the group pointer. So just remove the two
      calls to fsnotify_duplicate_mark() and the function.
      
      Signed-off-by: Jan Kara <jack@suse.cz>
      [PM: line wrapping to fit in 80 chars]
      Signed-off-by: Paul Moore <paul@paul-moore.com>
      e3ba7307
    • move aio compat to fs/aio.c · c00d2c7e
      Al Viro authored
      
      
      ... and fix the minor buglet in compat io_submit() - native one
      kills ioctx as cleanup when put_user() fails.  Get rid of
      bogus compat_... in !CONFIG_AIO case, while we are at it - they
      should simply fail with ENOSYS, same as for native counterparts.
      
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      c00d2c7e
  10. Dec 21, 2016
  11. Dec 20, 2016
  12. Dec 18, 2016
    • uprobes: Fix uprobes on MIPS, allow for a cache flush after ixol breakpoint creation · 297e765e
      Marcin Nowakowski authored
      
      
      Commit:
      
        72e6ae28 ('ARM: 8043/1: uprobes need icache flush after xol write')
      
      ... has introduced an arch-specific method to ensure all caches are
      flushed appropriately after an instruction is written to an XOL page.
      
      However, when the XOL area is created and the out-of-line breakpoint
      instruction is copied, caches are not flushed at all and stale data may
      be found in icache.
      
      Replace a simple copy_to_page() with arch_uprobe_copy_ixol() to allow
      the arch to ensure all caches are updated accordingly.
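      
      In xol_get_insn_slot() the change is essentially the following (sketch,
      not the exact diff):
      
          /* before: plain copy, no cache maintenance */
          copy_to_page(area->pages[0], xol_vaddr,
                       &uprobe->arch.ixol, sizeof(uprobe->arch.ixol));
          
          /* after: arch hook, so e.g. MIPS/ARM can flush the icache as needed */
          arch_uprobe_copy_ixol(area->pages[0], xol_vaddr,
                                &uprobe->arch.ixol, sizeof(uprobe->arch.ixol));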
      
      This change fixes uprobes on MIPS InterAptiv (tested on Creator Ci40).
      
      Signed-off-by: Marcin Nowakowski <marcin.nowakowski@imgtec.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Victor Kamensky <victor.kamensky@linaro.org>
      Cc: linux-mips@linux-mips.org
      Link: http://lkml.kernel.org/r/1481625657-22850-1-git-send-email-marcin.nowakowski@imgtec.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      297e765e
    • bpf: fix mark_reg_unknown_value for spilled regs on map value marking · 6760bf2d
      Daniel Borkmann authored
      
      
      Martin reported a verifier issue that hit the BUG_ON() for his
      test case in the mark_reg_unknown_value() function:
      
        [  202.861380] kernel BUG at kernel/bpf/verifier.c:467!
        [...]
        [  203.291109] Call Trace:
        [  203.296501]  [<ffffffff811364d5>] mark_map_reg+0x45/0x50
        [  203.308225]  [<ffffffff81136558>] mark_map_regs+0x78/0x90
        [  203.320140]  [<ffffffff8113938d>] do_check+0x226d/0x2c90
        [  203.331865]  [<ffffffff8113a6ab>] bpf_check+0x48b/0x780
        [  203.343403]  [<ffffffff81134c8e>] bpf_prog_load+0x27e/0x440
        [  203.355705]  [<ffffffff8118a38f>] ? handle_mm_fault+0x11af/0x1230
        [  203.369158]  [<ffffffff812d8188>] ? security_capable+0x48/0x60
        [  203.382035]  [<ffffffff811351a4>] SyS_bpf+0x124/0x960
        [  203.393185]  [<ffffffff810515f6>] ? __do_page_fault+0x276/0x490
        [  203.406258]  [<ffffffff816db320>] entry_SYSCALL_64_fastpath+0x13/0x94
      
      This issue got uncovered after the fix in a08dd0da ("bpf: fix
      regression on verifier pruning wrt map lookups"). The reason why it
      wasn't noticed before is that, as mentioned in a08dd0da,
      mark_map_regs() was doing the id matching incorrectly based on the
      uncached regs[regno].id. So, in the first loop, we walked all regs
      and as soon as we found regno == i, this reg's id was cleared
      when calling mark_reg_unknown_value(), so every subsequent
      register was probed against an id of 0 (which, in combination with the
      PTR_TO_MAP_VALUE_OR_NULL type, is an invalid condition that no other
      register state can hold), and therefore wasn't type transitioned such
      as in the spilled register case of the second loop.
      
      Now since that got fixed, it turned out that 57a09bf0 ("bpf:
      Detect identical PTR_TO_MAP_VALUE_OR_NULL registers") used
      mark_reg_unknown_value() incorrectly for the spilled regs, and thus
      hitting the BUG_ON() in some cases due to regno >= MAX_BPF_REG.
      
      Although spilled regs have the same type as the non-spilled regs
      for the verifier state, that is, struct bpf_reg_state, they are
      semantically different from the non-spilled regs. In other words,
      there can be up to 64 (MAX_BPF_STACK / BPF_REG_SIZE) spilled regs
      in the stack, for example, register R<x> could have been spilled by
      the program to stack location X, Y, Z, and in mark_map_regs() we
      need to scan these stack slots of type STACK_SPILL for potential
      registers that we have to transition from PTR_TO_MAP_VALUE_OR_NULL.
      Therefore, depending on the location, the spilled_regs regno can
      be a lot higher than just MAX_BPF_REG's value since we operate on
      stack instead. The reset in mark_reg_unknown_value() itself is
      just fine, only that the BUG_ON() was inappropriate for this. Fix
      it by making a __mark_reg_unknown_value() version that can be
      called from mark_map_reg() generically; we know for the non-spilled
      case that the regno is always < MAX_BPF_REG anyway.
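      
      A simplified sketch of that split (not the exact diff):
      
          static void __mark_reg_unknown_value(struct bpf_reg_state *regs, u32 regno)
          {
              /* no BUG_ON() here: regno may index spilled slots, not just R0..R10 */
              regs[regno].type = UNKNOWN_VALUE;
              regs[regno].id = 0;
              regs[regno].imm = 0;
          }
          
          static void mark_reg_unknown_value(struct bpf_reg_state *regs, u32 regno)
          {
              BUG_ON(regno >= MAX_BPF_REG);
              __mark_reg_unknown_value(regs, regno);
          }
          
          /* mark_map_reg() then calls __mark_reg_unknown_value() so it works for
           * both the normal register file and the spilled stack slots.
           */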
      
      Fixes: 57a09bf0 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers")
      Reported-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6760bf2d
    • bpf: fix overflow in prog accounting · 5ccb071e
      Daniel Borkmann authored
      
      
      Commit aaac3ba9 ("bpf: charge user for creation of BPF maps and
      programs") made a wrong assumption of charging against prog->pages.
      Unlike map->pages, prog->pages are still subject to change when we
      need to expand the program through bpf_prog_realloc().
      
      This can for example happen during the verification stage when we need to
      expand and rewrite parts of the program. Should the required space
      cross a page boundary, then prog->pages is not the same anymore as
      the original value that we called bpf_prog_charge_memlock() on. Thus,
      we'll hit a wrap-around during bpf_prog_uncharge_memlock() when prog
      is eventually freed. I noticed this when, despite having unlimited
      memlock, programs suddenly refused to load with an EPERM error due to
      insufficient memlock.
      
      There are two ways to fix this issue. One would be to add a cached
      variable to struct bpf_prog that takes a snapshot of prog->pages at the
      time of charging. The other approach is to also account for resizes. I
      chose to go with the latter for a couple of reasons: i) We want accounting
      rather to be more accurate instead of further fooling limits, ii) adding
      yet another page counter on struct bpf_prog would also be a waste just
      for this purpose. We also do want to charge as early as possible to
      avoid going into the verifier just to find out later on that we crossed
      limits. The only place that needs to be fixed is bpf_prog_realloc(),
      since only here we expand the program, so we try to account for the
      needed delta and should we fail, call-sites check for outcome anyway.
      On cBPF to eBPF migrations, we don't grab a reference to the user as
      they are charged differently. With that in place, my test case worked
      fine.
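      
      Roughly, the accounting in bpf_prog_realloc() then looks like this
      (sketch, not the exact diff):
      
          /* kernel/bpf/core.c, bpf_prog_realloc() -- sketch of the accounting */
          pages = round_up(size, PAGE_SIZE) / PAGE_SIZE;
          if (pages <= fp_old->pages)
              return fp_old;
          
          delta = pages - fp_old->pages;
          if (__bpf_prog_charge(fp_old->aux->user, delta))    /* charge only the growth */
              return NULL;
          
          fp = __vmalloc(pages * PAGE_SIZE, gfp_flags, PAGE_KERNEL);
          if (!fp)
              __bpf_prog_uncharge(fp_old->aux->user, delta);
          else
              fp->pages = pages;    /* the final uncharge now matches what was charged */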
      
      Fixes: aaac3ba9 ("bpf: charge user for creation of BPF maps and programs")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5ccb071e
    • bpf: dynamically allocate digest scratch buffer · aafe6ae9
      Daniel Borkmann authored
      
      
      Geert rightfully complained that 7bd509e3 ("bpf: add prog_digest
      and expose it via fdinfo/netlink") added a too large allocation of the
      variable 'raw' from the bss section, which should instead be done dynamically:
      
        # ./scripts/bloat-o-meter kernel/bpf/core.o.1 kernel/bpf/core.o.2
        add/remove: 3/0 grow/shrink: 0/0 up/down: 33291/0 (33291)
        function                                     old     new   delta
        raw                                            -   32832  +32832
        [...]
      
      Since this is only relevant in the program creation path, which can be
      considered a slow path anyway, let's allocate it dynamically and not be
      implicitly dependent on the verifier mutex. Move bpf_prog_calc_digest() to
      the beginning of replace_map_fd_with_map_ptr() so that error handling
      stays straightforward.
      
      Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aafe6ae9
  13. Dec 17, 2016
    • bpf: fix regression on verifier pruning wrt map lookups · a08dd0da
      Daniel Borkmann authored
      
      
      Commit 57a09bf0 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL
      registers") introduced a regression where existing programs stopped
      loading due to reaching the verifier's maximum complexity limit,
      whereas prior to this commit they were loading just fine; the affected
      program has roughly 2k instructions.
      
      What was found is that state pruning couldn't be performed effectively
      anymore due to mismatches of the verifier's register state, in particular
      in the id tracking. It doesn't mean that 57a09bf0 is incorrect per
      se, but rather that the verifier needs to perform a lot more work for the
      same program with regard to the involved map lookups.
      
      Since commit 57a09bf0 is only about tracking registers with type
      PTR_TO_MAP_VALUE_OR_NULL, the id is only needed to follow registers
      until they are promoted through pattern matching with a NULL check to
      either PTR_TO_MAP_VALUE or UNKNOWN_VALUE type. After that point, the
      id becomes irrelevant for the transitioned types.
      
      For UNKNOWN_VALUE, id is already reset to 0 via mark_reg_unknown_value(),
      but not so for PTR_TO_MAP_VALUE where id is becoming stale. It's even
      transferred further into other types that don't make use of it. Among
      others, one example is where UNKNOWN_VALUE is set on function call
      return with RET_INTEGER return type.
      
      states_equal() will then fall through the memcmp() on register state;
      note that the second memcmp() uses offsetofend(), so the id is part of
      that since d2a4dd37 ("bpf: fix state equivalence"). But the bisect
      already pointed to 57a09bf0, where we really reach beyond the complexity
      limit. What I found was that states_equal() often failed in this
      case due to id mismatches in spilled regs with registers in type
      PTR_TO_MAP_VALUE. Unlike non-spilled regs, spilled regs just perform
      a memcmp() on their reg state and don't have any other optimizations
      in place, therefore also id was relevant in this case for making a
      pruning decision.
      
      We can safely reset id to 0 as well when converting to PTR_TO_MAP_VALUE.
      For the affected program, it resulted in a ~17 fold reduction of
      complexity and let the program load fine again. Selftest suite also
      runs fine. The only other place where env->id_gen is used currently is
      through direct packet access, but for these cases id is long living, thus
      a different scenario.
      
      Also, the current logic in mark_map_regs() is not fully correct when
      marking the NULL branch with UNKNOWN_VALUE. We need to cache the destination
      reg's id in any case. Otherwise, once we marked that reg as UNKNOWN_VALUE,
      its id is reset and any subsequent registers that hold the original id
      and are of type PTR_TO_MAP_VALUE_OR_NULL won't be marked UNKNOWN_VALUE
      anymore, since mark_map_reg() reuses the uncached regs[regno].id that
      was just overridden. Note, we don't need to cache it outside of
      mark_map_regs(), since it's called once on this_branch and the other
      time on other_branch, which are both two independent verifier states.
      A test case for this is added here, too.
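      
      The caching part looks roughly like this (simplified sketch, not the
      exact diff):
      
          static void mark_map_regs(struct bpf_verifier_state *state, u32 regno,
                                    enum bpf_reg_type type)
          {
              struct bpf_reg_state *regs = state->regs;
              u32 id = regs[regno].id;    /* cache before it gets reset below */
              int i;
          
              for (i = 0; i < MAX_BPF_REG; i++)
                  mark_map_reg(regs, i, id, type);
          
              for (i = 0; i < MAX_BPF_STACK; i += BPF_REG_SIZE) {
                  if (state->stack_slot_type[i] != STACK_SPILL)
                      continue;
                  mark_map_reg(state->spilled_regs, i / BPF_REG_SIZE, id, type);
              }
          }
          
          /* mark_map_reg() compares reg->id against the cached id and, on a match,
           * transitions the type and resets reg->id to 0 so pruning can work.
           */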
      
      Fixes: 57a09bf0 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Thomas Graf <tgraf@suug.ch>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a08dd0da
  14. Dec 15, 2016
    • printk: Remove no longer used second struct cont · 8fa9a697
      Geert Uytterhoeven authored
      
      
      If CONFIG_PRINTK=n:
      
          kernel/printk/printk.c:1893: warning: ‘cont’ defined but not used
      
      Note that there are actually two different struct cont definitions and
      objects: the first one is used if CONFIG_PRINTK=y, the second one became
      unused by removing console_cont_flush().
      
      Fixes: 5c2992ee ("printk: remove console flushing special cases for partial buffered lines")
      Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: Petr Mladek <pmladek@suse.com>
      [ I do the occasional "allnoconfig" builds, but apparently not often
        enough  - Linus ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8fa9a697
    • cpu/hotplug: Clarify description of __cpuhp_setup_state() return value · 512f0980
      Boris Ostrovsky authored
      
      
      When invoked with the CPUHP_AP_ONLINE_DYN state, __cpuhp_setup_state()
      is expected to return a positive value, which is the hotplug state that
      the routine assigns.
      
      Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: linux-pm@vger.kernel.org
      Cc: viresh.kumar@linaro.org
      Cc: bigeasy@linutronix.de
      Cc: rjw@rjwysocki.net
      Cc: xen-devel@lists.xenproject.org
      Link: http://lkml.kernel.org/r/1481814058-4799-2-git-send-email-boris.ostrovsky@oracle.com
      
      
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      512f0980
    • genirq/affinity: Fix node generation from cpumask · c0af5243
      Guilherme G. Piccoli authored
      Commit 34c3d9819fda ("genirq/affinity: Provide smarter irq spreading
      infrastructure") introduced a better IRQ spreading mechanism, taking
      account of the available NUMA nodes in the machine.
      
      The problem is that the algorithm retrieving the nodemask iterates
      "linearly" based on the number of online nodes - some architectures
      present a non-linear node distribution among the nodemask, like PowerPC.
      In that case, the algorithm leads to a wrong node count and therefore
      to a bad/incomplete IRQ affinity distribution.
      
      For example, this problem was found on a machine with 128 CPUs and two
      nodes, namely nodes 0 and 8 (instead of 0 and 1, if they were linearly
      distributed). This led to a wrong affinity distribution which then led to
      a bad mq allocation for the nvme driver.
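      
      The shape of the fix is to walk the actual node ids instead of assuming
      they are 0..N-1 (simplified sketch, not the exact diff):
      
          /* kernel/irq/affinity.c -- sketch: count nodes by id, not by index */
          static int get_nodes_in_cpumask(const struct cpumask *mask,
                                          nodemask_t *nodemsk)
          {
              int n, nodes = 0;
          
              /* Calculate the number of nodes in the supplied affinity mask */
              for_each_online_node(n) {
                  if (cpumask_intersects(mask, cpumask_of_node(n))) {
                      node_set(n, *nodemsk);
                      nodes++;
                  }
              }
              return nodes;
          }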
      
      Finally, we take the opportunity to fix a comment regarding the affinity
      distribution when we have _more_ nodes than vectors.
      
      Fixes: 34c3d9819fda ("genirq/affinity: Provide smarter irq spreading infrastructure")
      Reported-by: Gabriel Krisman Bertazi <gabriel@krisman.be>
      Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Gabriel Krisman Bertazi <gabriel@krisman.be>
      Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Cc: linux-pci@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: hch@lst.de
      Link: http://lkml.kernel.org/r/1481738472-2671-1-git-send-email-gpiccoli@linux.vnet.ibm.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      c0af5243
    • tick/broadcast: Prevent NULL pointer dereference · c1a9eeb9
      Thomas Gleixner authored
      
      
      When a dysfunctional timer, e.g. the dummy timer, is installed, the tick core
      tries to set up the broadcast timer.
      
      If no broadcast device is installed, the kernel crashes with a NULL pointer
      dereference in tick_broadcast_setup_oneshot() because the function has no
      sanity check.
      
      Reported-by: Mason <slash.tmp@free.fr>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: Richard Cochran <rcochran@linutronix.de>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sebastian Frias <sf84@laposte.net>
      Cc: Thibaud Cornic <thibaud_cornic@sigmadesigns.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Link: http://lkml.kernel.org/r/1147ef90-7877-e4d2-bb2b-5c4fa8d3144b@free.fr
      c1a9eeb9
    • printk: remove console flushing special cases for partial buffered lines · 5c2992ee
      Linus Torvalds authored
      
      
      It actively hurts proper merging, and makes for a lot of special cases.
      There was a good(ish) reason for doing it originally, but it's getting
      too painful to maintain.  And most of the original reasons for it are
      long gone.
      
      So instead of having special code to flush partial lines to the console
      (as opposed to the record buffers), do _all_ the console writing from
      the record buffer, and be done with it.
      
      If an oops happens (or some other synchronous event), we will flush the
      partial lines due to the oops printing activity, so this does not affect
      that.  It does mean that if you have a completely hung machine, a
      partial preceding line may not have been printed out.
      
      That was some of the original reason for this complexity, in fact, back
      when we used to test for the historical i386 "halt" instruction problem
      by doing
      
      	pr_info("Checking 'hlt' instruction... ");
      
      	if (!boot_cpu_data.hlt_works_ok) {
      		pr_cont("disabled\n");
      		return;
      	}
      	halt();
      	halt();
      	halt();
      	halt();
      	pr_cont("OK\n");
      
      and that model no longer works (if the 'hlt' instruction kills the
      machine, the partial line won't have been flushed, so you won't even see
      it).
      
      Of course, that was also back in the days when people actually had
      textual console output rather than a graphical splash-screen at bootup.
      How times change..
      
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Tested-by: Petr Mladek <pmladek@suse.com>
      Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Tested-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5c2992ee
    • printk: remove games with previous record flags · 5aa068ea
      Linus Torvalds authored
      
      
      The record logging code looks at the previous record flags in various
      ways, and they are all wrong.
      
      You can't use the previous record flags to determine anything about the
      next record, because they may simply not be related.  In particular, the
      reason the previous record was a continuation record may well be exactly
      _because_ the new record was printed by a different process, which is
      why the previous record was flushed.
      
      So all those games are simply wrong, and make the code hard to
      understand (because the code fundamentally does not make sense).
      
      So remove it.
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5aa068ea
    • mm: add locked parameter to get_user_pages_remote() · 5b56d49f
      Lorenzo Stoakes authored
      Patch series "mm: unexport __get_user_pages_unlocked()".
      
      This patch series continues the cleanup of get_user_pages*() functions
      taking advantage of the fact we can now pass gup_flags as we please.
      
      It firstly adds an additional 'locked' parameter to
      get_user_pages_remote() to allow for its callers to utilise
      VM_FAULT_RETRY functionality.  This is necessary as the invocation of
      __get_user_pages_unlocked() in process_vm_rw_single_vec() makes use of
      this and no other existing higher level function would allow it to do
      so.
      
      Secondly, existing callers of __get_user_pages_unlocked() are replaced
      with the appropriate higher-level replacement -
      get_user_pages_unlocked() if the current task and memory descriptor are
      referenced, or get_user_pages_remote() if other task/memory descriptors
      are referenced (having acquired mmap_sem).
      
      This patch (of 2):
      
      Add an int *locked parameter to get_user_pages_remote() to allow
      VM_FAULT_RETRY faulting behaviour similar to get_user_pages_[un]locked().
      
      Taking into account the previous adjustments to the get_user_pages*()
      functions allowing for the passing of gup_flags, we are now in a
      position where __get_user_pages_unlocked() need only be exported for its
      ability to allow VM_FAULT_RETRY behaviour; this adjustment allows us to
      subsequently unexport __get_user_pages_unlocked() as well as allowing
      for future flexibility in the use of get_user_pages_remote().
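      
      With the new parameter, a caller that wants VM_FAULT_RETRY handling can
      do something along these lines (illustrative example only, not taken
      from the patch; tsk, mm, addr and page are assumed context):
      
          int locked = 1;
          long pinned;
          
          down_read(&mm->mmap_sem);
          pinned = get_user_pages_remote(tsk, mm, addr, 1, FOLL_WRITE,
                                         &page, NULL, &locked);
          if (locked)
              up_read(&mm->mmap_sem);
          /* if 'locked' is 0 here, mmap_sem was dropped during a retried fault */
          
          /* callers that do not want VM_FAULT_RETRY simply pass NULL for 'locked' */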
      
      [sfr@canb.auug.org.au: merge fix for get_user_pages_remote API change]
        Link: http://lkml.kernel.org/r/20161122210511.024ec341@canb.auug.org.au
      Link: http://lkml.kernel.org/r/20161027095141.2569-2-lstoakes@gmail.com
      
      
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b56d49f
    • kernel/watchdog.c: move hardlockup detector to separate file · 73ce0511
      Babu Moger authored
      Separate the hardlockup code from watchdog.c and move it to watchdog_hld.c.
      It is mostly straightforward.  Remove everything inside
      CONFIG_HARDLOCKUP_DETECTOR.  This code will go to the file watchdog_hld.c.
      Also update the Makefile accordingly.
      
      Link: http://lkml.kernel.org/r/1478034826-43888-3-git-send-email-babu.moger@oracle.com
      
      
      Signed-off-by: Babu Moger <babu.moger@oracle.com>
      Acked-by: Don Zickus <dzickus@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: Josh Hunt <johunt@akamai.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      73ce0511
    • kernel/watchdog.c: move shared definitions to nmi.h · 249e52e3
      Babu Moger authored
      Patch series "Clean up watchdog handlers", v2.
      
      This is an attempt to cleanup watchdog handlers.  Right now,
      kernel/watchdog.c implements both softlockup and hardlockup detectors.
      Softlockup code is generic.  Hardlockup code is arch specific.  Some
      architectures don't use hardlockup detectors.  They use their own
      watchdog detectors.  To make all these combinations work, we have
      numerous #ifdefs in kernel/watchdog.c.
      
      We are trying here to make these handlers independent of each other.
      Also provide an interface for architectures to implement their own
      handlers.  watchdog_nmi_enable and watchdog_nmi_disable will be defined
      as weak such that architectures can override their definitions.
      
      Thanks to Don Zickus for his suggestions.
      Here are our previous discussions
      http://www.spinics.net/lists/sparclinux/msg16543.html
      http://www.spinics.net/lists/sparclinux/msg16441.html
      
      This patch (of 3):
      
      Move shared macros and definitions to nmi.h so that watchdog.c, new file
      watchdog_hld.c or any other architecture specific handler can use those
      definitions.
      
      Link: http://lkml.kernel.org/r/1478034826-43888-2-git-send-email-babu.moger@oracle.com
      
      
      Signed-off-by: Babu Moger <babu.moger@oracle.com>
      Acked-by: Don Zickus <dzickus@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: Josh Hunt <johunt@akamai.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      249e52e3
    • posix-timers: give lazy compilers some help optimizing code away · b6f8a92c
      Nicolas Pitre authored
      
      
      The OpenRISC compiler (so far) fails to optimize away a large portion of
      code containing a reference to posix_timer_event in alarmtimer.c when
      CONFIG_POSIX_TIMERS is unset.  Let's give it a direct clue to let the
      build succeed.
      
      This fixes
      [linux-next:master 6682/7183] alarmtimer.c:undefined reference to `posix_timer_event'
      reported by kbuild test robot.
      
      Signed-off-by: Nicolas Pitre <nico@linaro.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Josh Triplett <josh@joshtriplett.org>
      
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b6f8a92c
    • kdb: call vkdb_printf() from vprintk_default() only when wanted · 34aaff40
      Petr Mladek authored
      kdb_trap_printk allows passing normal printk() messages to kdb via
      vkdb_printf().  For example, it is used to get a backtrace using the
      classic show_stack(), see kdb_show_stack().
      
      vkdb_printf() tries to avoid a potential infinite loop by disabling the
      trap.  But this approach is racy, for example:
      
      CPU1					CPU2
      
      vkdb_printf()
        // assume that kdb_trap_printk == 0
        saved_trap_printk = kdb_trap_printk;
        kdb_trap_printk = 0;
      
      					kdb_show_stack()
      					  kdb_trap_printk++;
      
      Problem1: Now, a nested printk() on CPU0 calls vkdb_printf()
      	  even when it should have been disabled. It will not
      	  cause a deadlock but...
      
         // using the outdated saved value: 0
         kdb_trap_printk = saved_trap_printk;
      
      					  kdb_trap_printk--;
      
      Problem2: Now, kdb_trap_printk == -1 and will stay like this.
         It means that all messages will get passed to kdb from
         now on.
      
      This patch removes the racy saved_trap_printk handling.  Instead, the
      recursion is prevented by a check for the locked CPU.
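      
      In the printk path, that check looks roughly like this (sketch, not the
      exact diff):
      
          #ifdef CONFIG_KGDB_KDB
              /* Allow to pass printk() to kdb but avoid a recursion. */
              if (unlikely(kdb_trap_printk && kdb_printf_cpu < 0))
                  return vkdb_printf(KDB_MSGSRC_PRINTK, fmt, args);
          #endif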
      
      The solution is still kind of racy.  A non-related printk(), from
      another process, might get trapped by vkdb_printf().  And the wanted
      printk() might not get trapped because kdb_printf_cpu is assigned.  But
      this problem existed even with the original code.
      
      A proper solution would be to get_cpu() before setting kdb_trap_printk
      and trap messages only from this CPU.  I am not sure if it is worth the
      effort, though.
      
      In fact, the race is very theoretical.  When kdb is running any of the
      commands that use kdb_trap_printk there is a single active CPU and the
      other CPUs should be in a holding pen inside kgdb_cpu_enter().
      
      The only time this is violated is when there is a timeout waiting for
      the other CPUs to report to the holding pen.
      
      Finally, note that the situation is a bit schizophrenic.  vkdb_printf()
      explicitly allows recursion but only from KDB code that calls
      kdb_printf() directly.  On the other hand, the generic printk()
      recursion is not allowed because it might cause an infinite loop.  This
      is why we could not hide the decision inside vkdb_printf() easily.
      
      Link: http://lkml.kernel.org/r/1480412276-16690-4-git-send-email-pmladek@suse.com
      
      
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      Cc: Daniel Thompson <daniel.thompson@linaro.org>
      Cc: Jason Wessel <jason.wessel@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      34aaff40
    • kdb: properly synchronize vkdb_printf() calls with other CPUs · d5d8d3d0
      Petr Mladek authored
      kdb_printf_lock does not prevent other CPUs from entering the critical
      section because it is ignored when KDB_STATE_PRINTF_LOCK is set.
      
      The problematic situation might look like:
      
      CPU0					CPU1
      
      vkdb_printf()
        if (!KDB_STATE(PRINTF_LOCK))
          KDB_STATE_SET(PRINTF_LOCK);
          spin_lock_irqsave(&kdb_printf_lock, flags);
      
      					vkdb_printf()
      					  if (!KDB_STATE(PRINTF_LOCK))
      
      BANG: The PRINTF_LOCK state is set and CPU1 is entering the critical
      section without spinning on the lock.
      
      The problem is that the code tries to implement locking using two state
      variables that are not handled atomically.  Well, we need a custom
      locking because we want to allow reentering the critical section on the
      very same CPU.
      
      Let's use the solution from Peter Zijlstra that was proposed for a similar
      scenario, see
      https://lkml.kernel.org/r/20161018171513.734367391@infradead.org
      
      This patch uses the same trick with cmpxchg().  The only difference is
      that we want to handle only recursion from the same context and
      therefore we disable interrupts.
      
      In addition, KDB_STATE_PRINTF_LOCK is removed.  In fact, we are not able
      to set it in a non-racy way.
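      
      The resulting locking in vkdb_printf() looks roughly like this (sketch,
      not the exact diff):
      
          int this_cpu, old_cpu;
          
          /* Serialize concurrent writers, but let the owning CPU re-enter. */
          local_irq_save(flags);
          this_cpu = smp_processor_id();
          for (;;) {
              old_cpu = cmpxchg(&kdb_printf_cpu, -1, this_cpu);
              if (old_cpu == -1 || old_cpu == this_cpu)
                  break;
              cpu_relax();
          }
          
          /* ... do the printing ... */
          
          /* Release only if we were the outermost caller on this CPU. */
          smp_store_release(&kdb_printf_cpu, old_cpu);
          local_irq_restore(flags);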
      
      Link: http://lkml.kernel.org/r/1480412276-16690-3-git-send-email-pmladek@suse.com
      
      
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
      Cc: Jason Wessel <jason.wessel@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d5d8d3d0