Skip to content
  1. Feb 22, 2018
    • Tony Luck's avatar
      efivarfs: Limit the rate for non-root to read files · bef3efbe
      Tony Luck authored
      
      
      Each read from a file in efivarfs results in two calls to EFI
      (one to get the file size, another to get the actual data).
      
      On X86 these EFI calls result in broadcast system management
      interrupts (SMI) which affect performance of the whole system.
      A malicious user can loop performing reads from efivarfs bringing
      the system to its knees.
      
      Linus suggested per-user rate limit to solve this.
      
      So we add a ratelimit structure to "user_struct" and initialize
      it for the root user for no limit. When allocating user_struct for
      other users we set the limit to 100 per second. This could be used
      for other places that want to limit the rate of some detrimental
      user action.
      
      In efivarfs if the limit is exceeded when reading, we take an
      interruptible nap for 50ms and check the rate limit again.
      
      Signed-off-by: default avatarTony Luck <tony.luck@intel.com>
      Acked-by: default avatarArd Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bef3efbe
  2. Feb 21, 2018
  3. Feb 16, 2018
    • Andy Shevchenko's avatar
      irqdomain: Re-use DEFINE_SHOW_ATTRIBUTE() macro · 0b24a0bb
      Andy Shevchenko authored
      
      
      ...instead of open coding file operations followed by custom ->open()
      callbacks per each attribute.
      
      Signed-off-by: default avatarAndy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      0b24a0bb
    • Jessica Yu's avatar
      kprobes: Propagate error from disarm_kprobe_ftrace() · 297f9233
      Jessica Yu authored
      Improve error handling when disarming ftrace-based kprobes. Like with
      arm_kprobe_ftrace(), propagate any errors from disarm_kprobe_ftrace() so
      that we do not disable/unregister kprobes that are still armed. In other
      words, unregister_kprobe() and disable_kprobe() should not report success
      if the kprobe could not be disarmed.
      
      disarm_all_kprobes() keeps its current behavior and attempts to
      disarm all kprobes. It returns the last encountered error and gives a
      warning if not all probes could be disarmed.
      
      This patch is based on Petr Mladek's original patchset (patches 2 and 3)
      back in 2015, which improved kprobes error handling, found here:
      
         https://lkml.org/lkml/2015/2/26/452
      
      
      
      However, further work on this had been paused since then and the patches
      were not upstreamed.
      
      Based-on-patches-by: default avatarPetr Mladek <pmladek@suse.com>
      Signed-off-by: default avatarJessica Yu <jeyu@kernel.org>
      Acked-by: default avatarMasami Hiramatsu <mhiramat@kernel.org>
      Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: David S . Miller <davem@davemloft.net>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Joe Lawrence <joe.lawrence@redhat.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Miroslav Benes <mbenes@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: live-patching@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180109235124.30886-3-jeyu@kernel.org
      
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      297f9233
    • Jessica Yu's avatar
      kprobes: Propagate error from arm_kprobe_ftrace() · 12310e34
      Jessica Yu authored
      Improve error handling when arming ftrace-based kprobes. Specifically, if
      we fail to arm a ftrace-based kprobe, register_kprobe()/enable_kprobe()
      should report an error instead of success. Previously, this has lead to
      confusing situations where register_kprobe() would return 0 indicating
      success, but the kprobe would not be functional if ftrace registration
      during the kprobe arming process had failed. We should therefore take any
      errors returned by ftrace into account and propagate this error so that we
      do not register/enable kprobes that cannot be armed. This can happen if,
      for example, register_ftrace_function() finds an IPMODIFY conflict (since
      kprobe_ftrace_ops has this flag set) and returns an error. Such a conflict
      is possible since livepatches also set the IPMODIFY flag for their ftrace_ops.
      
      arm_all_kprobes() keeps its current behavior and attempts to arm all
      kprobes. It returns the last encountered error and gives a warning if
      not all probes could be armed.
      
      This patch is based on Petr Mladek's original patchset (patches 2 and 3)
      back in 2015, which improved kprobes error handling, found here:
      
         https://lkml.org/lkml/2015/2/26/452
      
      
      
      However, further work on this had been paused since then and the patches
      were not upstreamed.
      
      Based-on-patches-by: default avatarPetr Mladek <pmladek@suse.com>
      Signed-off-by: default avatarJessica Yu <jeyu@kernel.org>
      Acked-by: default avatarMasami Hiramatsu <mhiramat@kernel.org>
      Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: David S . Miller <davem@davemloft.net>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Joe Lawrence <joe.lawrence@redhat.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Miroslav Benes <mbenes@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: live-patching@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180109235124.30886-2-jeyu@kernel.org
      
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      12310e34
  4. Feb 13, 2018
  5. Feb 11, 2018
    • Linus Torvalds's avatar
      vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Linus Torvalds authored
      
      
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
      values as the POLL* constants do.  But they keyword here is "almost".
      For various bad reasons they aren't the same, and epoll() doesn't
      actually work quite correctly in some cases due to this on Sparc et al.
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
      
      Scripted-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a9a08845
  6. Feb 08, 2018
  7. Feb 07, 2018
  8. Feb 06, 2018
    • John Fastabend's avatar
      bpf: sockmap, fix leaking maps with attached but not detached progs · 3d9e9526
      John Fastabend authored
      
      
      When a program is attached to a map we increment the program refcnt
      to ensure that the program is not removed while it is potentially
      being referenced from sockmap side. However, if this same program
      also references the map (this is a reasonably common pattern in
      my programs) then the verifier will also increment the maps refcnt
      from the verifier. This is to ensure the map doesn't get garbage
      collected while the program has a reference to it.
      
      So we are left in a state where the map holds the refcnt on the
      program stopping it from being removed and releasing the map refcnt.
      And vice versa the program holds a refcnt on the map stopping it
      from releasing the refcnt on the prog.
      
      All this is fine as long as users detach the program while the
      map fd is still around. But, if the user omits this detach command
      we are left with a dangling map we can no longer release.
      
      To resolve this when the map fd is released decrement the program
      references and remove any reference from the map to the program.
      This fixes the issue with possibly dangling map and creates a
      user side API constraint. That is, the map fd must be held open
      for programs to be attached to a map.
      
      Fixes: 174a79ff ("bpf: sockmap with sk redirect support")
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      3d9e9526
    • John Fastabend's avatar
      bpf: sockmap, add sock close() hook to remove socks · 1aa12bdf
      John Fastabend authored
      
      
      The selftests test_maps program was leaving dangling BPF sockmap
      programs around because not all psock elements were removed from
      the map. The elements in turn hold a reference on the BPF program
      they are attached to causing BPF programs to stay open even after
      test_maps has completed.
      
      The original intent was that sk_state_change() would be called
      when TCP socks went through TCP_CLOSE state. However, because
      socks may be in SOCK_DEAD state or the sock may be a listening
      socket the event is not always triggered.
      
      To resolve this use the ULP infrastructure and register our own
      proto close() handler. This fixes the above case.
      
      Fixes: 174a79ff ("bpf: sockmap with sk redirect support")
      Reported-by: default avatarPrashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
      Signed-off-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      1aa12bdf
    • Mel Gorman's avatar
      sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS · 32e839dd
      Mel Gorman authored
      
      
      The select_idle_sibling() (SIS) rewrite in commit:
      
        10e2f1ac ("sched/core: Rewrite and improve select_idle_siblings()")
      
      ... replaced a domain iteration with a search that broadly speaking
      does a wrapped walk of the scheduler domain sharing a last-level-cache.
      
      While this had a number of improvements, one consequence is that two tasks
      that share a waker/wakee relationship push each other around a socket. Even
      though two tasks may be active, all cores are evenly used. This is great from
      a search perspective and spreads a load across individual cores, but it has
      adverse consequences for cpufreq. As each CPU has relatively low utilisation,
      cpufreq may decide the utilisation is too low to used a higher P-state and
      overall computation throughput suffers.
      
      While individual cpufreq and cpuidle drivers may compensate by artifically
      boosting P-state (at c0) or avoiding lower C-states (during idle), it does
      not help if hardware-based cpufreq (e.g. HWP) is used.
      
      This patch tracks a recently used CPU based on what CPU a task was running
      on when it last was a waker a CPU it was recently using when a task is a
      wakee. During SIS, the recently used CPU is used as a target if it's still
      allowed by the task and is idle.
      
      The benefit may be non-obvious so consider an example of two tasks
      communicating back and forth. Task A may be an application doing IO where
      task B is a kworker or kthread like journald. Task A may issue IO, wake
      B and B wakes up A on completion.  With the existing scheme this may look
      like the following (potentially different IDs if SMT is in use but similar
      principal applies).
      
       A (cpu 0)	wake	B (wakes on cpu 1)
       B (cpu 1)	wake	A (wakes on cpu 2)
       A (cpu 2)	wake	B (wakes on cpu 3)
       etc.
      
      A careful reader may wonder why CPU 0 was not idle when B wakes A the
      first time and it's simply due to the fact that A can be rescheduled to
      another CPU and the pattern is that prev == target when B tries to wakeup A
      and the information about CPU 0 has been lost.
      
      With this patch, the pattern is more likely to be:
      
       A (cpu 0)	wake	B (wakes on cpu 1)
       B (cpu 1)	wake	A (wakes on cpu 0)
       A (cpu 0)	wake	B (wakes on cpu 1)
       etc
      
      i.e. two communicating casts are more likely to use just two cores instead
      of all available cores sharing a LLC.
      
      The most dramatic speedup was noticed on dbench using the XFS filesystem on
      UMA as clients interact heavily with workqueues in that configuration. Note
      that a similar speedup is not observed on ext4 as the wakeup pattern
      is different:
      
                                4.15.0-rc9             4.15.0-rc9
                                 waprev-v1        biasancestor-v1
       Hmean      1      287.54 (   0.00%)      817.01 ( 184.14%)
       Hmean      2     1268.12 (   0.00%)     1781.24 (  40.46%)
       Hmean      4     1739.68 (   0.00%)     1594.47 (  -8.35%)
       Hmean      8     2464.12 (   0.00%)     2479.56 (   0.63%)
       Hmean     64     1455.57 (   0.00%)     1434.68 (  -1.44%)
      
      The results can be less dramatic on NUMA where automatic balancing interferes
      with the test. It's also known that network benchmarks running on localhost
      also benefit quite a bit from this patch (roughly 10% on netperf RR for UDP
      and TCP depending on the machine). Hackbench also seens small improvements
      (6-11% depending on machine and thread count). The facebook schbench was also
      tested but in most cases showed little or no different to wakeup latencies.
      
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180130104555.4125-5-mgorman@techsingularity.net
      
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      32e839dd
    • Mel Gorman's avatar
      sched/fair: Do not migrate if the prev_cpu is idle · 806486c3
      Mel Gorman authored
      
      
      wake_affine_idle() prefers to move a task to the current CPU if the
      wakeup is due to an interrupt. The expectation is that the interrupt
      data is cache hot and relevant to the waking task as well as avoiding
      a search. However, there is no way to determine if there was cache hot
      data on the previous CPU that may exceed the interrupt data. Furthermore,
      round-robin delivery of interrupts can migrate tasks around a socket where
      each CPU is under-utilised.  This can interact badly with cpufreq which
      makes decisions based on per-cpu data. It has been observed on machines
      with HWP that p-states are not boosted to their maximum levels even though
      the workload is latency and throughput sensitive.
      
      This patch uses the previous CPU for the task if it's idle and cache-affine
      with the current CPU even if the current CPU is idle due to the wakup
      being related to the interrupt. This reduces migrations at the cost of
      the interrupt data not being cache hot when the task wakes.
      
      A variety of workloads were tested on various machines and no adverse
      impact was noticed that was outside noise. dbench on ext4 on UMA showed
      roughly 10% reduction in the number of CPU migrations and it is a case
      where interrupts are frequent for IO competions. In most cases, the
      difference in performance is quite small but variability is often
      reduced. For example, this is the result for pgbench running on a UMA
      machine with different numbers of clients.
      
                                4.15.0-rc9             4.15.0-rc9
                                  baseline              waprev-v1
       Hmean     1     22096.28 (   0.00%)    22734.86 (   2.89%)
       Hmean     4     74633.42 (   0.00%)    75496.77 (   1.16%)
       Hmean     7    115017.50 (   0.00%)   113030.81 (  -1.73%)
       Hmean     12   126209.63 (   0.00%)   126613.40 (   0.32%)
       Hmean     16   131886.91 (   0.00%)   130844.35 (  -0.79%)
       Stddev    1       636.38 (   0.00%)      417.11 (  34.46%)
       Stddev    4       614.64 (   0.00%)      583.24 (   5.11%)
       Stddev    7       542.46 (   0.00%)      435.45 (  19.73%)
       Stddev    12      173.93 (   0.00%)      171.50 (   1.40%)
       Stddev    16      671.42 (   0.00%)      680.30 (  -1.32%)
       CoeffVar  1         2.88 (   0.00%)        1.83 (  36.26%)
      
      Note that the different in performance is marginal but for low utilisation,
      there is less variability.
      
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180130104555.4125-4-mgorman@techsingularity.net
      
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      806486c3
    • Mel Gorman's avatar
      sched/fair: Restructure wake_affine*() to return a CPU id · 3b76c4a3
      Mel Gorman authored
      
      
      This is a preparation patch that has wake_affine*() return a CPU ID instead of
      a boolean. The intent is to allow the wake_affine() helpers to be avoided
      if a decision is already made. This patch has no functional change.
      
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180130104555.4125-3-mgorman@techsingularity.net
      
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      3b76c4a3
    • Mel Gorman's avatar
      sched/fair: Remove unnecessary parameters from wake_affine_idle() · 89a55f56
      Mel Gorman authored
      
      
      wake_affine_idle() takes parameters it never uses so clean it up.
      
      Signed-off-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180130104555.4125-2-mgorman@techsingularity.net
      
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      89a55f56
    • Wen Yang's avatar
      sched/rt: Make update_curr_rt() more accurate · e7ad2031
      Wen Yang authored
      
      
      rq->clock_task may be updated between the two calls of
      rq_clock_task() in update_curr_rt(). Calling rq_clock_task() only
      once makes it more accurate and efficient, taking update_curr() as
      reference.
      
      Signed-off-by: default avatarWen Yang <wen.yang99@zte.com.cn>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: default avatarJiang Biao <jiang.biao2@zte.com.cn>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: zhong.weidong@zte.com.cn
      Link: http://lkml.kernel.org/r/1517800721-42092-1-git-send-email-wen.yang99@zte.com.cn
      
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      e7ad2031
    • Steven Rostedt (VMware)'s avatar
      sched/rt: Up the root domain ref count when passing it around via IPIs · 364f5665
      Steven Rostedt (VMware) authored
      
      
      When issuing an IPI RT push, where an IPI is sent to each CPU that has more
      than one RT task scheduled on it, it references the root domain's rto_mask,
      that contains all the CPUs within the root domain that has more than one RT
      task in the runable state. The problem is, after the IPIs are initiated, the
      rq->lock is released. This means that the root domain that is associated to
      the run queue could be freed while the IPIs are going around.
      
      Add a sched_get_rd() and a sched_put_rd() that will increment and decrement
      the root domain's ref count respectively. This way when initiating the IPIs,
      the scheduler will up the root domain's ref count before releasing the
      rq->lock, ensuring that the root domain does not go away until the IPI round
      is complete.
      
      Reported-by: default avatarPavan Kondeti <pkondeti@codeaurora.org>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 4bdced5c ("sched/rt: Simplify the IPI based RT balancing logic")
      Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com
      
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      364f5665
    • Steven Rostedt (VMware)'s avatar
      sched/rt: Use container_of() to get root domain in rto_push_irq_work_func() · ad0f1d9d
      Steven Rostedt (VMware) authored
      
      
      When the rto_push_irq_work_func() is called, it looks at the RT overloaded
      bitmask in the root domain via the runqueue (rq->rd). The problem is that
      during CPU up and down, nothing here stops rq->rd from changing between
      taking the rq->rd->rto_lock and releasing it. That means the lock that is
      released is not the same lock that was taken.
      
      Instead of using this_rq()->rd to get the root domain, as the irq work is
      part of the root domain, we can simply get the root domain from the irq work
      that is passed to the routine:
      
       container_of(work, struct root_domain, rto_push_work)
      
      This keeps the root domain consistent.
      
      Reported-by: default avatarPavan Kondeti <pkondeti@codeaurora.org>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 4bdced5c ("sched/rt: Simplify the IPI based RT balancing logic")
      Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com
      
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      ad0f1d9d
Loading