  1. May 18, 2017
  2. May 12, 2017
  3. May 10, 2017
    • perf/callchain: Force USER_DS when invoking perf_callchain_user() · 88b0193d
      Will Deacon authored
      
      Perf can generate and record a user callchain in response to a synchronous
      request, such as a tracepoint firing. If this happens under set_fs(KERNEL_DS),
      then we can end up walking the user stack (and dereferencing/saving whatever we
      find there) without the protections usually afforded by checks such as
      access_ok.
      
      Rather than play whack-a-mole with each architecture's stack unwinding
      implementation, fix the root of the problem by ensuring that we force USER_DS
      when invoking perf_callchain_user from the perf core.
      
      Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      88b0193d
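
The save/force/restore pattern described in the commit above can be sketched in userspace. This is only a model under stated assumptions: the `model_*` names, the `seg_t` type and the address-limit values are hypothetical stand-ins, not the kernel's `get_fs()`/`set_fs()` implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's address-limit machinery. */
typedef uint64_t seg_t;
#define MODEL_USER_DS   ((seg_t)0x00007fffffffffffULL)
#define MODEL_KERNEL_DS ((seg_t)UINT64_MAX)

static seg_t addr_limit = MODEL_USER_DS;

static seg_t model_get_fs(void) { return addr_limit; }
static void model_set_fs(seg_t fs) { addr_limit = fs; }

/* Models an access_ok()-style check: only addresses up to the limit pass. */
static bool model_access_ok(uint64_t addr) { return addr <= addr_limit; }

/* Models an architecture unwinder: returns true if it would read frame fp. */
static bool model_callchain_user(uint64_t fp) { return model_access_ok(fp); }

/* Core-side wrapper: force the user limit around the walk, then restore. */
static bool core_callchain_user(uint64_t fp)
{
    seg_t old = model_get_fs();
    bool walked;

    model_set_fs(MODEL_USER_DS);
    walked = model_callchain_user(fp);
    model_set_fs(old);

    assert(model_get_fs() == old); /* caller's limit is left untouched */
    return walked;
}
```

With the limit set to `MODEL_KERNEL_DS`, calling `model_callchain_user()` directly accepts a kernel address, while `core_callchain_user()` rejects it and leaves the caller's limit as it was.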
  4. May 09, 2017
  5. May 08, 2017
    • bpf: don't let ldimm64 leak map addresses on unprivileged · 0d0e5769
      Daniel Borkmann authored
      
      The patch fixes two things at once:
      
      1) It checks the env->allow_ptr_leaks and only prints the map address to
         the log if we have the privileges to do so, otherwise it just dumps 0
         as we would when kptr_restrict is enabled on %pK. Given the latter is
         off by default and not every distro sets it, I don't want to rely on
         this, hence the 0 by default for unprivileged.
      
      2) Printing of ldimm64 in the verifier log is currently broken in that
         we don't print the full immediate, but only the 32 bit part of the
         first insn part for ldimm64. Thus, fix this up as well; it's okay to
         access, since we verified all ldimm64 earlier already (including just
         constants) through replace_map_fd_with_map_ptr().
      
      Fixes: 1be7f75d ("bpf: enable non-root eBPF programs")
      Fixes: cbd35700 ("bpf: verifier (add ability to receive verification log)")
      Reported-by: Jann Horn <jannh@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0d0e5769
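
Both points of the commit above can be illustrated with a small sketch: the 64-bit immediate is rebuilt from the two instruction slots of an ldimm64, and the value is replaced by 0 when it is a map pointer and the loader is unprivileged. The `model_insn` struct, `MODEL_PSEUDO_MAP_FD` and `ldimm64_log_value()` are hypothetical names chosen for illustration; this is not the verifier's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified, hypothetical instruction layout (only the fields used here). */
struct model_insn {
    uint8_t code;
    uint8_t dst_reg;
    uint8_t src_reg;
    int32_t imm;
};

/* Hypothetical marker meaning "this immediate was rewritten to a map pointer". */
#define MODEL_PSEUDO_MAP_FD 1

/* Returns the value that would be printed to the log for an ldimm64 pair. */
static uint64_t ldimm64_log_value(const struct model_insn *insn,
                                  bool allow_ptr_leaks)
{
    uint64_t imm;

    assert(insn);
    /* Low half comes from the first slot, high half from the second.
     * The casts through uint32_t avoid sign-extending a negative imm. */
    imm = ((uint64_t)(uint32_t)insn[1].imm << 32) | (uint32_t)insn[0].imm;

    /* Never expose a map address to a loader without the privilege. */
    if (insn[0].src_reg == MODEL_PSEUDO_MAP_FD && !allow_ptr_leaks)
        imm = 0;

    return imm;
}
```

Printing only `insn[0].imm` would show the low 32 bits alone, which is the logging bug the second point describes.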
  6. May 05, 2017
    • ACPI / sleep: Ignore spurious SCI wakeups from suspend-to-idle · eed4d47e
      Rafael J. Wysocki authored
      
      The ACPI SCI (System Control Interrupt) is set up as a wakeup IRQ
      during suspend-to-idle transitions and, consequently, any events
      signaled through it wake up the system from that state.  However,
      on some systems some of the events signaled via the ACPI SCI while
      suspended to idle should not cause the system to wake up.  In fact,
      quite often they should just be discarded.
      
      Arguably, systems should not resume entirely on such events, but in
      order to decide which events really should cause the system to resume
      and which are spurious, it is necessary to resume up to the point
      when ACPI SCIs are actually handled and processed, which is after
      executing dpm_resume_noirq() in the system resume path.
      
      For this reason, add a loop around freeze_enter() in which the
      platforms can process events signaled via multiplexed IRQ lines
      like the ACPI SCI and add suspend-to-idle hooks that can be
      used for this purpose to struct platform_freeze_ops.
      
      In the ACPI case, the ->wake hook is used for checking if the SCI
      has triggered while suspended and deferring the interrupt-induced
      system wakeup until the events signaled through it are actually
      processed sufficiently to decide whether or not the system should
      resume.  In turn, the ->sync hook allows all of the relevant event
      queues to be flushed so as to prevent events from being missed due
      to race conditions.
      
      In addition to that, some ACPI code processing wakeup events needs
      to be modified to use the "hard" version of wakeup triggers, so that
      it will cause a system resume to happen on device-induced wakeup
      events even if the "soft" mechanism to prevent the system from
      suspending is not enabled (that also helps to catch device-induced
      wakeup events occurring during suspend transitions in progress).
      
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      eed4d47e
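
The loop around freeze_enter() with the ->wake and ->sync hooks can be modelled roughly as follows. Everything here is a userspace simulation under assumptions: the `model_*` functions, the scripted list of SCI events, and the rule that the loop also stops when the script runs out are all invented for illustration and do not reproduce the kernel's suspend code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the two hooks added to struct platform_freeze_ops. */
struct model_freeze_ops {
    void (*wake)(void);
    void (*sync)(void);
};

/* Scripted SCI events: true = genuine wakeup, false = spurious. */
static const bool *sci_script;
static size_t sci_len, sci_pos;
static bool sci_fired, sci_current, wake_deferred, wakeup_pending;

/* Models idling until the next SCI arrives. */
static void model_freeze_enter(void)
{
    sci_current = sci_script[sci_pos++];
    sci_fired = true;
}

/* ->wake: note that the SCI fired, but defer the wakeup decision. */
static void model_acpi_wake(void)
{
    if (sci_fired) {
        wake_deferred = true;
        sci_fired = false;
    }
}

/* ->sync: events are processed; only now is the wakeup decided. */
static void model_acpi_sync(void)
{
    if (wake_deferred) {
        wakeup_pending = sci_current;
        wake_deferred = false;
    }
}

/* Returns how many times the model entered idle before resuming. */
static int model_s2idle_loop(const struct model_freeze_ops *ops,
                             const bool *script, size_t len)
{
    int idle_entries = 0;

    if (len == 0)
        return 0;
    assert(script != NULL);
    sci_script = script;
    sci_len = len;
    sci_pos = 0;
    sci_fired = wake_deferred = wakeup_pending = false;

    do {
        model_freeze_enter();
        idle_entries++;
        if (ops && ops->wake)
            ops->wake();
        /* The real path would resume far enough here for SCI handlers to run. */
        if (ops && ops->sync)
            ops->sync();
    } while (!wakeup_pending && sci_pos < sci_len);

    return idle_entries;
}
```

Spurious events send the model straight back to idle; only an event that the sync step classifies as genuine ends the loop with `wakeup_pending` set.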
    • stackprotector: Increase the per-task stack canary's random range from 32 bits to 64 bits on 64-bit platforms · 5ea30e4e
      Daniel Micay authored
      
      The stack canary is an 'unsigned long' and should be fully initialized to
      random data rather than only 32 bits of random data.
      
      Signed-off-by: Daniel Micay <danielmicay@gmail.com>
      Acked-by: Arjan van de Ven <arjan@linux.intel.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Arjan van Ven <arjan@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kernel-hardening@lists.openwall.com
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20170504133209.3053-1-danielmicay@gmail.com

      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5ea30e4e
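
The difference between a canary filled from 32 bits and one that fills the whole 'unsigned long' can be shown with a toy example. The generator below is a deterministic xorshift32 stand-in chosen so the result is reproducible; it is an assumption for illustration only, is not cryptographically secure, and is not the kernel's random source.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Deterministic stand-in for a 32-bit random source (NOT secure). */
static uint32_t model_state = 0x12345678u;

static uint32_t model_random_u32(void)
{
    /* xorshift32: never returns 0 for a non-zero state */
    model_state ^= model_state << 13;
    model_state ^= model_state >> 17;
    model_state ^= model_state << 5;
    return model_state;
}

/* Old behaviour: only the low 32 bits of the canary are random. */
static unsigned long canary_from_32_bits(void)
{
    return model_random_u32();
}

/* New behaviour: every bit of the unsigned long is initialized. */
static unsigned long canary_full_width(void)
{
    unsigned long canary = model_random_u32();

#if ULONG_MAX > 0xffffffffUL
    /* 64-bit unsigned long: fill the upper half as well */
    canary |= (unsigned long)model_random_u32() << 32;
#endif
    assert(canary != 0);
    return canary;
}
```

On a platform with a 64-bit 'unsigned long', the first helper always leaves the upper 32 bits zero, which halves the number of bits an attacker has to guess.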
  7. May 04, 2017
  8. May 03, 2017
    • mm: introduce memalloc_nofs_{save,restore} API · 7dea19f9
      Michal Hocko authored

      GFP_NOFS context is used for the following 5 reasons currently:
      
       - to prevent deadlocks when the lock held by the allocation
         context would be needed during memory reclaim

       - to prevent stack overflows during reclaim because the
         allocation is already performed from a deep context

       - to prevent lockups when the allocation context depends
         indirectly on other reclaimers making forward progress
      
       - just in case because this would be safe from the fs POV
      
       - silence lockdep false positives
      
      Unfortunately, overuse of this allocation context causes problems for
      the MM.  Memory reclaim is much weaker (especially during heavy FS
      metadata workloads), and the OOM killer cannot be invoked because the
      MM layer doesn't have enough information about how much memory is
      freeable by the FS layer.
      
      In many cases it is far from clear why the weaker context is even used
      and so it might be used unnecessarily.  We would like to get rid of
      those as much as possible.  One way to do that is to use the flag in
      scopes rather than isolated cases.  Such a scope is declared when really
      necessary, tracked per task and all the allocation requests from within
      the context will simply inherit the GFP_NOFS semantic.
      
      Not only is this easier to understand and maintain, because there are
      far fewer problematic contexts than specific allocation requests, it
      also helps code paths where the FS layer interacts with other layers
      (e.g. crypto, security modules, MM etc...) and there is no easy way to
      convey the allocation context between the layers.
      
      Introduce the memalloc_nofs_{save,restore} API to control the scope of
      the GFP_NOFS allocation context.  This basically copies the
      memalloc_noio_{save,restore} API we have for the other restricted
      allocation context, GFP_NOIO.  The PF_MEMALLOC_NOFS flag already exists
      and is just an alias for PF_FSTRANS, which was xfs-specific until
      recently.  There are no PF_FSTRANS users anymore, so let's just drop it.
      
      PF_MEMALLOC_NOFS is now checked in the MM layer and implicitly drops
      __GFP_FS, the same way PF_MEMALLOC_NOIO drops __GFP_IO.
      memalloc_noio_flags is renamed to current_gfp_context because it now
      cares about both the PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts.
      Xfs code paths preserve their semantics.  kmem_flags_convert() doesn't
      need to evaluate the flag anymore.
      
      This patch shouldn't introduce any functional changes.
      
      Let's hope that filesystems will drop direct GFP_NOFS (resp.  ~__GFP_FS)
      usage as much as possible and only use properly documented
      memalloc_nofs_{save,restore} checkpoints where they are appropriate.
      
      [akpm@linux-foundation.org: fix comment typo, reflow comment]
      Link: http://lkml.kernel.org/r/20170306131408.9828-5-mhocko@kernel.org

      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7dea19f9
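
The scoped behaviour described in the commit above (a per-task flag set by save, cleared by restore, and applied to every allocation request in between) can be sketched as a userspace model. The bit values, the `model_` prefixes and the single global `task_flags` standing in for the current task's flags are assumptions made for illustration; the real definitions live in the kernel headers.

```c
#include <assert.h>

/* Hypothetical bit values; the kernel's real masks differ. */
#define MODEL_GFP_IO      0x1u
#define MODEL_GFP_FS      0x2u
#define MODEL_GFP_KERNEL  (MODEL_GFP_IO | MODEL_GFP_FS)

#define MODEL_PF_MEMALLOC_NOIO 0x1u
#define MODEL_PF_MEMALLOC_NOFS 0x2u

/* Stands in for the per-task flags word of the current task. */
static unsigned int task_flags;

/* Enter a NOFS scope; returns the previous state of the NOFS bit. */
static unsigned int model_memalloc_nofs_save(void)
{
    unsigned int old = task_flags & MODEL_PF_MEMALLOC_NOFS;

    task_flags |= MODEL_PF_MEMALLOC_NOFS;
    return old;
}

/* Leave a NOFS scope by putting the saved bit back. */
static void model_memalloc_nofs_restore(unsigned int old)
{
    assert((old & ~MODEL_PF_MEMALLOC_NOFS) == 0);
    task_flags = (task_flags & ~MODEL_PF_MEMALLOC_NOFS) | old;
}

/* Applied to every request: the scope narrows the caller's gfp mask. */
static unsigned int model_current_gfp_context(unsigned int gfp)
{
    if (task_flags & MODEL_PF_MEMALLOC_NOIO)
        gfp &= ~(MODEL_GFP_IO | MODEL_GFP_FS);
    else if (task_flags & MODEL_PF_MEMALLOC_NOFS)
        gfp &= ~MODEL_GFP_FS;
    return gfp;
}
```

Because save returns the previous bit and restore writes it back, scopes nest: leaving an inner scope does not end an outer one, and callees can keep passing the full mask without knowing they run inside a NOFS scope.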
    • lockdep: allow to disable reclaim lockup detection · 7e784422
      Michal Hocko authored

      The current implementation of the reclaim lockup detection can produce
      false positives.  These do happen in practice and usually lead to the
      code being tweaked to silence lockdep by using GFP_NOFS, even though
      the context can use __GFP_FS just fine.
      
      See
      
        http://lkml.kernel.org/r/20160512080321.GA18496@dastard
      
      as an example.
      
        =================================
        [ INFO: inconsistent lock state ]
        4.5.0-rc2+ #4 Tainted: G           O
        ---------------------------------
        inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
        kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:
      
        (&xfs_nondir_ilock_class){++++-+}, at: xfs_ilock+0x177/0x200 [xfs]
      
        {RECLAIM_FS-ON-R} state was registered at:
          mark_held_locks+0x79/0xa0
          lockdep_trace_alloc+0xb3/0x100
          kmem_cache_alloc+0x33/0x230
          kmem_zone_alloc+0x81/0x120 [xfs]
          xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
          __xfs_refcount_find_shared+0x75/0x580 [xfs]
          xfs_refcount_find_shared+0x84/0xb0 [xfs]
          xfs_getbmap+0x608/0x8c0 [xfs]
          xfs_vn_fiemap+0xab/0xc0 [xfs]
          do_vfs_ioctl+0x498/0x670
          SyS_ioctl+0x79/0x90
          entry_SYSCALL_64_fastpath+0x12/0x6f
      
               CPU0
               ----
          lock(&xfs_nondir_ilock_class);
          <Interrupt>
            lock(&xfs_nondir_ilock_class);
      
         *** DEADLOCK ***
      
        3 locks held by kswapd0/543:
      
        stack backtrace:
        CPU: 0 PID: 543 Comm: kswapd0 Tainted: G           O    4.5.0-rc2+ #4
        Call Trace:
         lock_acquire+0xd8/0x1e0
         down_write_nested+0x5e/0xc0
         xfs_ilock+0x177/0x200 [xfs]
         xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
         xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
         evict+0xc5/0x190
         dispose_list+0x39/0x60
         prune_icache_sb+0x4b/0x60
         super_cache_scan+0x14f/0x1a0
         shrink_slab.part.63.constprop.79+0x1e9/0x4e0
         shrink_zone+0x15e/0x170
         kswapd+0x4f1/0xa80
         kthread+0xf2/0x110
         ret_from_fork+0x3f/0x70
      
      To quote Dave:
       "Ignoring whether reflink should be doing anything or not, that's a
        "xfs_refcountbt_init_cursor() gets called both outside and inside
        transactions" lockdep false positive case. The problem here is lockdep
        has seen this allocation from within a transaction, hence a GFP_NOFS
        allocation, and now it's seeing it in a GFP_KERNEL context. Also note
        that we have an active reference to this inode.
      
        So, because the reclaim annotations overload the interrupt level
        detections and it's seen the inode ilock been taken in reclaim
        ("interrupt") context, this triggers a reclaim context warning where
        it thinks it is unsafe to do this allocation in GFP_KERNEL context
        holding the inode ilock..."
      
      This sounds like a fundamental problem of the reclaim lock detection.
      It is really impossible to annotate such a special usecase IMHO unless
      the reclaim lockup detection is reworked completely.  Until then it is
      much better to provide an "I know what I am doing" flag and mark the
      problematic places.  This would prevent abuse of the GFP_NOFS flag,
      which has a runtime effect even on configurations which have lockdep
      disabled.
      
      Introduce __GFP_NOLOCKDEP flag which tells the lockdep gfp tracking to
      skip the current allocation request.
      
      While we are at it, also make sure that the radix tree doesn't
      accidentally override tags stored in the upper part of the gfp_mask.
      
      Link: http://lkml.kernel.org/r/20170306131408.9828-3-mhocko@kernel.org

      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7e784422
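
A rough model of the two ideas in the commit above: an opt-out bit that makes the gfp tracking skip a request, and a storage word where tags sit above the gfp bits, so the gfp bits have to be masked out when read back. All constants and `model_*` helpers are hypothetical; in particular the bit positions and the tag layout are assumptions, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit layout: three gfp bits, tags stored above them. */
#define MODEL_GFP_IO          0x1u
#define MODEL_GFP_FS          0x2u
#define MODEL_GFP_NOLOCKDEP   0x4u
#define MODEL_GFP_BITS_SHIFT  3
#define MODEL_GFP_BITS_MASK   ((1u << MODEL_GFP_BITS_SHIFT) - 1)

/* Would the reclaim tracking look at this allocation request at all? */
static bool model_lockdep_tracks(unsigned int gfp)
{
    if (!(gfp & MODEL_GFP_FS))
        return false;   /* request cannot recurse into the filesystem */
    if (gfp & MODEL_GFP_NOLOCKDEP)
        return false;   /* caller opted out: "I know what I am doing" */
    return true;
}

/* Set a tag in the bits above the gfp bits of a combined word. */
static unsigned int model_root_tag_set(unsigned int stored, unsigned int tag)
{
    assert(tag < 8);
    return stored | (1u << (tag + MODEL_GFP_BITS_SHIFT));
}

/* Recover only the gfp bits from the combined word. */
static unsigned int model_root_gfp_mask(unsigned int stored)
{
    return stored & MODEL_GFP_BITS_MASK;
}
```

Unlike switching the request to GFP_NOFS, the opt-out bit in this model only changes what the tracker sees; the request keeps `MODEL_GFP_FS`, so reclaim behaviour is unaffected.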
    • lockdep: teach lockdep about memalloc_noio_save · 6d7225f0
      Nikolay Borisov authored

      Patch series "scope GFP_NOFS api", v5.
      
      This patch (of 7):
      
      Commit 21caf2fc ("mm: teach mm by current context info to not do I/O
      during memory allocation") added the memalloc_noio_(save|restore)
      functions to enable people to modify the MM behavior by disabling I/O
      during memory allocation.
      
      This was further extended in commit 934f3072 ("mm: clear __GFP_FS
      when PF_MEMALLOC_NOIO is set").
      
      memalloc_noio_* functions prevent allocation paths recursing back into
      the filesystem without explicitly changing the flags for every
      allocation site.
      
      However, lockdep hasn't kept up with the changes and entirely misses
      the memalloc_noio adjustments.  Instead, it is left to the callers of
      __lockdep_trace_alloc to call the function after they have shaved off
      the respective GFP flags, which can lead to false positives:
      
        =================================
         [ INFO: inconsistent lock state ]
         4.10.0-nbor #134 Not tainted
         ---------------------------------
         inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
         fsstress/3365 [HC0[0]:SC0[0]:HE1:SE1] takes:
          (&xfs_nondir_ilock_class){++++?.}, at: xfs_ilock+0x141/0x230
         {IN-RECLAIM_FS-W} state was registered at:
           __lock_acquire+0x62a/0x17c0
           lock_acquire+0xc5/0x220
           down_write_nested+0x4f/0x90
           xfs_ilock+0x141/0x230
           xfs_reclaim_inode+0x12a/0x320
           xfs_reclaim_inodes_ag+0x2c8/0x4e0
           xfs_reclaim_inodes_nr+0x33/0x40
           xfs_fs_free_cached_objects+0x19/0x20
           super_cache_scan+0x191/0x1a0
           shrink_slab+0x26f/0x5f0
           shrink_node+0xf9/0x2f0
           kswapd+0x356/0x920
           kthread+0x10c/0x140
           ret_from_fork+0x31/0x40
         irq event stamp: 173777
         hardirqs last  enabled at (173777): __local_bh_enable_ip+0x70/0xc0
         hardirqs last disabled at (173775): __local_bh_enable_ip+0x37/0xc0
         softirqs last  enabled at (173776): _xfs_buf_find+0x67a/0xb70
         softirqs last disabled at (173774): _xfs_buf_find+0x5db/0xb70
      
         other info that might help us debug this:
          Possible unsafe locking scenario:
      
                CPU0
                ----
           lock(&xfs_nondir_ilock_class);
           <Interrupt>
             lock(&xfs_nondir_ilock_class);
      
          *** DEADLOCK ***
      
         4 locks held by fsstress/3365:
          #0:  (sb_writers#10){++++++}, at: mnt_want_write+0x24/0x50
          #1:  (&sb->s_type->i_mutex_key#12){++++++}, at: vfs_setxattr+0x6f/0xb0
          #2:  (sb_internal#2){++++++}, at: xfs_trans_alloc+0xfc/0x140
          #3:  (&xfs_nondir_ilock_class){++++?.}, at: xfs_ilock+0x141/0x230
      
         stack backtrace:
         CPU: 0 PID: 3365 Comm: fsstress Not tainted 4.10.0-nbor #134
         Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
         Call Trace:
          kmem_cache_alloc_node_trace+0x3a/0x2c0
          vm_map_ram+0x2a1/0x510
          _xfs_buf_map_pages+0x77/0x140
          xfs_buf_get_map+0x185/0x2a0
          xfs_attr_rmtval_set+0x233/0x430
          xfs_attr_leaf_addname+0x2d2/0x500
          xfs_attr_set+0x214/0x420
          xfs_xattr_set+0x59/0xb0
          __vfs_setxattr+0x76/0xa0
          __vfs_setxattr_noperm+0x5e/0xf0
          vfs_setxattr+0xae/0xb0
          setxattr+0x15e/0x1a0
          path_setxattr+0x8f/0xc0
          SyS_lsetxattr+0x11/0x20
          entry_SYSCALL_64_fastpath+0x23/0xc6
      
      Let's fix this by making lockdep explicitly do the shaving of respective
      GFP flags.
      
      Fixes: 934f3072 ("mm: clear __GFP_FS when PF_MEMALLOC_NOIO is set")
      Link: http://lkml.kernel.org/r/20170306131408.9828-2-mhocko@kernel.org

      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6d7225f0
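
The false positive and its fix can be reduced to a question of who applies the task context to the gfp mask. In the sketch below, the "caller-dependent" checker trusts whatever mask it is handed, while the "self-shaving" checker applies the task's NOIO scope itself before looking at the FS bit. The names, bit values and the global `task_flags` are hypothetical and only model the idea.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values for the model. */
#define MODEL_GFP_IO      0x1u
#define MODEL_GFP_FS      0x2u
#define MODEL_GFP_KERNEL  (MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_PF_MEMALLOC_NOIO 0x1u

/* Stands in for the per-task flags word of the current task. */
static unsigned int task_flags;

/* A NOIO scope removes both the IO and the FS bit from a request. */
static unsigned int model_noio_flags(unsigned int gfp)
{
    if (task_flags & MODEL_PF_MEMALLOC_NOIO)
        gfp &= ~(MODEL_GFP_IO | MODEL_GFP_FS);
    return gfp;
}

/* Before: relies on the caller having masked the flags already. */
static bool model_trace_alloc_caller_dependent(unsigned int gfp)
{
    return (gfp & MODEL_GFP_FS) != 0;
}

/* After: the checker applies the task context on its own. */
static bool model_trace_alloc_self_shaving(unsigned int gfp)
{
    gfp = model_noio_flags(gfp);
    assert((gfp & ~MODEL_GFP_KERNEL) == 0);
    return (gfp & MODEL_GFP_FS) != 0;
}
```

Inside a NOIO scope, a raw `MODEL_GFP_KERNEL` request is reported as "may enter FS reclaim" by the first checker even though the allocator would never do so; the second checker agrees with the allocator.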
  9. May 02, 2017