  1. Oct 20, 2020
  2. Oct 19, 2020
    • bpf: Enforce id generation for all may-be-null register type · 93c230e3
      Martin KaFai Lau authored
      
      
      The commit af7ec138 ("bpf: Add bpf_skc_to_tcp6_sock() helper")
      introduces RET_PTR_TO_BTF_ID_OR_NULL and
      the commit eaa6bcb7 ("bpf: Introduce bpf_per_cpu_ptr()")
      introduces RET_PTR_TO_MEM_OR_BTF_ID_OR_NULL.
      Note that for RET_PTR_TO_MEM_OR_BTF_ID_OR_NULL, the reg0->type
      could become PTR_TO_MEM_OR_NULL which is not covered by
      BPF_PROBE_MEM.
      
      BPF_REG_0 will then hold an _OR_NULL pointer type.  This _OR_NULL
      pointer type requires the bpf program to do an explicit NULL check first.
      After NULL check, the verifier will mark all registers having
      the same reg->id as safe to use.  However, the reg->id
      is not set for those new _OR_NULL return types.  One way this can
      go wrong: NULL-checking one btf_id-typed pointer would end up
      validating all other btf_id-typed pointers, because all of them
      have id == 0.  The later tests will exercise this path.
      
      To fix it and also avoid similar issue in the future, this patch
      moves the id generation logic out of each individual RET type
      test in check_helper_call().  Instead, it does one
      reg_type_may_be_null() test and then does the id generation
      if needed.
      
      This patch also adds a WARN_ON_ONCE in mark_ptr_or_null_reg()
      to catch future breakage.
      
      The _OR_NULL pointer usage in the bpf_iter_reg.ctx_arg_info is
      fine because it just happens that the existing id generation after
      check_ctx_access() has covered it.  It also uses
      reg_type_may_be_null() to decide whether id generation is needed.
      
      Fixes: af7ec138 ("bpf: Add bpf_skc_to_tcp6_sock() helper")
      Fixes: eaa6bcb7 ("bpf: Introduce bpf_per_cpu_ptr()")
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20201019194212.1050855-1-kafai@fb.com
    • bpf: Remove unneeded break · 76702a2e
      Tom Rix authored
      
      
      A break is not needed if it is preceded by a return.
      
      Signed-off-by: Tom Rix <trix@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20201019173846.1021-1-trix@redhat.com
  3. Oct 18, 2020
    • mm/madvise: introduce process_madvise() syscall: an external memory hinting API · ecb8ac8b
      Minchan Kim authored
      There is a use case where system management software (SMS) wants to
      give a memory hint like MADV_[COLD|PAGEOUT] to other processes; in
      the case of Android, it is the ActivityManagerService.
      
      The information required to make the reclaim decision is not known to the
      app.  Instead, it is known to the centralized userspace
      daemon(ActivityManagerService), and that daemon must be able to initiate
      reclaim on its own without any app involvement.
      
      To solve the issue, this patch introduces a new syscall
      process_madvise(2).  It uses the pidfd of an external process to give
      the hint.  It also supports vectored address ranges because an Android
      app has thousands of vmas due to zygote, so it is a waste of CPU and
      power to call the syscall one by one for each vma.  (Testing a
      2000-vma syscall vs a 1-vector syscall showed a 15% performance
      improvement; the difference would likely be bigger in practice because
      the test ran in a very cache-friendly environment.)
      
      Another potential use case for the vectored range is to amortize the
      cost of TLB shootdowns for multiple ranges when using MADV_DONTNEED;
      this could benefit users like TCP receive zerocopy and malloc
      implementations.  In the future we may find more use cases for other
      advice values, so let's establish this as an API now that we are
      introducing a new syscall.  With that, existing madvise(2) users can
      replace it with process_madvise(2) using their own pid if they want
      batched address range support.
      
      Since it could affect another process's address range, only a
      privileged process (PTRACE_MODE_ATTACH_FSCREDS) or one that otherwise
      has the right to ptrace the target process (e.g., by being the same
      UID) can use it successfully.
      The flag argument is reserved for future use if we need to extend the API.
      
      Supporting every hint madvise(2) has (or will have) in
      process_madvise is rather risky: we are not sure all hints make sense
      from an external process, and the implementation of a hint may rely
      on the caller being in the current context, so it could be
      error-prone.  Thus, this patch limits the hints to MADV_[COLD|PAGEOUT].
      
      If someone wants to add other hints, we can hear the use case and
      review it for each hint.  That is safer for maintenance than
      introducing a buggy syscall that is hard to fix later.
      
      So finally, the API is as follows,
      
            ssize_t process_madvise(int pidfd, const struct iovec *iovec,
                      unsigned long vlen, int advice, unsigned int flags);
      
          DESCRIPTION
            The process_madvise() system call is used to give advice or
            directions to the kernel about the address ranges of an external
            process as well as the local process.  It provides the advice
            for the address ranges of the process described by iovec and
            vlen.  The goal of such advice is to improve system or
            application performance.

            The pidfd argument selects the process referred to by the PID
            file descriptor specified in pidfd.  (See pidfd_open(2) for
            further information.)
      
            The pointer iovec points to an array of iovec structures, defined in
            <sys/uio.h> as:
      
              struct iovec {
                  void *iov_base;         /* starting address */
                  size_t iov_len;         /* number of bytes to be advised */
              };
      
            The iovec describes address ranges beginning at the address
            (iov_base) and with the size in bytes (iov_len).
      
            The vlen represents the number of elements in iovec.
      
            The advice is indicated in the advice argument.  At this moment,
            if the target process specified by pidfd is external, it is one
            of the following:
      
              MADV_COLD
              MADV_PAGEOUT
      
            Permission to provide a hint to an external process is governed
            by a ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see
            ptrace(2).
      
            process_madvise supports every advice madvise(2) has if the
            target process is in the same thread group as the calling
            process, so a user could use process_madvise(2) with their own
            pid to extend existing madvise(2) with vectored address range
            support.
      
          RETURN VALUE
            On success, process_madvise() returns the number of bytes
            advised.  This return value may be less than the total number
            of requested bytes if an error occurred.  The caller should
            check the return value to determine whether a partial advice
            occurred.
      
      FAQ:
      
      Q.1 - Why does any external entity have better knowledge?
      
      Quote from Sandeep
      
      "For Android, every application (including the special SystemServer)
      is forked from Zygote.  The reason of course is to share as many
      libraries and classes between the two as possible to benefit from the
      preloading during boot.
      
      After applications start, (almost) all of the APIs end up calling into
      this SystemServer process over IPC (binder) and back to the
      application.
      
      In a fully running system, the SystemServer monitors every single
      process periodically to calculate their PSS / RSS and also decides
      which process is "important" to the user for interactivity.
      
      So, because of how these processes start _and_ the fact that the
      SystemServer is looping to monitor each process, it does tend to *know*
      which address range of the application is not used / useful.
      
      Besides, we can never rely on applications to clean things up
      themselves.  We've had the "hey app1, the system is low on memory,
      please trim your memory usage down" notifications for a long time[1].
      They rely on applications honoring the broadcasts and very few do.
      
      So, if we want to avoid the inevitable killing of the application and
      restarting it, some way to be able to tell the OS about unimportant
      memory in these applications will be useful.
      
      - ssp
      
      Q.2 - How is the race (i.e., object validation) handled between an
      external process giving a hint and the target process changing its
      own address space?
      
      process_madvise operates on the target process's address space as it
      exists at the instant that process_madvise is called.  If the target
      process can run between the time the process_madvise caller inspects
      the target process's address space and the time that process_madvise
      is actually called, process_madvise may operate on memory regions
      that the calling process does not expect.  It's the
      responsibility of the process calling process_madvise to close this
      race condition.  For example, the calling process can suspend the
      target process with ptrace, SIGSTOP, or the freezer cgroup so that it
      doesn't have an opportunity to change its own address space before
      process_madvise is called.  Another option is to operate on memory
      regions that the caller knows a priori will be unchanged in the target
      process.  Yet another option is to accept the race for certain
      process_madvise calls after reasoning that mistargeting will do no
      harm.  The suggested API itself does not provide synchronization.  The
      same applies to other APIs such as move_pages(2) and process_vm_writev(2).
      
      The race isn't really a problem though.  Why is it so wrong to require
      that callers do their own synchronization in some manner?  Nobody
      objects to write(2) merely because it's possible for two processes to
      open the same file and clobber each other's writes --- instead, we tell
      people to use flock or something.  Think about mmap: it never
      guarantees newly allocated address space is still valid when the user
      tries to access it, because other threads could unmap the memory
      right before.  That's where we need synchronization via another API
      or by design on the userspace side; it shouldn't be part of the API
      itself.  If someone needs more fine-grained synchronization than
      process level, two ideas were suggested - a cookie[2] and an
      anon fd[3].  Both are applicable via the last reserved argument of
      the API, but I don't think they are necessary right now since we
      already have ways to prevent the race, so I don't want to add
      additional complexity for a more fine-grained optimization model.
      
      To keep the API extensible, the last argument is reserved so such a
      scheme could be supported in the future if someone really needs it.
      
      Q.3 - Why doesn't ptrace work?
      
      Injecting an madvise in the target process using ptrace would not
      work for us because the injected madvise would have to be executed by
      the target process, which means that process would have to be
      runnable, and that creates the risk of the abovementioned race and of
      hinting a wrong VMA.  Furthermore, we want to act on the hint in the
      caller's context, not the callee's, because the callee is usually
      limited in cpuset/cgroups or even in a frozen state, so it can't act
      by itself quickly enough, which causes more thrashing/kills.  It also
      doesn't work if the target process is already ptraced (e.g., by
      strace, a debugger, or minidump) because a process can have at most
      one ptracer.
      
      [1] https://developer.android.com/topic/performance/memory"
      
      [2] process_getinfo for getting the cookie which is updated whenever
          vma of process address layout are changed - Daniel Colascione -
          https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224
      
      [3] anonymous fd which is used for the object(i.e., address range)
          validation - Michal Hocko -
          https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/
      
      [minchan@kernel.org: fix process_madvise build break for arm64]
        Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com
      [minchan@kernel.org: fix build error for mips of process_madvise]
        Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com
      [akpm@linux-foundation.org: fix patch ordering issue]
      [akpm@linux-foundation.org: fix arm64 whoops]
      [minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian]
      [akpm@linux-foundation.org: fix i386 build]
      [sfr@canb.auug.org.au: fix syscall numbering]
        Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au
      [sfr@canb.auug.org.au: madvise.c needs compat.h]
        Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au
      [minchan@kernel.org: fix mips build]
        Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com
      [yuehaibing@huawei.com: remove duplicate header which is included twice]
        Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com
      [minchan@kernel.org: do not use helper functions for process_madvise]
        Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com
      [akpm@linux-foundation.org: pidfd_get_pid() gained an argument]
      [sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"]
        Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.au
      
      
      
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Dias <joaodias@google.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Sandeep Patil <sspatil@google.com>
      Cc: SeongJae Park <sj38.park@gmail.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Florian Weimer <fw@deneb.enyo.de>
      Cc: <linux-man@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com
      Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org
      Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.org
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • pid: move pidfd_get_pid() to pid.c · 1aa92cd3
      Minchan Kim authored
      
      
      The process_madvise syscall needs the pidfd_get_pid() function to
      translate a pidfd to a pid, so this patch moves the function to
      kernel/pid.c.
      
      Suggested-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jann Horn <jannh@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Dias <joaodias@google.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Sandeep Patil <sspatil@google.com>
      Cc: SeongJae Park <sj38.park@gmail.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Florian Weimer <fw@deneb.enyo.de>
      Cc: <linux-man@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200302193630.68771-5-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20200622192900.22757-3-minchan@kernel.org
      Link: https://lkml.kernel.org/r/20200901000633.1920247-3-minchan@kernel.org
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. Oct 17, 2020
    • task_work: cleanup notification modes · 91989c70
      Jens Axboe authored
      
      
      A previous commit changed the notification mode from true/false to an
      int, allowing notify-no, notify-yes, or signal-notify.  This was
      backwards compatible in the sense that any existing true/false user
      would translate to either 0 (no notification sent) or 1, the latter
      of which mapped to TWA_RESUME.  TWA_SIGNAL was assigned a value of 2.
      
      Clean this up properly, and define a proper enum for the notification
      mode. Now we have:
      
      - TWA_NONE. This is 0, same as before the original change, meaning no
        notification requested.
      - TWA_RESUME. This is 1, same as before the original change, meaning
        that we use TIF_NOTIFY_RESUME.
      - TWA_SIGNAL. This uses TIF_SIGPENDING/JOBCTL_TASK_WORK for the
        notification.
      
      Clean up all the callers, switching their 0/1/false/true to using the
      appropriate TWA_* mode for notifications.
      
      Fixes: e91b4816 ("task_work: teach task_work_add() to do signal_wake_up()")
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • tracehook: clear TIF_NOTIFY_RESUME in tracehook_notify_resume() · 3c532798
      Jens Axboe authored
      
      
      All the callers currently do this, clean it up and move the clearing
      into tracehook_notify_resume() instead.
      
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. Oct 16, 2020
  6. Oct 15, 2020
  7. Oct 14, 2020
    • mm, oom_adj: don't loop through tasks in __set_oom_adj when not necessary · 67197a4f
      Suren Baghdasaryan authored
      Currently __set_oom_adj loops through all processes in the system to
      keep oom_score_adj and oom_score_adj_min in sync between processes
      sharing their mm.  This is done for any task with more than one
      mm_user, which includes processes with multiple threads (sharing mm
      and signals).  However, for such processes the loop is unnecessary
      because their signal structure is shared as well.
      
      Android updates oom_score_adj whenever a task changes its role
      (background/foreground/...) or binds to/unbinds from a service,
      making it more/less important.  Such operations can happen
      frequently.  We noticed
      that updates to oom_score_adj became more expensive and after further
      investigation found out that the patch mentioned in "Fixes" introduced a
      regression.  Using Pixel 4 with a typical Android workload, write time to
      oom_score_adj increased from ~3.57us to ~362us.  Moreover this regression
      linearly depends on the number of multi-threaded processes running on the
      system.
      
      Mark the mm with a new MMF_MULTIPROCESS flag bit when task is created with
      (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK).  Change __set_oom_adj to use
      MMF_MULTIPROCESS instead of mm_users to decide whether oom_score_adj
      update should be synchronized between multiple processes.  To prevent
      races between clone() and __set_oom_adj(), when oom_score_adj of the
      process being cloned might be modified from userspace, we use
      oom_adj_mutex.  Its scope is changed to global.
      
      The combination of (CLONE_VM && !CLONE_THREAD) is rarely used except for
      the case of vfork().  To prevent performance regressions of vfork(), we
      skip taking oom_adj_mutex and setting MMF_MULTIPROCESS when CLONE_VFORK is
      specified.  Clearing the MMF_MULTIPROCESS flag (when the last process
      sharing the mm exits) is left out of this patch to keep it simple and
      because it is believed that this threading model is rare.  Should there
      ever be a need for optimizing that case as well, it can be done by hooking
      into the exit path, likely following the mm_update_next_owner pattern.
      
      With the combination of (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK) being
      quite rare, the regression is gone after the change is applied.
      
      [surenb@google.com: v3]
        Link: https://lkml.kernel.org/r/20200902012558.2335613-1-surenb@google.com
      
      
      
      Fixes: 44a70ade ("mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj")
      Reported-by: Tim Murray <timmurray@google.com>
      Suggested-by: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Eugene Syromiatnikov <esyr@redhat.com>
      Cc: Christian Kellner <christian@kellner.me>
      Cc: Adrian Reber <areber@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Alexey Gladkov <gladkov.alexey@gmail.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
      Cc: John Johansen <john.johansen@canonical.com>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Link: https://lkml.kernel.org/r/20200824153036.3201505-1-surenb@google.com
      
      
      Debugged-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dma-contiguous: simplify cma_early_percent_memory() · e9aa36cc
      Mike Rapoport authored
      
      
      The memory size calculation in cma_early_percent_memory() traverses
      memblock.memory rather than simply calling memblock_phys_mem_size().
      The comment in that function suggests that at some point there should
      have been a call to memblock_analyze() before memblock_phys_mem_size()
      could be used.  As of now, there is no memblock_analyze() at all and
      memblock_phys_mem_size() can be used as soon as cold-plug memory is
      registered with memblock.

      Replace the loop over memblock.memory with a call to
      memblock_phys_mem_size().
      
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Daniel Axtens <dja@axtens.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Emil Renner Berthing <kernel@esmil.dk>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: https://lkml.kernel.org/r/20200818151634.14343-3-rppt@kernel.org
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove src/dst mm parameter in copy_page_range() · c78f4636
      Peter Xu authored
      Both of the mm pointers are not needed after commit 7a4830c3
      ("mm/fork: Pass new vma pointer into copy_page_range()").
      
      Jason Gunthorpe also reported that the parameter ordering of
      copy_page_range() is odd.  While at it, reorder the parameters to be
      logical: (1) always put the dst_* fields before the src_* fields, and
      (2) keep parameters of the same type together.
      
      [peterx@redhat.com: further reorder some parameters and line format, per Jason]
        Link: https://lkml.kernel.org/r/20201002192647.7161-1-peterx@redhat.com
      [peterx@redhat.com: fix warnings]
        Link: https://lkml.kernel.org/r/20201006200138.GA6026@xz-x1
      
      
      
      Reported-by: Kirill A. Shutemov <kirill@shutemov.name>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Link: https://lkml.kernel.org/r/20200930204950.6668-1-peterx@redhat.com
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use helper function mapping_allow_writable() · cf508b58
      Miaohe Lin authored
      
      
      Commit 4bb5f5d9 ("mm: allow drivers to prevent new writable mappings")
      changed i_mmap_writable from unsigned int to atomic_t and added the
      helper function mapping_allow_writable() to atomic_inc
      i_mmap_writable.  But it forgot to use this helper function in
      dup_mmap() and __vma_link_file().
      
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Christian Kellner <christian@kellner.me>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Adrian Reber <areber@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20200917112736.7789-1-linmiaohe@huawei.com
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • resource: report parent to walk_iomem_res_desc() callback · 73fb952d
      Dan Williams authored
      
      
      In support of detecting whether a resource might have been claimed,
      report the parent to the walk_iomem_res_desc() callback.  For example, the
      ACPI HMAT parser publishes "hmem" platform devices per target range.
      However, if the HMAT is disabled / missing a fallback driver can attach
      devices to the raw memory ranges as a fallback if it sees unclaimed /
      orphan "Soft Reserved" resources in the resource tree.
      
      Otherwise, find_next_iomem_res() returns a resource with garbage data from
      the stack allocation in __walk_iomem_res_desc() for the res->parent field.
      
      There are currently no users that expect ->child and ->sibling to be
      valid, and the resource_lock would be needed to traverse them.  Use a
      compound literal to implicitly zero initialize the fields that are not
      being returned in addition to setting ->parent.
      
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brice Goglin <Brice.Goglin@inria.fr>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Jia He <justin.he@arm.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Hulk Robot <hulkci@huawei.com>
      Cc: Jason Yan <yanaijie@huawei.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Link: https://lkml.kernel.org/r/159643097166.4062302.11875688887228572793.stgit@dwillia2-desk3.amr.corp.intel.com
      
      
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      73fb952d
  8. Oct 12, 2020
    • Daniel Jordan's avatar
      module: statically initialize init section freeing data · fdf09ab8
      Daniel Jordan authored
      Corentin hit the following workqueue warning when running with
      CRYPTO_MANAGER_EXTRA_TESTS:
      
        WARNING: CPU: 2 PID: 147 at kernel/workqueue.c:1473 __queue_work+0x3b8/0x3d0
        Modules linked in: ghash_generic
        CPU: 2 PID: 147 Comm: modprobe Not tainted
            5.6.0-rc1-next-20200214-00068-g166c9264f0b1-dirty #545
        Hardware name: Pine H64 model A (DT)
        pc : __queue_work+0x3b8/0x3d0
        Call trace:
         __queue_work+0x3b8/0x3d0
         queue_work_on+0x6c/0x90
         do_init_module+0x188/0x1f0
         load_module+0x1d00/0x22b0
      
      I wasn't able to reproduce on x86 or rpi 3b+.
      
      This is
      
        WARN_ON(!list_empty(&work->entry))
      
      from __queue_work(), and it happens because the init_free_wq work item
      isn't initialized in time for a crypto test that requests the gcm
      module.  Some crypto tests were recently moved earlier in boot as
      explained in commit c4741b23 ("crypto: run initcalls for generic
      implementations earlier"), which went into mainline less than two weeks
      before the Fixes commit.
      
      Avoid the warning by statically initializing init_free_wq and the
      corresponding llist.
      
      Link: https://lore.kernel.org/lkml/20200217204803.GA13479@Red/
      
      
      Fixes: 1a7b7d92 ("modules: Use vmalloc special flag")
      Reported-by: default avatarCorentin Labbe <clabbe.montjoie@gmail.com>
      Tested-by: default avatarCorentin Labbe <clabbe.montjoie@gmail.com>
      Tested-on: sun50i-h6-pine-h64
      Tested-on: imx8mn-ddr4-evk
      Tested-on: sun50i-a64-bananapi-m64
      Reviewed-by: default avatarEric Biggers <ebiggers@google.com>
      Signed-off-by: default avatarDaniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: default avatarJessica Yu <jeyu@kernel.org>
      fdf09ab8
    • Jiri Olsa's avatar
      perf/core: Fix race in the perf_mmap_close() function · f91072ed
      Jiri Olsa authored
      
      
      There's a possible race in perf_mmap_close() when checking ring buffer's
      mmap_count refcount value. The problem is that the mmap_count check is
      not atomic because we call atomic_dec() and atomic_read() separately.
      
        perf_mmap_close:
        ...
         atomic_dec(&rb->mmap_count);
         ...
         if (atomic_read(&rb->mmap_count))
            goto out_put;
      
         <ring buffer detach>
         free_uid
      
      out_put:
        ring_buffer_put(rb); /* could be last */
      
      The race can happen when we have two (or more) events sharing same ring
      buffer and they go through atomic_dec() and then they both see 0 as refcount
      value later in atomic_read(). Then both will go on and execute code which
      is meant to be run just once.
      
      The code that detaches the ring buffer is probably fine to execute more
      than once, but the problem is the duplicate call to free_uid(), which
      later manifests in related crashes and refcount warnings, like:
      
        refcount_t: addition on 0; use-after-free.
        ...
        RIP: 0010:refcount_warn_saturate+0x6d/0xf
        ...
        Call Trace:
        prepare_creds+0x190/0x1e0
        copy_creds+0x35/0x172
        copy_process+0x471/0x1a80
        _do_fork+0x83/0x3a0
        __do_sys_wait4+0x83/0x90
        __do_sys_clone+0x85/0xa0
        do_syscall_64+0x5b/0x1e0
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fix this by using a single atomic decrement-and-test operation instead
      of the separate atomic_dec() and atomic_read() calls.
      
      Tested-by: default avatarMichael Petlan <mpetlan@redhat.com>
      Signed-off-by: default avatarJiri Olsa <jolsa@kernel.org>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: default avatarNamhyung Kim <namhyung@kernel.org>
      Acked-by: default avatarWade Mealing <wmealing@redhat.com>
      Fixes: 9bb5d40c ("perf: Fix mmap() accounting hole")
      Link: https://lore.kernel.org/r/20200916115311.GE2301783@krava
      f91072ed