Skip to content
  1. Mar 22, 2017
  2. Feb 23, 2017
    • Andrea Arcangeli's avatar
      userfaultfd: document _IOR/_IOW · e067eba5
      Andrea Arcangeli authored
      Patch series "userfaultfd tmpfs/hugetlbfs/non-cooperative", v2
      
      These userfaultfd features are finished and are ready for larger
      exposure in -mm and upstream merging.
      
      1) tmpfs non present userfault
      2) hugetlbfs non present userfault
      3) non cooperative userfault for fork/madvise/mremap
      
      qemu development code is already exercising 2) and container postcopy
      live migration needs 3).
      
      1) is not currently used but there's a self test and we know some qemu
      user for various reasons uses tmpfs as backing for KVM so it'll need it
      too to use postcopy live migration with tmpfs memory.
      
      All review feedback from the previous submit has been handled and the
      fixes are included.  There's no outstanding issue AFIK.
      
      Upstream code just did a s/fe/vmf/ conversion in the page faults and
      this has been converted as well incrementally.
      
      In addition to the previous submits, this also wakes up stuck userfaults
      during UFFDIO_UNREGISTER.  The non cooperative testcase actually
      reproduced this problem by getting stuck instead of quitting clean in
      some rare case as it could call UFFDIO_UNREGISTER while some userfault
      could be still in flight.  The other option would have been to keep
      leaving it up to userland to serialize itself and to patch the testcase
      instead but the wakeup during unregister I think is preferable.
      
      David also asked the UFFD_FEATURE_MISSING_HUGETLBFS and
      UFFD_FEATURE_MISSING_SHMEM feature flags to be added so QEMU can avoid
      to probe if the hugetlbfs/shmem missing support is available by calling
      UFFDIO_REGISTER.  QEMU already checks HUGETLBFS_MAGIC with fstatfs so if
      UFFD_FEATURE_MISSING_HUGETLBFS is also set, it knows UFFDIO_REGISTER
      will succeed (or if it fails, it's for some other more concerning
      reason).  There's no reason to worry about adding too many feature
      flags.  There are 64 available and worst case we've to bump the API if
      someday we're really going to run out of them.
      
      The round-trip network latency of hugetlbfs userfaults during postcopy
      live migration is still of the order of dozen milliseconds on 10GBit if
      at 2MB hugepage granularity so it's working perfectly and it should
      provide for higher bandwidth or lower CPU usage (which makes it
      interesting to add an option in the future to support THP granularity
      too for anonymous memory, UFFDIO_COPY would then have to create THP if
      alignment/len allows for it).  1GB hugetlbfs granularity will require
      big changes in hugetlbfs to work so it's deferred for later.
      
      This patch (of 42):
      
      This adds proper documentation (inline) to avoid the risk of further
      misunderstandings about the semantics of _IOW/_IOR and it also reminds
      whoever will bump the UFFDIO_API in the future, to change the two ioctl
      to _IOW.
      
      This was found while implementing strace support for those ioctl,
      otherwise we could have never found it by just reviewing kernel code and
      testing it.
      
      _IOC_READ or _IOC_WRITE alters nothing but the ioctl number itself, so
      it's only worth fixing if the UFFDIO_API is bumped someday.
      
      Link: http://lkml.kernel.org/r/20161216144821.5183-2-aarcange@redhat.com
      
      
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: default avatar"Dmitry V. Levin" <ldv@altlinux.org>
      Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e067eba5
  3. Nov 30, 2016
    • Francis Yan's avatar
      tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPING · 1c885808
      Francis Yan authored
      
      
      This patch exports the sender chronograph stats via the socket
      SO_TIMESTAMPING channel. Currently we can instrument how long a
      particular application unit of data was queued in TCP by tracking
      SOF_TIMESTAMPING_TX_SOFTWARE and SOF_TIMESTAMPING_TX_SCHED. Having
      these sender chronograph stats exported simultaneously along with
      these timestamps allow further breaking down the various sender
      limitation.  For example, a video server can tell if a particular
      chunk of video on a connection takes a long time to deliver because
      TCP was experiencing small receive window. It is not possible to
      tell before this patch without packet traces.
      
      To prepare these stats, the user needs to set
      SOF_TIMESTAMPING_OPT_STATS and SOF_TIMESTAMPING_OPT_TSONLY flags
      while requesting other SOF_TIMESTAMPING TX timestamps. When the
      timestamps are available in the error queue, the stats are returned
      in a separate control message of type SCM_TIMESTAMPING_OPT_STATS,
      in a list of TLVs (struct nlattr) of types: TCP_NLA_BUSY_TIME,
      TCP_NLA_RWND_LIMITED, TCP_NLA_SNDBUF_LIMITED. Unit is microsecond.
      
      Signed-off-by: default avatarFrancis Yan <francisyyan@gmail.com>
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1c885808
  4. Oct 17, 2016
    • Dave Hansen's avatar
      generic syscalls: kill cruft from removed pkey syscalls · 71757904
      Dave Hansen authored
      
      
      pkey_set() and pkey_get() were syscalls present in older versions
      of the protection keys patches.  They were fully excised from the
      x86 code, but some cruft was left in the generic syscall code.  The
      C++ comments were intended to help to make it more glaring to me to
      fix them before actually submitting them.  That technique worked,
      but later than I would have liked.
      
      I test-compiled this for arm64.
      
      Fixes: a60f7b69 ("generic syscalls: Wire up memory protection keys syscalls")
      Signed-off-by: default avatarDave Hansen <dave.hansen@linux.intel.com>
      Acked-by: default avatarArnd Bergmann <arnd@arndb.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86@kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: mgorman@techsingularity.net
      Cc: linux-api@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: luto@kernel.org
      Cc: akpm@linux-foundation.org
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      71757904
  5. Sep 09, 2016
    • Dave Hansen's avatar
      generic syscalls: Wire up memory protection keys syscalls · a60f7b69
      Dave Hansen authored
      
      
      These new syscalls are implemented as generic code, so enable them for
      architectures like arm64 which use the generic syscall table.
      
      According to Arnd:
      
        Even if the support is x86 specific for the forseeable future, it may be
        good to reserve the number just in case.  The other architecture specific
        syscall lists are usually left to the individual arch maintainers, most a
        lot of the newer architectures share this table.
      
      Signed-off-by: default avatarDave Hansen <dave.hansen@linux.intel.com>
      Acked-by: default avatarArnd Bergmann <arnd@arndb.de>
      Cc: linux-arch@vger.kernel.org
      Cc: Dave Hansen <dave@sr71.net>
      Cc: mgorman@techsingularity.net
      Cc: linux-api@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: luto@kernel.org
      Cc: akpm@linux-foundation.org
      Cc: torvalds@linux-foundation.org
      Link: http://lkml.kernel.org/r/20160729163018.505A6875@viggo.jf.intel.com
      
      
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      a60f7b69
    • Dave Hansen's avatar
      x86/pkeys: Allocation/free syscalls · e8c24d3a
      Dave Hansen authored
      
      
      This patch adds two new system calls:
      
      	int pkey_alloc(unsigned long flags, unsigned long init_access_rights)
      	int pkey_free(int pkey);
      
      These implement an "allocator" for the protection keys
      themselves, which can be thought of as analogous to the allocator
      that the kernel has for file descriptors.  The kernel tracks
      which numbers are in use, and only allows operations on keys that
      are valid.  A key which was not obtained by pkey_alloc() may not,
      for instance, be passed to pkey_mprotect().
      
      These system calls are also very important given the kernel's use
      of pkeys to implement execute-only support.  These help ensure
      that userspace can never assume that it has control of a key
      unless it first asks the kernel.  The kernel does not promise to
      preserve PKRU (right register) contents except for allocated
      pkeys.
      
      The 'init_access_rights' argument to pkey_alloc() specifies the
      rights that will be established for the returned pkey.  For
      instance:
      
      	pkey = pkey_alloc(flags, PKEY_DENY_WRITE);
      
      will allocate 'pkey', but also sets the bits in PKRU[1] such that
      writing to 'pkey' is already denied.
      
      The kernel does not prevent pkey_free() from successfully freeing
      in-use pkeys (those still assigned to a memory range by
      pkey_mprotect()).  It would be expensive to implement the checks
      for this, so we instead say, "Just don't do it" since sane
      software will never do it anyway.
      
      Any piece of userspace calling pkey_alloc() needs to be prepared
      for it to fail.  Why?  pkey_alloc() returns the same error code
      (ENOSPC) when there are no pkeys and when pkeys are unsupported.
      They can be unsupported for a whole host of reasons, so apps must
      be prepared for this.  Also, libraries or LD_PRELOADs might steal
      keys before an application gets access to them.
      
      This allocation mechanism could be implemented in userspace.
      Even if we did it in userspace, we would still need additional
      user/kernel interfaces to tell userspace which keys are being
      used by the kernel internally (such as for execute-only
      mappings).  Having the kernel provide this facility completely
      removes the need for these additional interfaces, or having an
      implementation of this in userspace at all.
      
      Note that we have to make changes to all of the architectures
      that do not use mman-common.h because we use the new
      PKEY_DENY_ACCESS/WRITE macros in arch-independent code.
      
      1. PKRU is the Protection Key Rights User register.  It is a
         usermode-accessible register that controls whether writes
         and/or access to each individual pkey is allowed or denied.
      
      Signed-off-by: default avatarDave Hansen <dave.hansen@linux.intel.com>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: linux-arch@vger.kernel.org
      Cc: Dave Hansen <dave@sr71.net>
      Cc: arnd@arndb.de
      Cc: linux-api@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: luto@kernel.org
      Cc: akpm@linux-foundation.org
      Cc: torvalds@linux-foundation.org
      Link: http://lkml.kernel.org/r/20160729163015.444FE75F@viggo.jf.intel.com
      
      
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      e8c24d3a
  6. May 04, 2016
    • James Hogan's avatar
      asm-generic: Drop renameat syscall from default list · b0da6d44
      James Hogan authored
      
      
      The newer renameat2 syscall provides all the functionality provided by
      the renameat syscall and adds flags, so future architectures won't need
      to include renameat.
      
      Therefore drop the renameat syscall from the generic syscall list unless
      __ARCH_WANT_RENAMEAT is defined by the architecture's unistd.h prior to
      including asm-generic/unistd.h, and adjust all architectures using the
      generic syscall list to define it so that no in-tree architectures are
      affected.
      
      Signed-off-by: default avatarJames Hogan <james.hogan@imgtec.com>
      Acked-by: default avatarVineet Gupta <vgupta@synopsys.com>
      Cc: linux-arch@vger.kernel.org
      Cc: linux-snps-arc@lists.infradead.org
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: linux-c6x-dev@linux-c6x.org
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: linux-hexagon@vger.kernel.org
      Cc: linux-metag@vger.kernel.org
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: linux@lists.openrisc.net
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: nios2-dev@lists.rocketboards.org
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: uclinux-h8-devel@lists.sourceforge.jp
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      b0da6d44
    • Yury Norov's avatar
      asm-generic: use compat version for preadv2 and pwritev2 · 1f93e9f2
      Yury Norov authored
      
      
      Compat architectures that does not use generic unistd (mips, s390),
      declare compat version in their syscall tables for preadv2 and
      pwritev2. Generic unistd syscall table should do it as well.
      
      [arnd: this initially slipped through the review and an
       incorrect patch got merged. arch/tile/ is the only architecture
       that could be affected for their 32-bit compat mode, every
       other architecture we support today is fine.]
      
      Signed-off-by: default avatarYury Norov <ynorov@caviumnetworks.com>
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      1f93e9f2
  7. Apr 23, 2016
  8. Mar 05, 2016
    • Dave Hansen's avatar
      mm/pkeys: Fix siginfo ABI breakage caused by new u64 field · 49cd53bf
      Dave Hansen authored
      Stephen Rothwell reported this linux-next build failure:
      
      	http://lkml.kernel.org/r/20160226164406.065a1ffc@canb.auug.org.au
      
      
      
      ... caused by the Memory Protection Keys patches from the tip tree triggering
      a newly introduced build-time sanity check on an ARM build, because they changed
      the ABI of siginfo in an unexpected way.
      
      If u64 has a natural alignment of 8 bytes (which is the case on most mainstream
      platforms, with the notable exception of x86-32), then the leadup to the
      _sifields union matters:
      
      typedef struct siginfo {
              int si_signo;
              int si_errno;
              int si_code;
      
              union {
      	...
              } _sifields;
      } __ARCH_SI_ATTRIBUTES siginfo_t;
      
      Note how the first 3 fields give us 12 bytes, so _sifields is not 8
      naturally bytes aligned.
      
      Before the _pkey field addition the largest element of _sifields (on
      32-bit platforms) was 32 bits. With the u64 added, the minimum alignment
      requirement increased to 8 bytes on those (rare) 32-bit platforms. Thus
      GCC padded the space after si_code with 4 extra bytes, and shifted all
      _sifields offsets by 4 bytes - breaking the ABI of all of those
      remaining fields.
      
      On 64-bit platforms this problem was hidden due to _sifields already
      having numerous fields with natural 8 bytes alignment (pointers).
      
      To fix this, we replace the u64 with an '__u32'.  The __u32 does not
      increase the minimum alignment requirement of the union, and it is
      also large enough to store the 16-bit pkey we have today on x86.
      
      Reported-by: default avatarStehen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: default avatarDave Hansen <dave.hansen@linux.intel.com>
      Acked-by: default avatarStehen Rothwell <sfr@canb.auug.org.au>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-next@vger.kernel.org
      Fixes: cd0ea35f ("signals, pkeys: Notify userspace about protection key faults")
      Link: http://lkml.kernel.org/r/20160301125451.02C7426D@viggo.jf.intel.com
      
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      49cd53bf
  9. Feb 26, 2016
    • Tom Herbert's avatar
      net: Facility to report route quality of connected sockets · a87cb3e4
      Tom Herbert authored
      
      
      This patch add the SO_CNX_ADVICE socket option (setsockopt only). The
      purpose is to allow an application to give feedback to the kernel about
      the quality of the network path for a connected socket. The value
      argument indicates the type of quality report. For this initial patch
      the only supported advice is a value of 1 which indicates "bad path,
      please reroute"-- the action taken by the kernel is to call
      dst_negative_advice which will attempt to choose a different ECMP route,
      reset the TX hash for flow label and UDP source port in encapsulation,
      etc.
      
      This facility should be useful for connected UDP sockets where only the
      application can provide any feedback about path quality. It could also
      be useful for TCP applications that have additional knowledge about the
      path outside of the normal TCP control loop.
      
      Signed-off-by: default avatarTom Herbert <tom@herbertland.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      a87cb3e4
  10. Feb 18, 2016
    • Dave Hansen's avatar
      signals, pkeys: Notify userspace about protection key faults · cd0ea35f
      Dave Hansen authored
      
      
      A protection key fault is very similar to any other access error.
      There must be a VMA, etc...  We even want to take the same action
      (SIGSEGV) that we do with a normal access fault.
      
      However, we do need to let userspace know that something is
      different.  We do this the same way what we did with SEGV_BNDERR
      with Memory Protection eXtensions (MPX): define a new SEGV code:
      SEGV_PKUERR.
      
      We add a siginfo field: si_pkey that reveals to userspace which
      protection key was set on the PTE that we faulted on.  There is
      no other easy way for userspace to figure this out.  They could
      parse smaps but that would be a bit cruel.
      
      We share space with in siginfo with _addr_bnd.  #BR faults from
      MPX are completely separate from page faults (#PF) that trigger
      from protection key violations, so we never need both at the same
      time.
      
      Note that _pkey is a 64-bit value.  The current hardware only
      supports 4-bit protection keys.  We do this because there is
      _plenty_ of space in _sigfault and it is possible that future
      processors would support more than 4 bits of protection keys.
      
      The x86 code to actually fill in the siginfo is in the next
      patch.
      
      Signed-off-by: default avatarDave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Amanieu d'Antras <amanieu@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: linux-arch@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210212.3A9B83AC@viggo.jf.intel.com
      
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      cd0ea35f
  11. Jan 16, 2016
    • Chen Gang's avatar
      arch/*/include/uapi/asm/mman.h: : let MADV_FREE have same value for all architectures · 21f55b01
      Chen Gang authored
      
      
      For uapi, need try to let all macros have same value, and MADV_FREE is
      added into main branch recently, so need redefine MADV_FREE for it.
      
      At present, '8' can be shared with all architectures, so redefine it to
      '8'.
      
      [sudipm.mukherjee@gmail.com: correct uniform value of MADV_FREE]
      Signed-off-by: default avatarChen Gang <gang.chen.5i5j@gmail.com>
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: <yalin.wang2010@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: Jason Evans <je@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mika Penttil <mika.penttila@nextfour.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarSudip Mukherjee <sudip@vectorindia.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      21f55b01
    • Minchan Kim's avatar
      mm: support madvise(MADV_FREE) · 854e9ed0
      Minchan Kim authored
      Linux doesn't have an ability to free pages lazy while other OS already
      have been supported that named by madvise(MADV_FREE).
      
      The gain is clear that kernel can discard freed pages rather than
      swapping out or OOM if memory pressure happens.
      
      Without memory pressure, freed pages would be reused by userspace
      without another additional overhead(ex, page fault + allocation +
      zeroing).
      
      Jason Evans said:
      
      : Facebook has been using MAP_UNINITIALIZED
      : (https://lkml.org/lkml/2012/1/18/308
      
      ) in some of its applications for
      : several years, but there are operational costs to maintaining this
      : out-of-tree in our kernel and in jemalloc, and we are anxious to retire it
      : in favor of MADV_FREE.  When we first enabled MAP_UNINITIALIZED it
      : increased throughput for much of our workload by ~5%, and although the
      : benefit has decreased using newer hardware and kernels, there is still
      : enough benefit that we cannot reasonably retire it without a replacement.
      :
      : Aside from Facebook operations, there are numerous broadly used
      : applications that would benefit from MADV_FREE.  The ones that immediately
      : come to mind are redis, varnish, and MariaDB.  I don't have much insight
      : into Android internals and development process, but I would hope to see
      : MADV_FREE support eventually end up there as well to benefit applications
      : linked with the integrated jemalloc.
      :
      : jemalloc will use MADV_FREE once it becomes available in the Linux kernel.
      : In fact, jemalloc already uses MADV_FREE or equivalent everywhere it's
      : available: *BSD, OS X, Windows, and Solaris -- every platform except Linux
      : (and AIX, but I'm not sure it even compiles on AIX).  The lack of
      : MADV_FREE on Linux forced me down a long series of increasingly
      : sophisticated heuristics for madvise() volume reduction, and even so this
      : remains a common performance issue for people using jemalloc on Linux.
      : Please integrate MADV_FREE; many people will benefit substantially.
      
      How it works:
      
      When madvise syscall is called, VM clears dirty bit of ptes of the
      range.  If memory pressure happens, VM checks dirty bit of page table
      and if it found still "clean", it means it's a "lazyfree pages" so VM
      could discard the page instead of swapping out.  Once there was store
      operation for the page before VM peek a page to reclaim, dirty bit is
      set so VM can swap out the page instead of discarding.
      
      One thing we should notice is that basically, MADV_FREE relies on dirty
      bit in page table entry to decide whether VM allows to discard the page
      or not.  IOW, if page table entry includes marked dirty bit, VM
      shouldn't discard the page.
      
      However, as a example, if swap-in by read fault happens, page table
      entry doesn't have dirty bit so MADV_FREE could discard the page
      wrongly.
      
      For avoiding the problem, MADV_FREE did more checks with PageDirty and
      PageSwapCache.  It worked out because swapped-in page lives on swap
      cache and since it is evicted from the swap cache, the page has PG_dirty
      flag.  So both page flags check effectively prevent wrong discarding by
      MADV_FREE.
      
      However, a problem in above logic is that swapped-in page has PG_dirty
      still after they are removed from swap cache so VM cannot consider the
      page as freeable any more even if madvise_free is called in future.
      
      Look at below example for detail.
      
          ptr = malloc();
          memset(ptr);
          ..
          ..
          .. heavy memory pressure so all of pages are swapped out
          ..
          ..
          var = *ptr; -> a page swapped-in and could be removed from
                         swapcache. Then, page table doesn't mark
                         dirty bit and page descriptor includes PG_dirty
          ..
          ..
          madvise_free(ptr); -> It doesn't clear PG_dirty of the page.
          ..
          ..
          ..
          .. heavy memory pressure again.
          .. In this time, VM cannot discard the page because the page
          .. has *PG_dirty*
      
      To solve the problem, this patch clears PG_dirty if only the page is
      owned exclusively by current process when madvise is called because
      PG_dirty represents ptes's dirtiness in several processes so we could
      clear it only if we own it exclusively.
      
      Firstly, heavy users would be general allocators(ex, jemalloc, tcmalloc
      and hope glibc supports it) and jemalloc/tcmalloc already have supported
      the feature for other OS(ex, FreeBSD)
      
        barrios@blaptop:~/benchmark/ebizzy$ lscpu
        Architecture:          x86_64
        CPU op-mode(s):        32-bit, 64-bit
        Byte Order:            Little Endian
        CPU(s):                12
        On-line CPU(s) list:   0-11
        Thread(s) per core:    1
        Core(s) per socket:    1
        Socket(s):             12
        NUMA node(s):          1
        Vendor ID:             GenuineIntel
        CPU family:            6
        Model:                 2
        Stepping:              3
        CPU MHz:               3200.185
        BogoMIPS:              6400.53
        Virtualization:        VT-x
        Hypervisor vendor:     KVM
        Virtualization type:   full
        L1d cache:             32K
        L1i cache:             32K
        L2 cache:              4096K
        NUMA node0 CPU(s):     0-11
        ebizzy benchmark(./ebizzy -S 10 -n 512)
      
        Higher avg is better.
      
         vanilla-jemalloc             MADV_free-jemalloc
      
        1 thread
        records: 10                   records: 10
        avg:   2961.90                avg:  12069.70
        std:     71.96(2.43%)         std:    186.68(1.55%)
        max:   3070.00                max:  12385.00
        min:   2796.00                min:  11746.00
      
        2 thread
        records: 10                   records: 10
        avg:   5020.00                avg:  17827.00
        std:    264.87(5.28%)         std:    358.52(2.01%)
        max:   5244.00                max:  18760.00
        min:   4251.00                min:  17382.00
      
        4 thread
        records: 10                   records: 10
        avg:   8988.80                avg:  27930.80
        std:   1175.33(13.08%)        std:   3317.33(11.88%)
        max:   9508.00                max:  30879.00
        min:   5477.00                min:  21024.00
      
        8 thread
        records: 10                   records: 10
        avg:  13036.50                avg:  33739.40
        std:    170.67(1.31%)         std:   5146.22(15.25%)
        max:  13371.00                max:  40572.00
        min:  12785.00                min:  24088.00
      
        16 thread
        records: 10                   records: 10
        avg:  11092.40                avg:  31424.20
        std:    710.60(6.41%)         std:   3763.89(11.98%)
        max:  12446.00                max:  36635.00
        min:   9949.00                min:  25669.00
      
        32 thread
        records: 10                   records: 10
        avg:  11067.00                avg:  34495.80
        std:    971.06(8.77%)         std:   2721.36(7.89%)
        max:  12010.00                max:  38598.00
        min:   9002.00                min:  30636.00
      
      In summary, MADV_FREE is about much faster than MADV_DONTNEED.
      
      This patch (of 12):
      
      Add core MADV_FREE implementation.
      
      [akpm@linux-foundation.org: small cleanups]
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Mika Penttil <mika.penttila@nextfour.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jason Evans <je@fb.com>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: <yalin.wang2010@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: "Shaohua Li" <shli@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      854e9ed0
  12. Jan 05, 2016
  13. Dec 01, 2015
    • Zach Brown's avatar
      vfs: add copy_file_range syscall and vfs helper · 29732938
      Zach Brown authored
      
      
      Add a copy_file_range() system call for offloading copies between
      regular files.
      
      This gives an interface to underlying layers of the storage stack which
      can copy without reading and writing all the data.  There are a few
      candidates that should support copy offloading in the nearer term:
      
      - btrfs shares extent references with its clone ioctl
      - NFS has patches to add a COPY command which copies on the server
      - SCSI has a family of XCOPY commands which copy in the device
      
      This system call avoids the complexity of also accelerating the creation
      of the destination file by operating on an existing destination file
      descriptor, not a path.
      
      Currently the high level vfs entry point limits copy offloading to files
      on the same mount and super (and not in the same file).  This can be
      relaxed if we get implementations which can copy between file systems
      safely.
      
      Signed-off-by: default avatarZach Brown <zab@redhat.com>
      [Anna Schumaker: Change -EINVAL to -EBADF during file verification,
                       Change flags parameter from int to unsigned int,
                       Add function to include/linux/syscalls.h,
                       Check copy len after file open mode,
                       Don't forbid ranges inside the same file,
                       Use rw_verify_area() to veriy ranges,
                       Use file_out rather than file_in,
                       Add COPY_FR_REFLINK flag]
      Signed-off-by: default avatarAnna Schumaker <Anna.Schumaker@Netapp.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      29732938
  14. Nov 06, 2015
    • Eric B Munson's avatar
      mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage · b0f205c2
      Eric B Munson authored
      
      
      The previous patch introduced a flag that specified pages in a VMA should
      be placed on the unevictable LRU, but they should not be made present when
      the area is created.  This patch adds the ability to set this state via
      the new mlock system calls.
      
      We add MLOCK_ONFAULT for mlock2 and MCL_ONFAULT for mlockall.
      MLOCK_ONFAULT will set the VM_LOCKONFAULT modifier for VM_LOCKED.
      MCL_ONFAULT should be used as a modifier to the two other mlockall flags.
      When used with MCL_CURRENT, all current mappings will be marked with
      VM_LOCKED | VM_LOCKONFAULT.  When used with MCL_FUTURE, the mm->def_flags
      will be marked with VM_LOCKED | VM_LOCKONFAULT.  When used with both
      MCL_CURRENT and MCL_FUTURE, all current mappings and mm->def_flags will be
      marked with VM_LOCKED | VM_LOCKONFAULT.
      
      Prior to this patch, mlockall() will unconditionally clear the
      mm->def_flags any time it is called without MCL_FUTURE.  This behavior is
      maintained after adding MCL_ONFAULT.  If a call to mlockall(MCL_FUTURE) is
      followed by mlockall(MCL_CURRENT), the mm->def_flags will be cleared and
      new VMAs will be unlocked.  This remains true with or without MCL_ONFAULT
      in either mlockall() invocation.
      
      munlock() will unconditionally clear both vma flags.  munlockall()
      unconditionally clears for VMA flags on all VMAs and in the mm->def_flags
      field.
      
      Signed-off-by: default avatarEric B Munson <emunson@akamai.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b0f205c2
    • Eric B Munson's avatar
      mm: mlock: add new mlock system call · a8ca5d0e
      Eric B Munson authored
      
      
      With the refactored mlock code, introduce a new system call for mlock.
      The new call will allow the user to specify what lock states are being
      added.  mlock2 is trivial at the moment, but a follow on patch will add a
      new mlock state making it useful.
      
      Signed-off-by: default avatarEric B Munson <emunson@akamai.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a8ca5d0e
  15. Oct 12, 2015
  16. Sep 22, 2015
  17. Sep 11, 2015
    • Mathieu Desnoyers's avatar
      sys_membarrier(): system-wide memory barrier (generic, x86) · 5b25b13a
      Mathieu Desnoyers authored
      Here is an implementation of a new system call, sys_membarrier(), which
      executes a memory barrier on all threads running on the system.  It is
      implemented by calling synchronize_sched().  It can be used to
      distribute the cost of user-space memory barriers asymmetrically by
      transforming pairs of memory barriers into pairs consisting of
      sys_membarrier() and a compiler barrier.  For synchronization primitives
      that distinguish between read-side and write-side (e.g.  userspace RCU
      [1], rwlocks), the read-side can be accelerated significantly by moving
      the bulk of the memory barrier overhead to the write-side.
      
      The existing applications of which I am aware that would be improved by
      this system call are as follows:
      
      * Through Userspace RCU library (http://urcu.so)
        - DNS server (Knot DNS) https://www.knot-dns.cz/
        - Network sniffer (http://netsniff-ng.org/)
        - Distributed object storage (https://sheepdog.github.io/sheepdog/)
        - User-space tracing (http://lttng.org)
        - Network storage system (https://www.gluster.org/)
        - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
        - Financial software (https://lkml.org/lkml/2015/3/23/189)
      
      Those projects use RCU in userspace to increase read-side speed and
      scalability compared to locking.  Especially in the case of RCU used by
      libraries, sys_membarrier can speed up the read-side by moving the bulk of
      the memory barrier cost to synchronize_rcu().
      
      * Direct users of sys_membarrier
        - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)
      
      Microsoft core dotnet GC developers are planning to use the mprotect()
      side-effect of issuing memory barriers through IPIs as a way to implement
      Windows FlushProcessWriteBuffers() on Linux.  They are referring to
      sys_membarrier in their github thread, specifically stating that
      sys_membarrier() is what they are looking for.
      
      To explain the benefit of this scheme, let's introduce two example threads:
      
      Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
      Thread B (frequent, e.g. executing liburcu
      rcu_read_lock()/rcu_read_unlock())
      
      In a scheme where all smp_mb() in thread A are ordering memory accesses
      with respect to smp_mb() present in Thread B, we can change each
      smp_mb() within Thread A into calls to sys_membarrier() and each
      smp_mb() within Thread B into compiler barriers "barrier()".
      
      Before the change, we had, for each smp_mb() pairs:
      
      Thread A                    Thread B
      previous mem accesses       previous mem accesses
      smp_mb()                    smp_mb()
      following mem accesses      following mem accesses
      
      After the change, these pairs become:
      
      Thread A                    Thread B
      prev mem accesses           prev mem accesses
      sys_membarrier()            barrier()
      follow mem accesses         follow mem accesses
      
      As we can see, there are two possible scenarios: either Thread B memory
      accesses do not happen concurrently with Thread A accesses (1), or they
      do (2).
      
      1) Non-concurrent Thread A vs Thread B accesses:
      
      Thread A                    Thread B
      prev mem accesses
      sys_membarrier()
      follow mem accesses
                                  prev mem accesses
                                  barrier()
                                  follow mem accesses
      
      In this case, thread B accesses will be weakly ordered. This is OK,
      because at that point, thread A is not particularly interested in
      ordering them with respect to its own accesses.
      
      2) Concurrent Thread A vs Thread B accesses
      
      Thread A                    Thread B
      prev mem accesses           prev mem accesses
      sys_membarrier()            barrier()
      follow mem accesses         follow mem accesses
      
      In this case, thread B accesses, which are ensured to be in program
      order thanks to the compiler barrier, will be "upgraded" to full
      smp_mb() by synchronize_sched().
      
      * Benchmarks
      
      On Intel Xeon E5405 (8 cores)
      (one thread is calling sys_membarrier, the other 7 threads are busy
      looping)
      
      1000 non-expedited sys_membarrier calls in 33s =3D 33 milliseconds/call.
      
      * User-space user of this system call: Userspace RCU library
      
      Both the signal-based and the sys_membarrier userspace RCU schemes
      permit us to remove the memory barrier from the userspace RCU
      rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
      accelerating them. These memory barriers are replaced by compiler
      barriers on the read-side, and all matching memory barriers on the
      write-side are turned into an invocation of a memory barrier on all
      active threads in the process. By letting the kernel perform this
      synchronization rather than dumbly sending a signal to every process
      threads (as we currently do), we diminish the number of unnecessary wake
      ups and only issue the memory barriers on active threads. Non-running
      threads do not need to execute such barrier anyway, because these are
      implied by the scheduler context switches.
      
      Results in liburcu:
      
      Operations in 10s, 6 readers, 2 writers:
      
      memory barriers in reader:    1701557485 reads, 2202847 writes
      signal-based scheme:          9830061167 reads,    6700 writes
      sys_membarrier:               9952759104 reads,     425 writes
      sys_membarrier (dyn. check):  7970328887 reads,     425 writes
      
      The dynamic sys_membarrier availability check adds some overhead to
      the read-side compared to the signal-based scheme, but besides that,
      sys_membarrier slightly outperforms the signal-based scheme. However,
      this non-expedited sys_membarrier implementation has a much slower grace
      period than signal and memory barrier schemes.
      
      Besides diminishing the number of wake-ups, one major advantage of the
      membarrier system call over the signal-based scheme is that it does not
      need to reserve a signal. This plays much more nicely with libraries,
      and with processes injected into for tracing purposes, for which we
      cannot expect that signals will be unused by the application.
      
      An expedited version of this system call can be added later on to speed
      up the grace period. Its implementation will likely depend on reading
      the cpu_curr()->mm without holding each CPU's rq lock.
      
      This patch adds the system call to x86 and to asm-generic.
      
      [1] http://urcu.so
      
      
      
      membarrier(2) man page:
      
      MEMBARRIER(2)              Linux Programmer's Manual             MEMBARRIER(2)
      
      NAME
             membarrier - issue memory barriers on a set of threads
      
      SYNOPSIS
             #include <linux/membarrier.h>
      
             int membarrier(int cmd, int flags);
      
      DESCRIPTION
             The cmd argument is one of the following:
      
             MEMBARRIER_CMD_QUERY
                    Query  the  set  of  supported commands. It returns a bitmask of
                    supported commands.
      
             MEMBARRIER_CMD_SHARED
                    Execute a memory barrier on all threads running on  the  system.
                    Upon  return from system call, the caller thread is ensured that
                    all running threads have passed through a state where all memory
                    accesses  to  user-space  addresses  match program order between
                    entry to and return from the system  call  (non-running  threads
                    are de facto in such a state). This covers threads from all pro=E2=80=90
                    cesses running on the system.  This command returns 0.
      
             The flags argument needs to be 0. For future extensions.
      
             All memory accesses performed  in  program  order  from  each  targeted
             thread is guaranteed to be ordered with respect to sys_membarrier(). If
             we use the semantic "barrier()" to represent a compiler barrier forcing
             memory  accesses  to  be performed in program order across the barrier,
             and smp_mb() to represent explicit memory barriers forcing full  memory
             ordering  across  the barrier, we have the following ordering table for
             each pair of barrier(), sys_membarrier() and smp_mb():
      
             The pair ordering is detailed as (O: ordered, X: not ordered):
      
                                    barrier()   smp_mb() sys_membarrier()
                    barrier()          X           X            O
                    smp_mb()           X           O            O
                    sys_membarrier()   O           O            O
      
      RETURN VALUE
             On success, these system calls return zero.  On error, -1 is  returned,
             and errno is set appropriately. For a given command, with flags
             argument set to 0, this system call is guaranteed to always return the
             same value until reboot.
      
      ERRORS
             ENOSYS System call is not implemented.
      
             EINVAL Invalid arguments.
      
      Linux                             2015-04-15                     MEMBARRIER(2)
      
      Signed-off-by: default avatarMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: default avatarPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: default avatarJosh Triplett <josh@joshtriplett.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Nicholas Miell <nmiell@comcast.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Pranith Kumar <bobby.prani@gmail.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5b25b13a
  18. Apr 17, 2015
  19. Jan 08, 2015
    • David Drysdale's avatar
      vfs: renumber FMODE_NONOTIFY and add to uniqueness check · 75069f2b
      David Drysdale authored
      
      
      Fix clashing values for O_PATH and FMODE_NONOTIFY on sparc.  The
      clashing O_PATH value was added in commit 5229645b ("vfs: add
      nonconflicting values for O_PATH") but this can't be changed as it is
      user-visible.
      
      FMODE_NONOTIFY is only used internally in the kernel, but it is in the
      same numbering space as the other O_* flags, as indicated by the comment
      at the top of include/uapi/asm-generic/fcntl.h (and its use in
      fs/notify/fanotify/fanotify_user.c).  So renumber it to avoid the clash.
      
      All of this has happened before (commit 12ed2e36: "fanotify:
      FMODE_NONOTIFY and __O_SYNC in sparc conflict"), and all of this will
      happen again -- so update the uniqueness check in fcntl_init() to
      include __FMODE_NONOTIFY.
      
      Signed-off-by: default avatarDavid Drysdale <drysdale@google.com>
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Acked-by: default avatarJan Kara <jack@suse.cz>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Eric Paris <eparis@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      75069f2b
  20. Dec 13, 2014
    • David Drysdale's avatar
      syscalls: implement execveat() system call · 51f39a1f
      David Drysdale authored
      This patchset adds execveat(2) for x86, and is derived from Meredydd
      Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
      
      The primary aim of adding an execveat syscall is to allow an
      implementation of fexecve(3) that does not rely on the /proc filesystem,
      at least for executables (rather than scripts).  The current glibc version
      of fexecve(3) is implemented via /proc, which causes problems in sandboxed
      or otherwise restricted environments.
      
      Given the desire for a /proc-free fexecve() implementation, HPA suggested
      (https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
      an appropriate generalization.
      
      Also, having a new syscall means that it can take a flags argument without
      back-compatibility concerns.  The current implementation just defines the
      AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
      added in future -- for example, flags for new namespaces (as suggested at
      https://lkml.org/lkml/2006/7/11/474).
      
      Rela...
      51f39a1f
  21. Dec 06, 2014
    • Alexei Starovoitov's avatar
      net: sock: allow eBPF programs to be attached to sockets · 89aa0758
      Alexei Starovoitov authored
      
      
      introduce new setsockopt() command:
      
      setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd))
      
      where prog_fd was received from syscall bpf(BPF_PROG_LOAD, attr, ...)
      and attr->prog_type == BPF_PROG_TYPE_SOCKET_FILTER
      
      setsockopt() calls bpf_prog_get() which increments refcnt of the program,
      so it doesn't get unloaded while socket is using the program.
      
      The same eBPF program can be attached to multiple sockets.
      
      User task exit automatically closes socket which calls sk_filter_uncharge()
      which decrements refcnt of eBPF program
      
      Signed-off-by: default avatarAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      89aa0758
  22. Nov 17, 2014
  23. Nov 11, 2014
    • Eric Dumazet's avatar
      net: introduce SO_INCOMING_CPU · 2c8c56e1
      Eric Dumazet authored
      
      
      Alternative to RPS/RFS is to use hardware support for multiple
      queues.
      
      Then split a set of million of sockets into worker threads, each
      one using epoll() to manage events on its own socket pool.
      
      Ideally, we want one thread per RX/TX queue/cpu, but we have no way to
      know after accept() or connect() on which queue/cpu a socket is managed.
      
      We normally use one cpu per RX queue (IRQ smp_affinity being properly
      set), so remembering on socket structure which cpu delivered last packet
      is enough to solve the problem.
      
      After accept(), connect(), or even file descriptor passing around
      processes, applications can use :
      
       int cpu;
       socklen_t len = sizeof(cpu);
      
       getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len);
      
      And use this information to put the socket into the right silo
      for optimal performance, as all networking stack should run
      on the appropriate cpu, without need to send IPI (RPS/RFS).
      
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      2c8c56e1
  24. Sep 26, 2014
  25. Aug 18, 2014
  26. Aug 05, 2014
    • Theodore Ts'o's avatar
      random: introduce getrandom(2) system call · c6e9d6f3
      Theodore Ts'o authored
      
      
      The getrandom(2) system call was requested by the LibreSSL Portable
      developers.  It is analoguous to the getentropy(2) system call in
      OpenBSD.
      
      The rationale of this system call is to provide resiliance against
      file descriptor exhaustion attacks, where the attacker consumes all
      available file descriptors, forcing the use of the fallback code where
      /dev/[u]random is not available.  Since the fallback code is often not
      well-tested, it is better to eliminate this potential failure mode
      entirely.
      
      The other feature provided by this new system call is the ability to
      request randomness from the /dev/urandom entropy pool, but to block
      until at least 128 bits of entropy has been accumulated in the
      /dev/urandom entropy pool.  Historically, the emphasis in the
      /dev/urandom development has been to ensure that urandom pool is
      initialized as quickly as possible after system boot, and preferably
      before the init scripts start execution.
      
      This is because changing /dev/urandom reads to block represents an
      interface change that could potentially break userspace which is not
      acceptable.  In practice, on most x86 desktop and server systems, in
      general the entropy pool can be initialized before it is needed (and
      in modern kernels, we will printk a warning message if not).  However,
      on an embedded system, this may not be the case.  And so with this new
      interface, we can provide the functionality of blocking until the
      urandom pool has been initialized.  Any userspace program which uses
      this new functionality must take care to assure that if it is used
      during the boot process, that it will not cause the init scripts or
      other portions of the system startup to hang indefinitely.
      
      SYNOPSIS
      	#include <linux/random.h>
      
      	int getrandom(void *buf, size_t buflen, unsigned int flags);
      
      DESCRIPTION
      	The system call getrandom() fills the buffer pointed to by buf
      	with up to buflen random bytes which can be used to seed user
      	space random number generators (i.e., DRBG's) or for other
      	cryptographic uses.  It should not be used for Monte Carlo
      	simulations or other programs/algorithms which are doing
      	probabilistic sampling.
      
      	If the GRND_RANDOM flags bit is set, then draw from the
      	/dev/random pool instead of the /dev/urandom pool.  The
      	/dev/random pool is limited based on the entropy that can be
      	obtained from environmental noise, so if there is insufficient
      	entropy, the requested number of bytes may not be returned.
      	If there is no entropy available at all, getrandom(2) will
      	either block, or return an error with errno set to EAGAIN if
      	the GRND_NONBLOCK bit is set in flags.
      
      	If the GRND_RANDOM bit is not set, then the /dev/urandom pool
      	will be used.  Unlike using read(2) to fetch data from
      	/dev/urandom, if the urandom pool has not been sufficiently
      	initialized, getrandom(2) will block (or return -1 with the
      	errno set to EAGAIN if the GRND_NONBLOCK bit is set in flags).
      
      	The getentropy(2) system call in OpenBSD can be emulated using
      	the following function:
      
                  int getentropy(void *buf, size_t buflen)
                  {
                          int     ret;
      
                          if (buflen > 256)
                                  goto failure;
                          ret = getrandom(buf, buflen, 0);
                          if (ret < 0)
                                  return ret;
                          if (ret == buflen)
                                  return 0;
                  failure:
                          errno = EIO;
                          return -1;
                  }
      
      RETURN VALUE
             On success, the number of bytes that was filled in the buf is
             returned.  This may not be all the bytes requested by the
             caller via buflen if insufficient entropy was present in the
             /dev/random pool, or if the system call was interrupted by a
             signal.
      
             On error, -1 is returned, and errno is set appropriately.
      
      ERRORS
      	EINVAL		An invalid flag was passed to getrandom(2)
      
      	EFAULT		buf is outside the accessible address space.
      
      	EAGAIN		The requested entropy was not available, and
      			getentropy(2) would have blocked if the
      			GRND_NONBLOCK flag was not set.
      
      	EINTR		While blocked waiting for entropy, the call was
      			interrupted by a signal handler; see the description
      			of how interrupted read(2) calls on "slow" devices
      			are handled with and without the SA_RESTART flag
      			in the signal(7) man page.
      
      NOTES
      	For small requests (buflen <= 256) getrandom(2) will not
      	return EINTR when reading from the urandom pool once the
      	entropy pool has been initialized, and it will return all of
      	the bytes that have been requested.  This is the recommended
      	way to use getrandom(2), and is designed for compatibility
      	with OpenBSD's getentropy() system call.
      
      	However, if you are using GRND_RANDOM, then getrandom(2) may
      	block until the entropy accounting determines that sufficient
      	environmental noise has been gathered such that getrandom(2)
      	will be operating as a NRBG instead of a DRBG for those people
      	who are working in the NIST SP 800-90 regime.  Since it may
      	block for a long time, these guarantees do *not* apply.  The
      	user may want to interrupt a hanging process using a signal,
      	so blocking until all of the requested bytes are returned
      	would be unfriendly.
      
      	For this reason, the user of getrandom(2) MUST always check
      	the return value, in case it returns some error, or if fewer
      	bytes than requested was returned.  In the case of
      	!GRND_RANDOM and small request, the latter should never
      	happen, but the careful userspace code (and all crypto code
      	should be careful) should check for this anyway!
      
      	Finally, unless you are doing long-term key generation (and
      	perhaps not even then), you probably shouldn't be using
      	GRND_RANDOM.  The cryptographic algorithms used for
      	/dev/urandom are quite conservative, and so should be
      	sufficient for all purposes.  The disadvantage of GRND_RANDOM
      	is that it can block, and the increased complexity required to
      	deal with partially fulfilled getrandom(2) requests.
      
      Signed-off-by: default avatarTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: default avatarZach Brown <zab@zabbo.net>
      c6e9d6f3
  27. Jul 18, 2014
    • Kees Cook's avatar
      seccomp: add "seccomp" syscall · 48dc92b9
      Kees Cook authored
      
      
      This adds the new "seccomp" syscall with both an "operation" and "flags"
      parameter for future expansion. The third argument is a pointer value,
      used with the SECCOMP_SET_MODE_FILTER operation. Currently, flags must
      be 0. This is functionally equivalent to prctl(PR_SET_SECCOMP, ...).
      
      In addition to the TSYNC flag later in this patch series, there is a
      non-zero chance that this syscall could be used for configuring a fixed
      argument area for seccomp-tracer-aware processes to pass syscall arguments
      in the future. Hence, the use of "seccomp" not simply "seccomp_add_filter"
      for this syscall. Additionally, this syscall uses operation, flags,
      and user pointer for arguments because strictly passing arguments via
      a user pointer would mean seccomp itself would be unable to trivially
      filter the seccomp syscall itself.
      
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Reviewed-by: default avatarOleg Nesterov <oleg@redhat.com>
      Reviewed-by: default avatarAndy Lutomirski <luto@amacapital.net>
      48dc92b9
  28. May 20, 2014
    • James Hogan's avatar
      asm-generic: Add renameat2 syscall · 63ba6000
      James Hogan authored
      
      
      Add the renameat2 syscall to the generic syscall list, which is used by the
      following architectures: arc, arm64, c6x, hexagon, metag, openrisc, score,
      tile, unicore32.
      
      Signed-off-by: default avatarJames Hogan <james.hogan@imgtec.com>
      Acked-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@suse.cz>
      Cc: linux-arch@vger.kernel.org
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: linux-hexagon@vger.kernel.org
      Cc: linux-metag@vger.kernel.org
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      63ba6000
  29. May 14, 2014
    • James Hogan's avatar
      asm-generic: remove _STK_LIM_MAX · ffe6902b
      James Hogan authored
      
      
      _STK_LIM_MAX could be used to override the RLIMIT_STACK hard limit from
      an arch's include/uapi/asm-generic/resource.h file, but is no longer
      used since both parisc and metag removed the override. Therefore remove
      it entirely, setting the hard RLIMIT_STACK limit to RLIM_INFINITY
      directly in include/asm-generic/resource.h.
      
      Signed-off-by: default avatarJames Hogan <james.hogan@imgtec.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: linux-arch@vger.kernel.org
      Cc: Helge Deller <deller@gmx.de>
      Cc: John David Anglin <dave.anglin@bell.net>
      ffe6902b
  30. Apr 22, 2014
    • Jeff Layton's avatar
      locks: rename file-private locks to "open file description locks" · 0d3f7a2d
      Jeff Layton authored
      
      
      File-private locks have been merged into Linux for v3.15, and *now*
      people are commenting that the name and macro definitions for the new
      file-private locks suck.
      
      ...and I can't even disagree. The names and command macros do suck.
      
      We're going to have to live with these for a long time, so it's
      important that we be happy with the names before we're stuck with them.
      The consensus on the lists so far is that they should be rechristened as
      "open file description locks".
      
      The name isn't a big deal for the kernel, but the command macros are not
      visually distinct enough from the traditional POSIX lock macros. The
      glibc and documentation folks are recommending that we change them to
      look like F_OFD_{GETLK|SETLK|SETLKW}. That lessens the chance that a
      programmer will typo one of the commands wrong, and also makes it easier
      to spot this difference when reading code.
      
      This patch makes the following changes that I think are necessary before
      v3.15 ships:
      
      1) rename the command macros to their new names. These end up in the uapi
         headers and so are part of the external-facing API. It turns out that
         glibc doesn't actually use the fcntl.h uapi header, but it's hard to
         be sure that something else won't. Changing it now is safest.
      
      2) make the the /proc/locks output display these as type "OFDLCK"
      
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Carlos O'Donell <carlos@redhat.com>
      Cc: Stefan Metzmacher <metze@samba.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Frank Filz <ffilzlnx@mindspring.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: default avatarJeff Layton <jlayton@redhat.com>
      0d3f7a2d
  31. Apr 07, 2014
  32. Mar 31, 2014
    • Jeff Layton's avatar
      locks: add new fcntl cmd values for handling file private locks · 5d50ffd7
      Jeff Layton authored
      
      
      Due to some unfortunate history, POSIX locks have very strange and
      unhelpful semantics. The thing that usually catches people by surprise
      is that they are dropped whenever the process closes any file descriptor
      associated with the inode.
      
      This is extremely problematic for people developing file servers that
      need to implement byte-range locks. Developers often need a "lock
      management" facility to ensure that file descriptors are not closed
      until all of the locks associated with the inode are finished.
      
      Additionally, "classic" POSIX locks are owned by the process. Locks
      taken between threads within the same process won't conflict with one
      another, which renders them useless for synchronization between threads.
      
      This patchset adds a new type of lock that attempts to address these
      issues. These locks conflict with classic POSIX read/write locks, but
      have semantics that are more like BSD locks with respect to inheritance
      and behavior on close.
      
      This is implemented primarily by changing how fl_owner field is set for
      these locks. Instead of having them owned by the files_struct of the
      process, they are instead owned by the filp on which they were acquired.
      Thus, they are inherited across fork() and are only released when the
      last reference to a filp is put.
      
      These new semantics prevent them from being merged with classic POSIX
      locks, even if they are acquired by the same process. These locks will
      also conflict with classic POSIX locks even if they are acquired by
      the same process or on the same file descriptor.
      
      The new locks are managed using a new set of cmd values to the fcntl()
      syscall. The initial implementation of this converts these values to
      "classic" cmd values at a fairly high level, and the details are not
      exposed to the underlying filesystem. We may eventually want to push
      this handing out to the lower filesystem code but for now I don't
      see any need for it.
      
      Also, note that with this implementation the new cmd values are only
      available via fcntl64() on 32-bit arches. There's little need to
      add support for legacy apps on a new interface like this.
      
      Signed-off-by: default avatarJeff Layton <jlayton@redhat.com>
      5d50ffd7
    • J. Bruce Fields's avatar
      locks: fix posix lock range overflow handling · ef12e72a
      J. Bruce Fields authored
      
      
      In the 32-bit case fcntl assigns the 64-bit f_pos and i_size to a 32-bit
      off_t.
      
      The existing range checks also seem to depend on signed arithmetic
      wrapping when it overflows.  In practice maybe that works, but we can be
      more careful.  That also allows us to make a more reliable distinction
      between -EINVAL and -EOVERFLOW.
      
      Note that in the 32-bit case SEEK_CUR or SEEK_END might allow the caller
      to set a lock with starting point no longer representable as a 32-bit
      value.  We could return -EOVERFLOW in such cases, but the locks code is
      capable of handling such ranges, so we choose to be lenient here.  The
      only problem is that subsequent GETLK calls on such a lock will fail
      with EOVERFLOW.
      
      While we're here, do some cleanup including consolidating code for the
      flock and flock64 cases.
      
      Signed-off-by: default avatarJ. Bruce Fields <bfields@fieldses.org>
      Signed-off-by: default avatarJeff Layton <jlayton@redhat.com>
      ef12e72a
  33. Mar 04, 2014
    • Heiko Carstens's avatar
      compat: let architectures define __ARCH_WANT_COMPAT_SYS_GETDENTS64 · 0473c9b5
      Heiko Carstens authored
      
      
      For architecture dependent compat syscalls in common code an architecture
      must define something like __ARCH_WANT_<WHATEVER> if it wants to use the
      code.
      This however is not true for compat_sys_getdents64 for which architectures
      must define __ARCH_OMIT_COMPAT_SYS_GETDENTS64 if they do not want the code.
      
      This leads to the situation where all architectures, except mips, get the
      compat code but only x86_64, arm64 and the generic syscall architectures
      actually use it.
      
      So invert the logic, so that architectures actively must do something to
      get the compat code.
      
      This way a couple of architectures get rid of otherwise dead code.
      
      Signed-off-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
      0473c9b5
  34. Feb 24, 2014
    • James Hogan's avatar
      asm-generic: add sched_setattr/sched_getattr syscalls · e6cfc029
      James Hogan authored
      
      
      Add the sched_setattr and sched_getattr syscalls to the generic syscall
      list, which is used by the following architectures: arc, arm64, c6x,
      hexagon, metag, openrisc, score, tile, unicore32.
      
      Signed-off-by: default avatarJames Hogan <james.hogan@imgtec.com>
      Acked-by: default avatarArnd Bergmann <arnd@arndb.de>
      Acked-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Cc: linux-arch@vger.kernel.org
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: linux-c6x-dev@linux-c6x.org
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: linux-hexagon@vger.kernel.org
      Cc: linux-metag@vger.kernel.org
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: linux@lists.openrisc.net
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      e6cfc029
  35. Jan 24, 2014
Loading