  Oct 26, 2018
    • mm: don't raise MEMCG_OOM event due to failed high-order allocation · 7a1adfdd
      Roman Gushchin authored
      It was reported that on some of our machines containers were restarted
      with OOM symptoms without an obvious reason.  Although there was almost
      no memory pressure and plenty of page cache, the MEMCG_OOM event was
      raised occasionally, causing the container management software to think
      that an OOM had happened.  However, no tasks had been killed.
      
      The investigation showed that the problem is caused by a failing
      attempt to charge a high-order page.  In such a case, the OOM killer
      is never invoked.  As shown below, this can happen under conditions
      which are very far from a real OOM: e.g.  there is plenty of clean
      page cache and no memory pressure.
      
      There is no sense in raising an OOM event in this case, as it might
      confuse a user and lead to wrong and excessive actions (e.g.  restart the
      workload, as in my case).
      
      Let's look at the charging path in try_charge().  If the memory usage
      is close to memory.max, which is absolutely natural for most memory
      cgroups, we try to reclaim some pages.  Even if we were able to
      reclaim enough memory for the allocation, the following check can
      fail due to a race with another concurrent allocation:
      
          if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
              goto retry;
      
      For regular pages the following condition will save us from triggering
      the OOM:
      
         if (nr_reclaimed && nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER))
             goto retry;
      
      But for a high-order allocation this condition will intentionally
      fail.  The reasoning behind it is that we'll likely fall back to
      regular pages anyway, so it's ok and even preferred to return ENOMEM.
      
      In this case the idea of raising MEMCG_OOM looks dubious.
      
      Fix this by moving the raising of MEMCG_OOM into mem_cgroup_oom(),
      after the allocation order check, so that the event won't be raised
      for high-order allocations.  This change doesn't affect regular page
      allocation and charging.
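      
      A minimal sketch of the resulting shape of mem_cgroup_oom() (the
      names follow the surrounding kernel code; treat this as illustrative
      rather than the exact diff):
      
      	static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg,
      					      gfp_t mask, int order)
      	{
      		/*
      		 * A costly allocation will likely fall back to regular
      		 * pages anyway, so a failure here is not a real OOM:
      		 * don't raise the event.
      		 */
      		if (order > PAGE_ALLOC_COSTLY_ORDER)
      			return OOM_SKIPPED;
      
      		/* Only order <= PAGE_ALLOC_COSTLY_ORDER gets here. */
      		memcg_memory_event(memcg, MEMCG_OOM);
      		...
      	}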
      
      Link: http://lkml.kernel.org/r/20181004214050.7417-1-guro@fb.com
      
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: provide kernel parameter to allow disabling page init poisoning · f682a97a
      Alexander Duyck authored
      Patch series "Address issues slowing persistent memory initialization", v5.
      
      The main thing this patch set achieves is that it allows us to
      initialize each node's worth of persistent memory independently.  For
      a 12TB persistent memory setup spread evenly over 4 nodes, this
      reduces page init time by about 2 minutes: instead of taking 30 to 40
      seconds per node and going through the nodes one at a time, we
      process all 4 nodes in parallel.
      
      This patch (of 3):
      
      On systems with a large amount of memory it can take a significant amount
      of time to initialize all of the page structs with the PAGE_POISON_PATTERN
      value.  I have seen it take over 2 minutes to initialize a system with
      over 12TB of RAM.
      
      To work around the issue I had to disable CONFIG_DEBUG_VM; the boot
      time then returned to something much more reasonable, with the
      arch_add_memory call completing in milliseconds rather than seconds.
      However, in doing that I had to disable all of the other VM debugging
      on the system.
      
      To cope with a kernel that might have CONFIG_DEBUG_VM enabled on a
      system with a large amount of memory, I have added a new kernel
      parameter named "vm_debug" that can be set to "-" in order to disable
      page init poisoning.
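      
      For example, booting with the following on the kernel command line
      should disable the page init poisoning while leaving other
      CONFIG_DEBUG_VM checks in place (illustrative):
      
      	vm_debug=-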
      
      Link: http://lkml.kernel.org/r/20180925201921.3576.84239.stgit@localhost.localdomain
      
      
      Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • psi: cgroup support · 2ce7135a
      Johannes Weiner authored
      On a system that executes multiple cgrouped jobs and independent
      workloads, we don't just care about the health of the overall system, but
      also that of individual jobs, so that we can ensure individual job health,
      fairness between jobs, or prioritize some jobs over others.
      
      This patch implements pressure stall tracking for cgroups.  In kernels
      with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure, memory.pressure,
      and io.pressure files that track aggregate pressure stall times for only
      the tasks inside the cgroup.
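      
      A hypothetical read of one of these files might look as follows (the
      cgroup path and values are illustrative, assuming cgroup2 is mounted
      at /sys/fs/cgroup):
      
      	# cat /sys/fs/cgroup/myjob/memory.pressure
      	some avg10=0.22 avg60=0.17 avg300=1.11 total=58761459
      	full avg10=0.00 avg60=0.01 avg300=0.05 total=48374649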
      
      Link: http://lkml.kernel.org/r/20180828172258.3185-10-hannes@cmpxchg.org
      
      
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Daniel Drake <drake@endlessm.com>
      Tested-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • psi: pressure stall information for CPU, memory, and IO · eb414681
      Johannes Weiner authored
      When systems are overcommitted and resources become contended, it's
      hard to tell exactly what impact this has on workload productivity,
      or how close the system is to lockups and OOM kills.  In particular,
      when machines work multiple jobs concurrently, the impact of
      overcommit in terms of latency and throughput on the individual job
      can be enormous.
      
      In order to maximize hardware utilization without sacrificing
      individual job health or risking complete machine lockups, this patch
      implements a way to quantify resource pressure in the system.
      
      A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
      expose the percentage of time the system is stalled on CPU, memory, or IO,
      respectively.  Stall states are aggregate versions of the per-task delay
      accounting delays:
      
             cpu: some tasks are runnable but not executing on a CPU
             memory: tasks are reclaiming, or waiting for swapin or thrashing cache
             io: tasks are waiting for io completions
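      
      For example, the memory file might read along these lines (values
      illustrative):
      
      	# cat /proc/pressure/memory
      	some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722
      	full avg10=1.02 avg60=0.31 avg300=0.12 total=85154757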
      
      These percentages of walltime can be thought of as pressure percentages,
      and they give a general sense of system health and productivity loss
      incurred by resource overcommit.  They can also indicate when the system
      is approaching lockup scenarios and OOMs.
      
      To do this, psi keeps track of the task states associated with each CPU
      and samples the time they spend in stall states.  Every 2 seconds, the
      samples are averaged across CPUs - weighted by the CPUs' non-idle time to
      eliminate artifacts from unused CPUs - and translated into percentages of
      walltime.  A running average of those percentages is maintained over 10s,
      1m, and 5m periods (similar to the load average).
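      
      The running averages follow the usual exponential-decay pattern; a
      minimal userspace sketch of the idea (the decay constant is an
      assumption mirroring the load average, not the kernel's fixed-point
      code):
      
      	#include <math.h>
      
      	/* Fold one 2-second sample (stall share of walltime, in percent)
      	 * into a running average over the given period (10/60/300s). */
      	static double psi_avg(double avg, double sample_pct, double period)
      	{
      		double decay = exp(-2.0 / period);	/* weight of history */
      
      		return avg * decay + sample_pct * (1.0 - decay);
      	}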
      
      [hannes@cmpxchg.org: doc fixlet, per Randy]
        Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
      [hannes@cmpxchg.org: code optimization]
        Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
      [hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
        Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
      [hannes@cmpxchg.org: fix build]
        Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
      Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
      
      
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Daniel Drake <drake@endlessm.com>
      Tested-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, proc: add KReclaimable to /proc/meminfo · 61f94e18
      Vlastimil Babka authored
      The vmstat NR_KERNEL_MISC_RECLAIMABLE counter is for kernel non-slab
      allocations that can be reclaimed via a shrinker.  In /proc/meminfo,
      we can show the sum of all reclaimable kernel allocations (including
      slab) as "KReclaimable".  Add the same counter to the per-node
      meminfo under /sys as well.
      
      With this counter, users will have more complete information about
      kernel memory usage.  Non-slab reclaimable pages (currently just the
      ION allocator) will no longer be missing from /proc/meminfo, so users
      won't be left wondering where part of their memory went.  More
      precisely, they already appear in MemAvailable, but without the new
      counter it's not obvious why the value in MemAvailable doesn't fully
      correspond to the sum of the other counters participating in it.
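      
      After this change, /proc/meminfo gains an entry along these lines
      (the value is illustrative):
      
      	KReclaimable:     168048 kB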
      
      Link: http://lkml.kernel.org/r/20180731090649.16028-6-vbabka@suse.cz
      
      
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Vijayanand Jitta <vjitta@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove references to vm_insert_pfn() · 67fa1666
      Matthew Wilcox authored
      Remove the remaining references to vm_insert_pfn() in documentation
      and comments.
      
      Link: http://lkml.kernel.org/r/20180828145728.11873-7-willy@infradead.org
      
      
      Signed-off-by: Matthew Wilcox <willy@infradead.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slub: extend slub debug to handle multiple slabs · c5fd3ca0
      Aaron Tomlin authored
      Extend the slub_debug syntax to "slub_debug=<flags>[,<slub>]*", where
      <slub> may contain an asterisk at the end.  For example, the following
      would poison all kmalloc slabs:
      
      	slub_debug=P,kmalloc*
      
      and the following would apply the default flags to all kmalloc and all
      block IO slabs:
      
      	slub_debug=,bio*,kmalloc*
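      
      The trailing asterisk is a simple prefix match.  A minimal sketch of
      such matching, as one might implement it (illustrative, not the exact
      kernel code):
      
      	static bool slab_matches(const char *pattern, const char *name)
      	{
      		size_t len = strlen(pattern);
      
      		/* "kmalloc*" selects every cache named "kmalloc..." */
      		if (len && pattern[len - 1] == '*')
      			return !strncmp(name, pattern, len - 1);
      
      		return !strcmp(name, pattern);
      	}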
      
      Please note that a similar patch was posted by Iliyan Malchev some time
      ago but was never merged:
      
      	https://marc.info/?l=linux-mm&m=131283905330474&w=2
      
      Link: http://lkml.kernel.org/r/20180928111139.27962-1-atomlin@redhat.com
      
      
      Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Iliyan Malchev <malchev@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • KEYS: Implement PKCS#8 RSA Private Key parser [ver #2] · 3c58b236
      David Howells authored
      
      
      Implement PKCS#8 RSA Private Key format [RFC 5208] parser for the
      asymmetric key type.  For the moment, this will only support unencrypted
      DER blobs.  PEM and decryption can be added later.
      
      PKCS#8 keys can be loaded like this:
      
      	openssl pkcs8 -in private_key.pem -topk8 -nocrypt -outform DER | \
      	  keyctl padd asymmetric foo @s
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-by: Marcel Holtmann <marcel@holtmann.org>
      Reviewed-by: Marcel Holtmann <marcel@holtmann.org>
      Reviewed-by: Denis Kenzior <denkenz@gmail.com>
      Tested-by: Denis Kenzior <denkenz@gmail.com>
      Signed-off-by: James Morris <james.morris@microsoft.com>
    • KEYS: Provide missing asymmetric key subops for new key type ops [ver #2] · 5a307718
      David Howells authored
      
      
      Provide the missing asymmetric key subops for the new key type ops.
      These include query, encrypt, decrypt and create signature; verify
      signature already exists.  Also provided are accessor functions for
      these:
      
      	int query_asymmetric_key(const struct key *key,
      				 struct kernel_pkey_query *info);
      
      	int encrypt_blob(struct kernel_pkey_params *params,
      			 const void *data, void *enc);
      	int decrypt_blob(struct kernel_pkey_params *params,
      			 const void *enc, void *data);
      	int create_signature(struct kernel_pkey_params *params,
      			     const void *data, void *enc);
      
      The public_key_signature struct gains an encoding field to carry the
      encoding for verify_signature().
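      
      As an illustration, a kernel-side caller might query a key roughly
      like this (a sketch with error handling trimmed; the field names are
      those of kernel_pkey_query, and KEYCTL_SUPPORTS_SIGN is the
      corresponding supported-ops flag):
      
      	struct kernel_pkey_query info;
      
      	if (query_asymmetric_key(key, &info) == 0 &&
      	    (info.supported_ops & KEYCTL_SUPPORTS_SIGN))
      		pr_info("%u-bit key, max sig %u bytes\n",
      			info.key_size, info.max_sig_size);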
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-by: Marcel Holtmann <marcel@holtmann.org>
      Reviewed-by: Marcel Holtmann <marcel@holtmann.org>
      Reviewed-by: Denis Kenzior <denkenz@gmail.com>
      Tested-by: Denis Kenzior <denkenz@gmail.com>
      Signed-off-by: James Morris <james.morris@microsoft.com>
    • KEYS: Provide keyctls to drive the new key type ops for asymmetric keys [ver #2] · 00d60fd3
      David Howells authored
      
      
      Provide five keyctl functions that permit userspace to make use of the new
      key type ops for accessing and driving asymmetric keys.
      
       (*) Query an asymmetric key.
      
      	long keyctl(KEYCTL_PKEY_QUERY,
      		    key_serial_t key, unsigned long reserved,
      		    struct keyctl_pkey_query *info);
      
           Get information about an asymmetric key.  The information is returned
           in the keyctl_pkey_query struct:
      
      	__u32	supported_ops;
      
           A bit mask of flags indicating which ops are supported.  This is
           constructed from a bitwise-OR of:
      
      	KEYCTL_SUPPORTS_{ENCRYPT,DECRYPT,SIGN,VERIFY}
      
      	__u32	key_size;
      
           The size in bits of the key.
      
      	__u16	max_data_size;
      	__u16	max_sig_size;
      	__u16	max_enc_size;
      	__u16	max_dec_size;
      
           The maximum sizes in bytes of a blob of data to be signed, a signature
           blob, a blob to be encrypted and a blob to be decrypted.
      
           reserved must be set to 0.  This is intended for future use to hand
           over one or more passphrases needed to unlock a key.
      
           If successful, 0 is returned.  If the key is not an asymmetric key,
           EOPNOTSUPP is returned.
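      
           For instance, a userspace query following the prototype above
           might look like this (a sketch, assuming the keyutils keyctl()
           wrapper):
      
      	struct keyctl_pkey_query info;
      
      	if (keyctl(KEYCTL_PKEY_QUERY, key, 0UL, &info) == 0 &&
      	    (info.supported_ops & KEYCTL_SUPPORTS_SIGN))
      		printf("can sign, max signature %u bytes\n",
      		       info.max_sig_size);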
      
       (*) Encrypt, decrypt, sign or verify a blob using an asymmetric key.
      
      	long keyctl(KEYCTL_PKEY_ENCRYPT,
      		    const struct keyctl_pkey_params *params,
      		    const char *info,
      		    const void *in,
      		    void *out);
      
      	long keyctl(KEYCTL_PKEY_DECRYPT,
      		    const struct keyctl_pkey_params *params,
      		    const char *info,
      		    const void *in,
      		    void *out);
      
      	long keyctl(KEYCTL_PKEY_SIGN,
      		    const struct keyctl_pkey_params *params,
      		    const char *info,
      		    const void *in,
      		    void *out);
      
      	long keyctl(KEYCTL_PKEY_VERIFY,
      		    const struct keyctl_pkey_params *params,
      		    const char *info,
      		    const void *in,
      		    const void *in2);
      
           Use an asymmetric key to perform a public-key cryptographic operation
           on a blob of data.
      
           The parameter block pointed to by params contains a number of integer
           values:
      
      	__s32		key_id;
      	__u32		in_len;
      	__u32		out_len;
      	__u32		in2_len;
      
           For a given operation, the in and out buffers are used as follows:
      
      	Operation ID		in,in_len	out,out_len	in2,in2_len
      	=======================	===============	===============	===========
      	KEYCTL_PKEY_ENCRYPT	Raw data	Encrypted data	-
      	KEYCTL_PKEY_DECRYPT	Encrypted data	Raw data	-
      	KEYCTL_PKEY_SIGN	Raw data	Signature	-
      	KEYCTL_PKEY_VERIFY	Raw data	-		Signature
      
           info is a string of key=value pairs that supply supplementary
           information.
      
           The __spare space in the parameter block must be set to 0.  This is
           intended, amongst other things, to allow the passing of passphrases
           required to unlock a key.
      
           If successful, encrypt, decrypt and sign all return the amount of data
           written into the output buffer.  Verification returns 0 on success.
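      
           As a sketch, a signing call might then look like this (the buffer
           sizes would come from KEYCTL_PKEY_QUERY, and the contents of the
           info string are an assumption):
      
      	struct keyctl_pkey_params params = {
      		.key_id  = key,
      		.in_len  = sizeof(digest),	/* raw data: the hash */
      		.out_len = sizeof(sig),		/* signature buffer */
      	};
      	long n;
      
      	n = keyctl(KEYCTL_PKEY_SIGN, &params, "enc=pkcs1 hash=sha256",
      		   digest, sig);
      	/* on success, n is the number of signature bytes written */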
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-by: Marcel Holtmann <marcel@holtmann.org>
      Reviewed-by: Marcel Holtmann <marcel@holtmann.org>
      Reviewed-by: Denis Kenzior <denkenz@gmail.com>
      Tested-by: Denis Kenzior <denkenz@gmail.com>
      Signed-off-by: James Morris <james.morris@microsoft.com>
    • KEYS: Provide key type operations for asymmetric key ops [ver #2] · 70025f84
      David Howells authored
      
      
      Provide five new operations in the key_type struct that can be used to
      provide access to asymmetric key operations.  These will be implemented for
      the asymmetric key type in a later patch and may refer to a key retained in
      RAM by the kernel or a key retained in crypto hardware.
      
           int (*asym_query)(const struct kernel_pkey_params *params,
      		       struct kernel_pkey_query *info);
           int (*asym_eds_op)(struct kernel_pkey_params *params,
      			const void *in, void *out);
           int (*asym_verify_signature)(struct kernel_pkey_params *params,
      			          const void *in, const void *in2);
      
      Since encrypt, decrypt and sign are identical in their interfaces, they're
      rolled together in the asym_eds_op() operation and there's an operation ID
      in the params argument to distinguish them.
      
      Verify is different in that we supply the data and the signature and
      get back only an error value (or 0), on the expectation that this may
      well be how a hardware crypto device works.
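      
      A key type wiring up the new ops might then look like this (the
      handler names are hypothetical):
      
      	static struct key_type key_type_example = {
      		.name			= "example",
      		/* ... existing ops ... */
      		.asym_query		= example_query,
      		.asym_eds_op		= example_eds_op,
      		.asym_verify_signature	= example_verify,
      	};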
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-by: Marcel Holtmann <marcel@holtmann.org>
      Reviewed-by: Marcel Holtmann <marcel@holtmann.org>
      Reviewed-by: Denis Kenzior <denkenz@gmail.com>
      Tested-by: Denis Kenzior <denkenz@gmail.com>
      Signed-off-by: James Morris <james.morris@microsoft.com>
    • bpf: add bpf_jit_limit knob to restrict unpriv allocations · ede95a63
      Daniel Borkmann authored
      
      
      Rick reported that the BPF JIT could potentially fill the entire
      module space with BPF programs from unprivileged users, which would
      prevent later attempts to load normal kernel modules or privileged
      BPF programs, for example.  If the JIT was enabled but failed to
      generate the image, then before commit 290af866 ("bpf: introduce
      BPF_JIT_ALWAYS_ON config") we would always fall back to the BPF
      interpreter.  Nowadays, when CONFIG_BPF_JIT_ALWAYS_ON is set, the
      load will instead abort with a failure, since the BPF interpreter
      has been compiled out.
      
      Add a global limit and enforce it for unprivileged users, such that
      if the BPF interpreter has been compiled out we fail once the limit
      is reached, and otherwise we fall back to the BPF interpreter
      earlier, without using module memory.  As a next step, fair sharing
      among unprivileged users can be resolved, in particular for the case
      where we would otherwise fail hard once the limit is reached.
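      
      A simplified sketch of the charging logic (the names follow the
      patch, but treat this as illustrative rather than the exact code):
      
      	static atomic_long_t bpf_jit_current;
      
      	static int bpf_jit_charge_modmem(u32 pages)
      	{
      		if (atomic_long_add_return(pages, &bpf_jit_current) >
      		    bpf_jit_limit && !capable(CAP_SYS_ADMIN)) {
      			atomic_long_sub(pages, &bpf_jit_current);
      			return -EPERM;
      		}
      
      		return 0;
      	}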
      
      Fixes: 290af866 ("bpf: introduce BPF_JIT_ALWAYS_ON config")
      Fixes: 0a14842f ("net: filter: Just In Time compiler for x86-64")
      Co-Developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: LKML <linux-kernel@vger.kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>