  1. Oct 31, 2019
    • tcp: increase tcp_max_syn_backlog max value · 623d0c2d
      Eric Dumazet authored
      
      
      tcp_max_syn_backlog default value depends on memory size
      and TCP ehash size. Before this patch, the max value
      was 2048 [1], which is considered too small nowadays.
      
      Increase it to 4096 to match the recent SOMAXCONN change.
      
      [1] This is with TCP ehash size being capped to 524288 buckets.
      
       Signed-off-by: Eric Dumazet <edumazet@google.com>
       Cc: Willy Tarreau <w@1wt.eu>
       Cc: Yue Cao <ycao009@ucr.edu>
       Signed-off-by: David S. Miller <davem@davemloft.net>
      623d0c2d
    • net: increase SOMAXCONN to 4096 · 19f92a03
      Eric Dumazet authored
      
      
       SOMAXCONN is the default value of /proc/sys/net/core/somaxconn.
       
       It was defined as 128 more than 20 years ago.
       
       Since it caps listen() backlog values, this very small value has
       caused numerous problems over the years, and many people had
       to raise it on their hosts after being hit by problems.
      
       Google has been using 1024 for at least 15 years, and we increased
       this to 4096 after the TCP listener rework was completed, more than
       4 years ago. We got no complaints about this change breaking any
       legacy application.
      
       Many applications indeed set up a TCP listener with listen(fd, -1),
       meaning they let the system select the backlog.
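       
       As an illustration only (not part of the commit message), here is a
       minimal user-space sketch of the pattern described above, where the
       backlog passed to listen() is left to the system; the kernel clamps
       whatever is requested to net.core.somaxconn:
       
          /* Sketch: a TCP listener that defers the backlog choice to the
           * kernel.  listen() clamps the requested backlog to the
           * net.core.somaxconn sysctl, so a negative or very large value
           * effectively means "use the system maximum". */
          #include <arpa/inet.h>
          #include <netinet/in.h>
          #include <string.h>
          #include <sys/socket.h>
          #include <unistd.h>
          
          int make_listener(unsigned short port)
          {
                  struct sockaddr_in addr;
                  int fd = socket(AF_INET, SOCK_STREAM, 0);
          
                  if (fd < 0)
                          return -1;
          
                  memset(&addr, 0, sizeof(addr));
                  addr.sin_family = AF_INET;
                  addr.sin_addr.s_addr = htonl(INADDR_ANY);
                  addr.sin_port = htons(port);
          
                  if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
                      listen(fd, -1) < 0) {   /* backlog chosen by the system */
                          close(fd);
                          return -1;
                  }
                  return fd;
          }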
      
       Raising SOMAXCONN lowers the chance of the port becoming unavailable
       under even a small SYN flood attack, and reduces the possibility of
       side-channel vulnerabilities.
      
       Signed-off-by: Eric Dumazet <edumazet@google.com>
       Cc: Willy Tarreau <w@1wt.eu>
       Cc: Yue Cao <ycao009@ucr.edu>
       Signed-off-by: David S. Miller <davem@davemloft.net>
      19f92a03
    • arm64: cpufeature: Enable Qualcomm Falkor errata 1009 for Kryo · 36c602dc
      Bjorn Andersson authored
      
      
      The Kryo cores share errata 1009 with Falkor, so add their model
      definitions and enable it for them as well.
      
       Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
       [will: Update entry in silicon-errata.rst]
       Signed-off-by: Will Deacon <will@kernel.org>
      36c602dc
  2. Oct 07, 2019
    • mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two) · 59bb4798
      Vlastimil Babka authored
      In most configurations, kmalloc() happens to return naturally aligned
      (i.e.  aligned to the block size itself) blocks for power of two sizes.
      
       That means some kmalloc() users might unknowingly rely on that
       alignment, until stuff breaks when the kernel is built with e.g.
       CONFIG_SLUB_DEBUG or CONFIG_SLOB, and blocks stop being aligned.  Then
       developers have to devise workarounds such as their own kmem caches
       with specified alignment [1], which is not always practical, as
       recently evidenced in [2].
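       
       As a hedged illustration of the workaround mentioned above (not code
       from this patch), a caller that needed guaranteed alignment had to
       create its own cache with an explicit align argument:
       
          /* Sketch of the pre-patch workaround: instead of relying on
           * kmalloc(512, ...) happening to return 512-byte-aligned memory,
           * create a dedicated cache with explicit alignment. */
          #include <linux/errno.h>
          #include <linux/init.h>
          #include <linux/slab.h>
          
          static struct kmem_cache *my_cache;
          
          static int __init my_cache_init(void)
          {
                  my_cache = kmem_cache_create("my_aligned_512", 512, 512,
                                               0, NULL);
                  if (!my_cache)
                          return -ENOMEM;
                  return 0;
          }
          
          /* Objects from kmem_cache_alloc(my_cache, GFP_KERNEL) are then
           * guaranteed to be 512-byte aligned. */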
      
       The topic has been discussed at LSF/MM 2019 [3].  Adding a
       'kmalloc_aligned()' variant would not help with code unknowingly relying
       on the implicit alignment.  For slab implementations it would either
       require creating more kmalloc caches, or allocating a larger size and
       only giving back part of it.  That would be wasteful, especially with a
       generic alignment parameter (in contrast with a fixed alignment to size).
      
       Ideally we should provide mm users with what they need without difficult
       workarounds or their own reimplementations, so let's make the kmalloc()
       alignment to size explicitly guaranteed for power-of-two sizes under all
       configurations.  What does this mean for the three available allocators?
      
       * SLAB object layout happens to be mostly unchanged by the patch.  The
         implicitly provided alignment could be compromised with
         CONFIG_DEBUG_SLAB due to redzoning; however, SLAB disables redzoning
         for caches with alignment larger than unsigned long long.  Practically,
         on at least x86 this includes kmalloc caches, as they use cache line
         alignment, which is larger than that.  Still, this patch ensures
         alignment on all arches and cache sizes.
      
      * SLUB layout is also unchanged unless redzoning is enabled through
        CONFIG_SLUB_DEBUG and boot parameter for the particular kmalloc cache.
        With this patch, explicit alignment is guaranteed with redzoning as
        well.  This will result in more memory being wasted, but that should be
        acceptable in a debugging scenario.
      
       * SLOB has no implicit alignment, so this patch adds it explicitly for
         kmalloc().  The potential downside is increased fragmentation.  While
         pathological allocation scenarios are certainly possible, in my testing,
         after booting an x86_64 kernel+userspace with virtme, around 16MB of
         memory was consumed by slab pages both before and after the patch, with
         the difference in the noise.
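       
       To make the new guarantee concrete, here is a hedged sketch (not from
       the patch) of what a kmalloc() user may now rely on for power-of-two
       sizes:
       
          /* Illustrative only: with this patch, a power-of-two kmalloc()
           * size comes back naturally aligned on all three allocators, so
           * the low bits of the returned pointer are zero. */
          #include <linux/kernel.h>
          #include <linux/slab.h>
          
          static void check_kmalloc_alignment(void)
          {
                  void *p = kmalloc(4096, GFP_KERNEL);
          
                  if (!p)
                          return;
                  /* 4096 is a power of two, so p must be 4096-byte aligned. */
                  WARN_ON(!IS_ALIGNED((unsigned long)p, 4096));
                  kfree(p);
          }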
      
      [1] https://lore.kernel.org/linux-btrfs/c3157c8e8e0e7588312b40c853f65c02fe6c957a.1566399731.git.christophe.leroy@c-s.fr/
      [2] https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming.lei@redhat.com/
      [3] https://lwn.net/Articles/787740/
      
      [akpm@linux-foundation.org: documentation fixlet, per Matthew]
      Link: http://lkml.kernel.org/r/20190826111627.7505-3-vbabka@suse.cz
      
      
       Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
       Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
       Acked-by: Michal Hocko <mhocko@suse.com>
       Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
       Acked-by: Christoph Hellwig <hch@lst.de>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: "Darrick J . Wong" <darrick.wong@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      59bb4798
    • mm, memcg: proportional memory.{low,min} reclaim · 9783aa99
      Chris Down authored
      cgroup v2 introduces two memory protection thresholds: memory.low
      (best-effort) and memory.min (hard protection).  While they generally do
      what they say on the tin, there is a limitation in their implementation
      that makes them difficult to use effectively: that cliff behaviour often
      manifests when they become eligible for reclaim.  This patch implements
      more intuitive and usable behaviour, where we gradually mount more
      reclaim pressure as cgroups further and further exceed their protection
      thresholds.
      
       This cliff edge behaviour happens because we only choose whether or not
       to reclaim based on whether the memcg is within its protection limits
       (see the use of mem_cgroup_protected in shrink_node), but we don't vary
       our reclaim behaviour based on this information.  Imagine the following
       timeline, with the numbers being the lruvec size in this zone:
      
      1. memory.low=1000000, memory.current=999999. 0 pages may be scanned.
      2. memory.low=1000000, memory.current=1000000. 0 pages may be scanned.
      3. memory.low=1000000, memory.current=1000001. 1000001* pages may be
         scanned. (?!)
      
      * Of course, we won't usually scan all available pages in the zone even
        without this patch because of scan control priority, over-reclaim
        protection, etc.  However, as shown by the tests at the end, these
        techniques don't sufficiently throttle such an extreme change in input,
        so cliff-like behaviour isn't really averted by their existence alone.
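       
       As a rough, hedged sketch of the idea (simplified, illustrative code,
       not the exact logic added by this commit), reclaim pressure can be
       scaled by how far usage exceeds the protection threshold instead of
       being switched on all at once:
       
          /* Simplified illustration of proportional memory.low reclaim:
           * rather than scanning either 0 pages (usage <= protection) or
           * everything (usage > protection), scale the scan target by the
           * unprotected fraction of the cgroup.  Names and arithmetic are
           * illustrative, not the commit's code. */
          static unsigned long proportional_scan(unsigned long lruvec_size,
                                                 unsigned long usage,
                                                 unsigned long protection)
          {
                  if (!usage || usage <= protection)
                          return 0;
          
                  /* Just over the threshold -> tiny scan target;
                   * far over the threshold -> close to a full scan. */
                  return lruvec_size - (lruvec_size * protection) / usage;
          }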
      
      Here's an example of how this plays out in practice.  At Facebook, we are
      trying to protect various workloads from "system" software, like
      configuration management tools, metric collectors, etc (see this[0] case
      study).  In order to find a suitable memory.low value, we start by
      determining the expected memory range within which the workload will be
      comfortable operating.  This isn't an exact science -- memory usage deemed
      "comfortable" will vary over time due to user behaviour, differences in
      composition of work, etc, etc.  As such we need to ballpark memory.low,
      but doing this is currently problematic:
      
      1. If we end up setting it too low for the workload, it won't have
         *any* effect (see discussion above).  The group will receive the full
         weight of reclaim and won't have any priority while competing with the
         less important system software, as if we had no memory.low configured
         at all.
      
      2. Because of this behaviour, we end up erring on the side of setting
         it too high, such that the comfort range is reliably covered.  However,
         protected memory is completely unavailable to the rest of the system,
         so we might cause undue memory and IO pressure there when we *know* we
         have some elasticity in the workload.
      
      3. Even if we get the value totally right, smack in the middle of the
         comfort zone, we get extreme jumps between no pressure and full
         pressure that cause unpredictable pressure spikes in the workload due
         to the current binary reclaim behaviour.
      
       With this patch, we can set it to our ballpark estimation without too much
       worry.  Any undesirable behaviour, such as too much or too little reclaim
       pressure on the workload or system, will be proportional to how far our
       estimation is off.  This means we can set memory.low much more
       conservatively and thus waste fewer resources *without* the risk of the
       workload falling off a cliff if we overshoot.
      
      As a more abstract technical description, this unintuitive behaviour
      results in having to give high-priority workloads a large protection
      buffer on top of their expected usage to function reliably, as otherwise
      we have abrupt periods of dramatically increased memory pressure which
      hamper performance.  Having to set these thresholds so high wastes
      resources and generally works against the principle of work conservation.
      In addition, having proportional memory reclaim behaviour has other
      benefits.  Most notably, before this patch it's basically mandatory to set
      memory.low to a higher than desirable value because otherwise as soon as
      you exceed memory.low, all protection is lost, and all pages are eligible
      to scan again.  By contrast, having a gradual ramp in reclaim pressure
      means that you now still get some protection when thresholds are exceeded,
      which means that one can now be more comfortable setting memory.low to
      lower values without worrying that all protection will be lost.  This is
      important because workingset size is really hard to know exactly,
      especially with variable workloads, so at least getting *some* protection
      if your workingset size grows larger than you expect increases user
      confidence in setting memory.low without a huge buffer on top being
      needed.
      
      Thanks a lot to Johannes Weiner and Tejun Heo for their advice and
      assistance in thinking about how to make this work better.
      
      In testing these changes, I intended to verify that:
      
      1. Changes in page scanning become gradual and proportional instead of
         binary.
      
         To test this, I experimented stepping further and further down
         memory.low protection on a workload that floats around 19G workingset
         when under memory.low protection, watching page scan rates for the
         workload cgroup:
      
         +------------+-----------------+--------------------+--------------+
         | memory.low | test (pgscan/s) | control (pgscan/s) | % of control |
         +------------+-----------------+--------------------+--------------+
         |        21G |               0 |                  0 | N/A          |
         |        17G |             867 |               3799 | 23%          |
         |        12G |            1203 |               3543 | 34%          |
         |         8G |            2534 |               3979 | 64%          |
         |         4G |            3980 |               4147 | 96%          |
         |          0 |            3799 |               3980 | 95%          |
         +------------+-----------------+--------------------+--------------+
      
          As you can see, the test kernel (containing this patch) ramps up
          page scanning significantly more gradually than the control kernel
          (without this patch).
      
      2. More gradual ramp up in reclaim aggression doesn't result in
         premature OOMs.
      
         To test this, I wrote a script that slowly increments the number of
         pages held by stress(1)'s --vm-keep mode until a production system
         entered severe overall memory contention.  This script runs in a highly
         protected slice taking up the majority of available system memory.
         Watching vmstat revealed that page scanning continued essentially
         nominally between test and control, without causing forward reclaim
         progress to become arrested.
      
      [0]: https://facebookmicrosites.github.io/cgroup2/docs/overview.html#case-study-the-fbtax2-project
      
      [akpm@linux-foundation.org: reflow block comments to fit in 80 cols]
      [chris@chrisdown.name: handle cgroup_disable=memory when getting memcg protection]
        Link: http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name
      Link: http://lkml.kernel.org/r/20190124014455.GA6396@chrisdown.name
      
      
       Signed-off-by: Chris Down <chris@chrisdown.name>
       Acked-by: Johannes Weiner <hannes@cmpxchg.org>
       Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9783aa99
    • x86/xen: Return from panic notifier · c6875f3a
      Boris Ostrovsky authored
      
      
      Currently execution of panic() continues until Xen's panic notifier
      (xen_panic_event()) is called at which point we make a hypercall that
      never returns.
      
       This means that any notifier that is supposed to be called later, as
       well as a significant part of panic() code (such as pstore writes from
       kmsg_dump()), is never executed.
      
      There is no reason for xen_panic_event() to be this last point in
      execution since panic()'s emergency_restart() will call into
      xen_emergency_restart() from where we can perform our hypercall.
      
       Nevertheless, we will provide a xen_legacy_crash boot option that will
       preserve the original behavior during a crash. This option could be
       used, for example, if running a kernel dumper (which happens after
       panic notifiers) is undesirable.
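       
       For context, an illustrative sketch (not code from this patch) of how
       a panic notifier such as xen_panic_event() hooks into panic(); if one
       callback never returns, every notifier behind it and the rest of
       panic() are skipped:
       
          /* Sketch of a panic notifier registration.  panic() walks
           * panic_notifier_list in order, so a callback that never returns
           * (e.g. one ending in a crash hypercall) starves later notifiers
           * and the remainder of panic(), such as kmsg_dump(). */
          #include <linux/kernel.h>
          #include <linux/notifier.h>
          
          static int example_panic_event(struct notifier_block *nb,
                                         unsigned long event, void *ptr)
          {
                  /* Do minimal work and return so later notifiers still run. */
                  pr_emerg("example panic notifier ran\n");
                  return NOTIFY_DONE;
          }
          
          static struct notifier_block example_panic_block = {
                  .notifier_call = example_panic_event,
          };
          
          static void register_example_panic_notifier(void)
          {
                  atomic_notifier_chain_register(&panic_notifier_list,
                                                 &example_panic_block);
          }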
      
       Reported-by: James Dingwall <james@dingwall.me.uk>
       Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
       Reviewed-by: Juergen Gross <jgross@suse.com>
      c6875f3a
    • hwmon: docs: Extend inspur-ipsps1 title underline · 11c943a1
      Adam Zerella authored
      
      
       Sphinx is generating a build warning because the title underline
       of this file is too short.
      
       Signed-off-by: Adam Zerella <adam.zerella@gmail.com>
       Reviewed-by: Jean Delvare <jdelvare@suse.de>
       Signed-off-by: Guenter Roeck <linux@roeck-us.net>
      11c943a1
    • dt-bindings: media: sun4i-csi: Drop the module clock · 90b32268
      Maxime Ripard authored
      
      
       It turns out that what was thought to be the module clock was actually the
       clock meant to be used by the sensor, and it plays no role in the CSI
       controller itself. Let's drop that clock from our binding.
      
      Fixes: c5e8f4cc ("media: dt-bindings: media: Add Allwinner A10 CSI binding")
       Reported-by: Chen-Yu Tsai <wens@csie.org>
       Signed-off-by: Maxime Ripard <mripard@kernel.org>
      90b32268
    • media: dt-bindings: Fix building error for dt_binding_check · e1056f9b
      Pragnesh Patel authored
      
      
       $id doesn't match the actual filename, so update the $id.
      
      Fixes: c5e8f4cc ("media: dt-bindings: media: Add Allwinner A10 CSI binding")
       Signed-off-by: Pragnesh Patel <pragnesh.patel@sifive.com>
       Signed-off-by: Maxime Ripard <mripard@kernel.org>
      e1056f9b
    • dt-bindings: mediatek: add mutex description for mt8183 display · 41ee3b81
      Yongqiang Niu authored
      
      
       This patch adds a mutex description for the mt8183 display.
      
       Signed-off-by: Yongqiang Niu <yongqiang.niu@mediatek.com>
       Acked-by: Rob Herring <robh@kernel.org>
       Signed-off-by: CK Hu <ck.hu@mediatek.com>
      41ee3b81
    • dt-bindings: mediatek: add dither description for mt8183 display · 4df74719
      Yongqiang Niu authored
      
      
       Update device tree binding documentation for the display subsystem of
       MediaTek MT8183 SoCs.
      
       Signed-off-by: Yongqiang Niu <yongqiang.niu@mediatek.com>
       Reviewed-by: Rob Herring <robh@kernel.org>
       Signed-off-by: CK Hu <ck.hu@mediatek.com>
      4df74719