- Nov 21, 2017
-
-
Kees Cook authored
All users of init_timer() have been updated. Remove the ancient interface. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by:
Kees Cook <keescook@chromium.org>
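For context, a minimal sketch of the replacement pattern that converted users follow, assuming a hypothetical driver structure named foo; timer_setup() and from_timer() are the modern API, but the surrounding code is illustrative only:

    struct foo {
            struct timer_list timer;
            int pending;
    };

    static void foo_timeout(struct timer_list *t)
    {
            struct foo *foo = from_timer(foo, t, timer);

            foo->pending = 0;                       /* timer fired: clear the flag */
    }

    static void foo_init(struct foo *foo)
    {
            foo->pending = 1;
            timer_setup(&foo->timer, foo_timeout, 0);
            mod_timer(&foo->timer, jiffies + HZ);   /* fire in one second */
    }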
-
Dave Hansen authored
Now that CPUs that implement Memory Protection Keys are publicly available we can be a bit less oblique about where it is available. Signed-off-by:
Dave Hansen <dave.hansen@linux.intel.com> Acked-by:
Thomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20171111001228.DC748A10@viggo.jf.intel.com Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
- Nov 20, 2017
-
-
Fabio Estevam authored
Improve the bindings text by doing the following changes:
  - Remove the i.MX53 reference, as the RTC on i.MX53 is different hardware
  - Add 'clocks' to the list of required properties
  - Explain that the optional security violation irq is the second entry
  - Use the real unit address and irq numbers for i.MX25
Signed-off-by:
Fabio Estevam <fabio.estevam@nxp.com> Acked-by:
Juergen Borleis <jbe@pengutronix.de> Acked-by:
Rob Herring <robh@kernel.org> Signed-off-by:
Alexandre Belloni <alexandre.belloni@free-electrons.com>
-
Jonathan Neuschäfer authored
This device's bindings are not trivial: additional properties are documented in Documentation/devicetree/bindings/mfd/mc13xxx.txt. Signed-off-by:
Jonathan Neuschäfer <j.neuschaefer@gmx.net> Reviewed-by:
Fabio Estevam <fabio.estevam@nxp.com> Signed-off-by:
Rob Herring <robh@kernel.org>
-
Randy Dunlap authored
Correct the formatting of several recent additions to the profile= option by using <profiletype> and listing the choices for it. Signed-off-by:
Randy Dunlap <rdunlap@infradead.org> Signed-off-by:
Jonathan Corbet <corbet@lwn.net>
-
Randy Dunlap authored
Drop CONFIG_VIDEO_400_HACK info completely. Drop CONFIG_VIDEO_RETAIN and CONFIG_VIDEO_LOCAL completely. Drop CONFIG_VIDEO_COMPACT and CONFIG_VIDEO_VESA info completely. Drop CONFIG_VIDEO_SVGA info since it has been removed. Drop chapter number & section number references since they are wrong. Drop the (bad) ftp URL for the 800x600 Thinkpad XF86Config. Rename CONFIG_VIDEO_GFX_HACK to VIDEO_GFX_HACK, since it is not a Kconfig symbol, and to match the source code; build options are controlled by the kernel kconfig utility. Signed-off-by:
Randy Dunlap <rdunlap@infradead.org> Acked-By:
Martin Mares <mj@ucw.cz> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by:
Jonathan Corbet <corbet@lwn.net>
-
SeongJae Park authored
This commit applies an upstream change, commit d92f842b ("memory-barriers.txt: Fix typo in pairing example"), to the Korean translation. Signed-off-by:
SeongJae Park <sj38.park@gmail.com> Signed-off-by:
Jonathan Corbet <corbet@lwn.net>
-
SeongJae Park authored
This commit applies two upstream changes, commit f1ab25a3 ("memory-barriers: Replace uses of "transitive"") and commit 0902b1f4 ("memory-barriers: Rework multicopy-atomicity section"), to the Korean translation. The two changes are applied in a single commit because the second is an improvement of the first. Signed-off-by:
SeongJae Park <sj38.park@gmail.com> Signed-off-by:
Jonathan Corbet <corbet@lwn.net>
-
Greg Kroah-Hartman authored
Sometimes a single patch is the result of multiple authors. As git can only have one "author" of a patch, it is still good to properly give credit to the other developers of a commit. To address this, document the "Co-Developed-by:" tag, which can be used to show the other authors of a patch. Note that these other authors must also provide a Signed-off-by: tag, as it is their work that is being submitted here. Reported-by:
Thomas Gleixner <tglx@linutronix.de> Signed-off-by:
Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by:
Thomas Gleixner <tglx@linutronix.de> Acked-by:
Borislav Petkov <bp@suse.de> Signed-off-by:
Jonathan Corbet <corbet@lwn.net>
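As an illustration of the tag described in the entry above, a hypothetical trailer block for a patch with two authors might look like the following (names and addresses are made up; each co-developer also adds their own Signed-off-by, as the entry requires):

    Signed-off-by: First Author <first.author@example.com>
    Co-Developed-by: Second Author <second.author@example.com>
    Signed-off-by: Second Author <second.author@example.com>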
-
- Nov 19, 2017
-
-
Logan Gunthorpe authored
The switchtec_ntb driver has a couple of requirements on the switchtec's hardware configuration, so add these notes to the documentation. Signed-off-by:
Logan Gunthorpe <logang@deltatee.com> Reviewed-by:
Stephen Bates <sbates@raithlin.com> Reviewed-by:
Kurt Schwemmer <kurt.schwemmer@microsemi.com> Acked-by:
Allen Hubbe <Allen.Hubbe@dell.com> Signed-off-by:
Jon Mason <jdmason@kudzu.us>
-
- Nov 18, 2017
-
-
Bjørn Forsman authored
Most places use pwd and rely on $PATH lookup; move the remaining absolute-path /bin/pwd users over for consistency. Also, a reason for doing /bin/pwd -> pwd instead of the other way around is that build systems should make as few assumptions as possible about the host filesystem layout. Case in point: we already do this kind of patching in NixOS. Ref. commit 028568d8 ("kbuild: revert $(realpath ...) to $(shell cd ... && /bin/pwd)"). Signed-off-by:
Bjørn Forsman <bjorn.forsman@gmail.com> Signed-off-by:
Masahiro Yamada <yamada.masahiro@socionext.com>
-
Victor Chibotaru authored
The updated documentation describes the new KCOV mode for collecting comparison operands. Link: http://lkml.kernel.org/r/20171011095459.70721-3-glider@google.com Signed-off-by:
Victor Chibotaru <tchibo@google.com> Signed-off-by:
Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Popov <alex.popov@linux.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Kees Cook <keescook@chromium.org> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: <syzkaller@googlegroups.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
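A hedged sketch of how a userspace fuzzer might enable the comparison-collection mode mentioned above; the ioctl flow follows the kcov debugfs interface, but buffer decoding is deliberately left out and the constants are defined locally as assumptions:

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define KCOV_INIT_TRACE _IOR('c', 1, unsigned long)
    #define KCOV_ENABLE     _IO('c', 100)
    #define KCOV_DISABLE    _IO('c', 101)
    #define KCOV_TRACE_CMP  1
    #define COVER_SIZE      (64 << 10)

    int main(void)
    {
            int fd = open("/sys/kernel/debug/kcov", O_RDWR);
            unsigned long *cover;

            if (fd < 0)
                    return 1;
            /* size the per-task buffer, map it, then enable CMP tracing */
            ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE);
            cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
                         PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (cover == MAP_FAILED)
                    return 1;
            ioctl(fd, KCOV_ENABLE, KCOV_TRACE_CMP);
            /* ... run the code under test, then read comparison records
             * from cover[] (cover[0] holds the number of records) ... */
            ioctl(fd, KCOV_DISABLE, 0);
            close(fd);
            return 0;
    }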
-
Kangmin Park authored
Link: http://lkml.kernel.org/r/CAKW4uUyCi=PnKf3epgFVz8z=1tMtHSOHNm+fdNxrNw3-THvRCA@mail.gmail.com Signed-off-by:
Kangmin Park <l4stpr0gr4m@gmail.com> Cc: Jiri Kosina <trivial@kernel.org> Cc: Alan Cox <alan@redhat.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Randy Dunlap authored
Fix a minor typo, and add missing words to the explanation of how the last line number is parsed. Link: http://lkml.kernel.org/r/ebb7ff42-4945-103f-d5b4-f07a6f3343a7@infradead.org Signed-off-by:
Randy Dunlap <rdunlap@infradead.org> Cc: Jason Baron <jbaron@akamai.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Andi Kleen authored
I like _ONCE warnings because it's guaranteed that they don't flood the log. During testing I find it useful to reset the state of the once warnings, so that I can rerun tests and see if they trigger again, or can guarantee that a test run always hits the same warnings. This patch adds a debugfs interface to reset all the _ONCE warnings so that they appear again: echo 1 > /sys/kernel/debug/clear_warn_once This is implemented by putting all the warning booleans into a special section, and clearing it. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20171017221455.6740-1-andi@firstfloor.org Signed-off-by:
Andi Kleen <ak@linux.intel.com> Tested-by:
Michael Ellerman <mpe@ellerman.id.au> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
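A hedged sketch of the section trick described above; the section and symbol names here are assumptions for illustration, not necessarily the exact ones used upstream:

    /* Each *_ONCE site keeps its "already printed" flag in a dedicated
     * section instead of ordinary .data ... */
    #define MY_WARN_ONCE(cond, fmt...) ({                                    \
            static bool __attribute__((__section__(".data.once"))) __warned; \
            bool __ret = !!(cond);                                           \
            if (__ret && !__warned) {                                        \
                    __warned = true;                                         \
                    pr_warn(fmt);                                            \
            }                                                                \
            __ret;                                                           \
    })

    /* ... so the debugfs write handler can clear every flag in one go. */
    extern char __start_once[], __end_once[];

    static void clear_once_flags(void)
    {
            memset(__start_once, 0, __end_once - __start_once);
    }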
-
Roman Gushchin authored
Right now there is no convenient way to check if a process is being coredumped at the moment. It might be necessary to recognize such state to prevent killing the process and getting a broken coredump. Writing a large core might take significant time, and the process is unresponsive during it, so it might be killed by timeout, if another process is monitoring and killing/restarting hanging tasks. We're getting a significant number of corrupted coredump files on machines in our fleet, just because processes are being killed by timeout in the middle of the core writing process. We do have a process health check, and some agent is responsible for restarting processes which are not responding for health check requests. Writing a large coredump to the disk can easily exceed the reasonable timeout (especially on an overloaded machine). This flag will allow the agent to distinguish processes which are being coredumped, extend the timeout for them, and let them produce a full coredump file. To provide an ability to detect if a process is in the state of being coredumped, we can expose a boolean CoreDumping flag in /proc/pid/status. Example:

  $ cat core.sh
  #!/bin/sh
  echo "|/usr/bin/sleep 10" > /proc/sys/kernel/core_pattern
  sleep 1000 &
  PID=$!
  cat /proc/$PID/status | grep CoreDumping
  kill -ABRT $PID
  sleep 1
  cat /proc/$PID/status | grep CoreDumping

  $ ./core.sh
  CoreDumping: 0
  CoreDumping: 1

[guro@fb.com: document CoreDumping flag in /proc/<pid>/status] Link: http://lkml.kernel.org/r/20170928135357.GA8470@castle.DHCP.thefacebook.com Link: http://lkml.kernel.org/r/20170920230634.31572-1-guro@fb.com Signed-off-by:
Roman Gushchin <guro@fb.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Ingo Molnar <mingo@kernel.org> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
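A hedged userspace sketch of how a health-check agent might consult the new flag before killing an unresponsive task; the helper name and error handling are illustrative:

    #include <stdio.h>
    #include <sys/types.h>

    /* Returns 1 if the task is currently writing a core dump, 0 if not,
     * and -1 if its status file cannot be read. */
    static int is_coredumping(pid_t pid)
    {
            char path[64], line[256];
            int val = 0, found = -1;
            FILE *f;

            snprintf(path, sizeof(path), "/proc/%d/status", (int)pid);
            f = fopen(path, "r");
            if (!f)
                    return -1;
            while (fgets(line, sizeof(line), f)) {
                    if (sscanf(line, "CoreDumping: %d", &val) == 1) {
                            found = val;
                            break;
                    }
            }
            fclose(f);
            return found;
    }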
-
- Nov 16, 2017
-
-
Rafael J. Wysocki authored
The check for "active" children in __pm_runtime_set_status(), when trying to set the parent device status to "suspended", doesn't really make sense, because in fact it is not invalid to set the status of a device with runtime PM disabled to "suspended" in any case. It is invalid to enable runtime PM for a device with its status set to "suspended" while its child_count reference counter is nonzero, but the check in __pm_runtime_set_status() doesn't really cover that situation. For this reason, drop the children check from __pm_runtime_set_status() and add a check against child_count reference counters of "suspended" devices to pm_runtime_enable(). Fixes: a8636c89 (PM / Runtime: Don't allow to suspend a device with an active child) Signed-off-by:
Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by:
Ulf Hansson <ulf.hansson@linaro.org> Reviewed-by:
Johan Hovold <johan@kernel.org>
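A hedged, simplified sketch of the check added to pm_runtime_enable() as described above; the real function's locking and bookkeeping are more involved than shown:

    /* Warn if runtime PM is being enabled for a device whose status is
     * "suspended" while it still counts active children (and children
     * are not ignored) -- the situation the old check failed to cover. */
    void pm_runtime_enable(struct device *dev)
    {
            unsigned long flags;

            spin_lock_irqsave(&dev->power.lock, flags);
            if (dev->power.disable_depth > 0)
                    dev->power.disable_depth--;
            WARN(!dev->power.disable_depth &&
                 dev->power.runtime_status == RPM_SUSPENDED &&
                 !dev->power.ignore_children &&
                 atomic_read(&dev->power.child_count) > 0,
                 "Enabling runtime PM for inactive device (%s) with active children\n",
                 dev_name(dev));
            spin_unlock_irqrestore(&dev->power.lock, flags);
    }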
-
Johan Hovold authored
Hub nodes and host-controller nodes with child nodes must specify values for #address-cells (1) and #size-cells (0). Also make the definition of the related reg property a bit more stringent, and add comments to the example source. Signed-off-by:
Johan Hovold <johan@kernel.org> Signed-off-by:
Rob Herring <robh@kernel.org>
-
Johan Hovold authored
Add quotation marks around the compatible string to avoid ambiguity due to following punctuation, and define the VID and PID components. Signed-off-by:
Johan Hovold <johan@kernel.org> Signed-off-by:
Rob Herring <robh@kernel.org>
-
Johan Hovold authored
The USB hub port-number range for USB 2.0 is 1-255 and not 1-31; the latter merely reflects an arbitrary limit set by the current Linux implementation. Note that for USB 3.1 hubs the valid range is 1-15. Increase the documented valid range in the binding to 255, which is the maximum allowed by the specifications. Signed-off-by:
Johan Hovold <johan@kernel.org> Signed-off-by:
Rob Herring <robh@kernel.org>
-
Johan Hovold authored
According to the OF Recommended Practice for USB, hub nodes shall be named "hub", but the example had mixed up the label and node names. Fix the node name and drop the redundant label. While at it, remove a newline and add a missing semicolon to the example source. Signed-off-by:
Johan Hovold <johan@kernel.org> Signed-off-by:
Rob Herring <robh@kernel.org>
-
Claudio Scordino authored
Signed-off-by:
Claudio Scordino <claudio@evidence.eu.com> Signed-off-by:
Luca Abeni <luca.abeni@santannapisa.it> Acked-by:
Daniel Bristot de Oliveira <bristot@redhat.com> Acked-by:
Peter Zijlstra <peterz@infradead.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Cc: linux-doc@vger.kernel.org Link: http://lkml.kernel.org/r/1510658366-28995-1-git-send-email-claudio@evidence.eu.com Signed-off-by:
Ingo Molnar <mingo@kernel.org>
-
Kemi Wang authored
This is the second step, which introduces a tunable interface that allows NUMA stats to be configured, for optimizing zone_statistics(), as suggested by Dave Hansen and Ying Huang.

When page allocation performance becomes a bottleneck and you can tolerate some possible tool breakage and decreased NUMA counter precision, you can do:
  echo 0 > /proc/sys/vm/numa_stat
In this case, NUMA counter updates are ignored. We can see about a *4.8%* (185->176) drop of CPU cycles per single page allocation and reclaim on Jesper's page_bench01 (single thread) and an *8.1%* (343->315) drop of CPU cycles per single page allocation and reclaim on Jesper's page_bench03 (88 threads) running on a 2-socket Broadwell-based server (88 threads, 126G memory). Benchmark link provided by Jesper D Brouer (increase loop count to 10000000): https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench

When page allocation performance is not a bottleneck and you want all tooling to work, you can do:
  echo 1 > /proc/sys/vm/numa_stat
This is the system default setting.

Many thanks to Michal Hocko, Dave Hansen, Ying Huang and Vlastimil Babka for comments that helped improve the original patch. [keescook@chromium.org: make sure mutex is a global static] Link: http://lkml.kernel.org/r/20171107213809.GA4314@beast Link: http://lkml.kernel.org/r/1508290927-8518-1-git-send-email-kemi.wang@intel.com Signed-off-by:
Kemi Wang <kemi.wang@intel.com> Signed-off-by:
Kees Cook <keescook@chromium.org> Reported-by:
Jesper Dangaard Brouer <brouer@redhat.com> Suggested-by:
Dave Hansen <dave.hansen@intel.com> Suggested-by:
Ying Huang <ying.huang@intel.com> Acked-by:
Vlastimil Babka <vbabka@suse.cz> Acked-by:
Michal Hocko <mhocko@suse.com> Cc: "Luis R . Rodriguez" <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christopher Lameter <cl@linux.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Aaron Lu <aaron.lu@intel.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
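A hedged sketch of the kind of sysctl handler this tunable implies, including the file-scope ("global static") mutex mentioned in the bracketed fix-up above; names and the enable/disable hook are illustrative:

    static DEFINE_MUTEX(vm_numa_stat_lock);
    int sysctl_vm_numa_stat = 1;    /* NUMA counters enabled by default */

    int sysctl_vm_numa_stat_handler(struct ctl_table *table, int write,
                                    void __user *buffer, size_t *length,
                                    loff_t *ppos)
    {
            int ret, oldval;

            mutex_lock(&vm_numa_stat_lock);
            oldval = sysctl_vm_numa_stat;
            ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
            if (!ret && write && oldval != sysctl_vm_numa_stat) {
                    /* flip zone_statistics() accounting on or off here,
                     * e.g. via a static key */
            }
            mutex_unlock(&vm_numa_stat_lock);
            return ret;
    }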
-
Levin, Alexander (Sasha Levin) authored
Fix up makefiles, remove references, and git rm kmemcheck. Link: http://lkml.kernel.org/r/20171007030159.22241-4-alexander.levin@verizon.com Signed-off-by:
Sasha Levin <alexander.levin@verizon.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Vegard Nossum <vegardno@ifi.uio.no> Cc: Pekka Enberg <penberg@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Alexander Potapenko <glider@google.com> Cc: Tim Hansen <devtimhansen@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Kirill A. Shutemov authored
Currently, we account page tables separately for each page table level, but that's redundant -- we only make use of the total memory allocated to page tables for the oom_badness calculation. We also provide the information to userspace, but it has dubious value there too. This patch switches page table accounting to a single counter: mm->pgtables_bytes is now used to account all page table levels. We use bytes, because the page table size may differ between levels of the page table tree. The change has a user-visible effect: we no longer have VmPMD and VmPUD reported in /proc/[pid]/status. Not sure if anybody uses them. (As an alternative, we can always report 0 kB for them.) The OOM-killer report is also slightly changed: we now report pgtables_bytes instead of nr_ptes, nr_pmd, nr_puds. Apart from reducing the number of counters per mm, the benefit is that we now calculate oom_badness() more correctly for machines which have different page table sizes depending on the level, or where page tables are less than a page in size. The only downside is debuggability, because we do not know which page table level could leak. But I do not remember many bugs that would be caught by separate counters, so I wouldn't lose sleep over this. [akpm@linux-foundation.org: fix mm/huge_memory.c] Link: http://lkml.kernel.org/r/20171006100651.44742-2-kirill.shutemov@linux.intel.com Signed-off-by:
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by:
Michal Hocko <mhocko@suse.com> [kirill.shutemov@linux.intel.com: fix build] Link: http://lkml.kernel.org/r/20171016150113.ikfxy3e7zzfvsr4w@black.fi.intel.com Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
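A hedged sketch of what the single byte-based counter looks like; the helper and field names follow the description above, but details such as which levels use which helpers are simplified:

    /* One counter for all page-table levels, kept in bytes so that
     * levels with differently sized tables are accounted correctly. */
    static inline void mm_inc_nr_ptes(struct mm_struct *mm)
    {
            atomic_long_add(PTRS_PER_PTE * sizeof(pte_t), &mm->pgtables_bytes);
    }

    static inline void mm_dec_nr_ptes(struct mm_struct *mm)
    {
            atomic_long_sub(PTRS_PER_PTE * sizeof(pte_t), &mm->pgtables_bytes);
    }

    /* Used e.g. by oom_badness() and /proc/[pid]/status reporting. */
    static inline unsigned long mm_pgtables_bytes(const struct mm_struct *mm)
    {
            return atomic_long_read(&mm->pgtables_bytes);
    }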
-
Kirill A. Shutemov authored
On a machine with 5-level paging support a process can allocate a significant amount of memory and stay unnoticed by the oom-killer and memory cgroup. The trick is to allocate a lot of PUD page tables: we don't account PUD page tables, only PMD and PTE. We already addressed the same issue for PMD page tables, see commit dc6c9a35 ("mm: account pmd page tables to the process"). The introduction of 5-level paging brings the same issue for PUD page tables. The patch expands accounting to the PUD level. [kirill.shutemov@linux.intel.com: s/pmd_t/pud_t/] Link: http://lkml.kernel.org/r/20171004074305.x35eh5u7ybbt5kar@black.fi.intel.com [heiko.carstens@de.ibm.com: s390/mm: fix pud table accounting] Link: http://lkml.kernel.org/r/20171103090551.18231-1-heiko.carstens@de.ibm.com Link: http://lkml.kernel.org/r/20171002080427.3320-1-kirill.shutemov@linux.intel.com Signed-off-by:
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by:
Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by:
Rik van Riel <riel@redhat.com> Acked-by:
Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Jérôme Glisse authored
This patch only affects users of the mmu_notifier->invalidate_range callback, i.e. device drivers related to ATS/PASID, CAPI, IOMMUv2, SVM, ..., and it is an optimization for those users. Everyone else is unaffected by it. When clearing a pte/pmd we are given a choice to notify the event under the page table lock (the notify versions of the *_clear_flush helpers do call mmu_notifier_invalidate_range). But that notification is not necessary in all cases. This patch removes almost all cases where it is useless to have a call to mmu_notifier_invalidate_range before mmu_notifier_invalidate_range_end. It also adds documentation to all those cases explaining why. Below is a more in-depth analysis of why this is fine to do:

For a secondary TLB (non-CPU TLB) like an IOMMU TLB or a device TLB (when the device uses something like ATS/PASID to get the IOMMU to walk the CPU page table to access a process virtual address space), there are only two cases when you need to notify those secondary TLBs while holding the page table lock when clearing a pte/pmd:
  A) the page backing the address is freed before mmu_notifier_invalidate_range_end
  B) a page table entry is updated to point to a new page (COW, write fault on zero page, __replace_page(), ...)

Case A is obvious: you do not want to take the risk of the device writing to a page that might now be used by something completely different. Case B is more subtle. For correctness it requires the following sequence to happen:
  - take the page table lock
  - clear the page table entry and notify (pmd/pte_huge_clear_flush_notify())
  - set the page table entry to point to the new page

If clearing the page table entry is not followed by a notify before setting the new pte/pmd value, then you can break a memory model like C11 or C++11 for the device. Consider the following scenario (the device uses a feature similar to ATS/PASID): two addresses, addrA and addrB, such that |addrA - addrB| >= PAGE_SIZE; we assume they are write-protected for COW (other cases of B apply too):
[Time N]   -----------------------------------------------------------------
CPU-thread-0  {try to write to addrA}
CPU-thread-1  {try to write to addrB}
CPU-thread-2  {}
CPU-thread-3  {}
DEV-thread-0  {read addrA and populate device TLB}
DEV-thread-2  {read addrB and populate device TLB}
[Time N+1] -----------------------------------------------------------------
CPU-thread-0  {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
CPU-thread-1  {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
CPU-thread-2  {}
CPU-thread-3  {}
DEV-thread-0  {}
DEV-thread-2  {}
[Time N+2] -----------------------------------------------------------------
CPU-thread-0  {COW_step1: {update page table point to new page for addrA}}
CPU-thread-1  {COW_step1: {update page table point to new page for addrB}}
CPU-thread-2  {}
CPU-thread-3  {}
DEV-thread-0  {}
DEV-thread-2  {}
[Time N+3] -----------------------------------------------------------------
CPU-thread-0  {preempted}
CPU-thread-1  {preempted}
CPU-thread-2  {write to addrA which is a write to new page}
CPU-thread-3  {}
DEV-thread-0  {}
DEV-thread-2  {}
[Time N+3] -----------------------------------------------------------------
CPU-thread-0  {preempted}
CPU-thread-1  {preempted}
CPU-thread-2  {}
CPU-thread-3  {write to addrB which is a write to new page}
DEV-thread-0  {}
DEV-thread-2  {}
[Time N+4] -----------------------------------------------------------------
CPU-thread-0  {preempted}
CPU-thread-1  {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
CPU-thread-2  {}
CPU-thread-3  {}
DEV-thread-0  {}
DEV-thread-2  {}
[Time N+5] -----------------------------------------------------------------
CPU-thread-0  {preempted}
CPU-thread-1  {}
CPU-thread-2  {}
CPU-thread-3  {}
DEV-thread-0  {read addrA from old page}
DEV-thread-2  {read addrB from new page}

So here, because at time N+2 the cleared page table entry was not paired with a notification to invalidate the secondary TLB, the device sees the new value for addrB before seeing the new value for addrA. This breaks total memory ordering for the device. When changing a pte to write-protect it, or to point to a new write-protected page with the same content (KSM), it is OK to delay the invalidate_range callback to mmu_notifier_invalidate_range_end() outside the page table lock. This is true even if the thread doing the page table update is preempted right after releasing the page table lock, before calling mmu_notifier_invalidate_range_end(). Thanks to Andrea for thinking of a problematic scenario for COW. [jglisse@redhat.com: v2] Link: http://lkml.kernel.org/r/20171017031003.7481-2-jglisse@redhat.com Link: http://lkml.kernel.org/r/20170901173011.10745-1-jglisse@redhat.com Signed-off-by:
Jérôme Glisse <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Alistair Popple <alistair@popple.id.au> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
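A hedged sketch of the correct ordering for case B: clear and notify under the page table lock before installing the new entry. The surrounding COW machinery is reduced to the bare sequence, and the function name is made up:

    static void cow_install_new_page(struct vm_area_struct *vma, unsigned long addr,
                                     pte_t *ptep, pmd_t *pmd, struct page *new_page)
    {
            struct mm_struct *mm = vma->vm_mm;
            spinlock_t *ptl = pte_lockptr(mm, pmd);

            mmu_notifier_invalidate_range_start(mm, addr, addr + PAGE_SIZE);
            spin_lock(ptl);
            /* clear + flush + ->invalidate_range, all under the lock ... */
            ptep_clear_flush_notify(vma, addr, ptep);
            /* ... so the secondary TLB cannot see the new page "early" */
            set_pte_at(mm, addr, ptep, mk_pte(new_page, vma->vm_page_prot));
            spin_unlock(ptl);
            mmu_notifier_invalidate_range_end(mm, addr, addr + PAGE_SIZE);
    }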
-
Yafang Shao authored
The vm direct limit setting must be set greater than the vm background limit setting. Otherwise, print a warning to help the operator figure out that the vm dirtiness settings are in an illogical state. Link: http://lkml.kernel.org/r/1506592464-30962-1-git-send-email-laoar.shao@gmail.com Signed-off-by:
Yafang Shao <laoar.shao@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
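A hedged sketch of the sanity check being described; the exact warning text and call site in the writeback code may differ:

    /* Warn once if the background threshold is not below the direct
     * (foreground) threshold, which would make the settings illogical. */
    static void sanity_check_dirty_limits(unsigned long thresh,
                                          unsigned long bg_thresh)
    {
            if (bg_thresh >= thresh)
                    pr_warn_once("vm dirtiness settings: background threshold (%lu) >= direct threshold (%lu)\n",
                                 bg_thresh, thresh);
    }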
-
- Nov 15, 2017
-
-
Julia Lawall authored
The wiki is no longer available. Signed-off-by:
Julia Lawall <julia.lawall@lip6.fr> Signed-off-by:
Masahiro Yamada <yamada.masahiro@socionext.com>
-
Eric Biggers authored
When keyctl_read() is passed a buffer that is too small, the behavior is inconsistent. Some key types will fill as much of the buffer as possible, while others won't copy anything. Moreover, the in-kernel documentation contradicted the man page on this point. Update the in-kernel documentation to say that this point is unspecified. Signed-off-by:
Eric Biggers <ebiggers@google.com> Signed-off-by:
David Howells <dhowells@redhat.com>
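A hedged userspace illustration of why the unspecified short-buffer behaviour matters: portable callers should size the buffer from the length that keyctl_read() reports rather than relying on a partial fill. This uses the libkeyutils keyctl_read() call; the helper name is made up, and a key whose payload changes between the two calls is not handled:

    #include <keyutils.h>
    #include <stdlib.h>

    /* Read a key's payload into a freshly allocated buffer, or return
     * NULL on error.  *lenp receives the payload length. */
    static char *read_key_payload(key_serial_t key, long *lenp)
    {
            long len = keyctl_read(key, NULL, 0);   /* query required size */
            char *buf;

            if (len < 0)
                    return NULL;
            buf = malloc(len);
            if (buf && keyctl_read(key, buf, len) < 0) {
                    free(buf);
                    buf = NULL;
            }
            *lenp = len;
            return buf;
    }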
-
Baolin Wang authored
This patch adds the binding documentation for Spreadtrum SC27xx series RTC device. Signed-off-by:
Baolin Wang <baolin.wang@spreadtrum.com> Acked-by:
Rob Herring <robh@kernel.org> Signed-off-by:
Alexandre Belloni <alexandre.belloni@free-electrons.com>
-
Yoshihiro Shimoda authored
Add device tree bindings for the PWM controller found on R-Car D3 SoCs. Signed-off-by:
Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> Reviewed-by:
Geert Uytterhoeven <geert+renesas@glider.be> Acked-by:
Rob Herring <robh@kernel.org> Signed-off-by:
Thierry Reding <thierry.reding@gmail.com>
-
Luca Miccio authored
BFQ currently creates, and updates, its own instance of the whole set of blkio statistics that cfq creates. Yet, from the comments of Tejun Heo in [1], it turned out that most of these statistics are meant/useful only for debugging. This commit makes BFQ create these debugging statistics only if the option CONFIG_DEBUG_BLK_CGROUP is set. By doing so, this commit also enables BFQ to enjoy a high performance boost. The reason is that, if CONFIG_DEBUG_BLK_CGROUP is not set, then BFQ has to update far fewer statistics, and, in particular, not the heaviest to update. To give an idea of the benefits, if CONFIG_DEBUG_BLK_CGROUP is not set, then, on an Intel i7-4850HQ, and with 8 threads doing random I/O in parallel on null_blk (configured with 0 latency), the throughput of BFQ grows from 310 to 400 KIOPS (+30%). We have measured similar or even much higher boosts with other CPUs: e.g., +45% with an ARM Cortex-A53 octa-core. Our results have been obtained, and can be reproduced very easily, with the script in [1]. [1] https://www.spinics.net/lists/linux-block/msg18943.html Suggested-by:
Tejun Heo <tj@kernel.org> Suggested-by:
Ulf Hansson <ulf.hansson@linaro.org> Tested-by:
Lee Tibbert <lee.tibbert@gmail.com> Tested-by:
Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by:
Luca Miccio <lucmiccio@gmail.com> Signed-off-by:
Paolo Valente <paolo.valente@linaro.org> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Paolo Valente authored
bfq invokes various blkg_*stats_* functions to update the statistics contained in the special files blkio.bfq.* in the blkio controller groups, i.e., the I/O accounting related to the proportional-share policy provided by bfq. The execution of these functions takes a considerable percentage, about 40%, of the total per-request execution time of bfq (i.e., of the sum of the execution time of all the bfq functions that have to be executed to process an I/O request from its creation to its destruction). This reduces the request-processing rate sustainable by bfq noticeably, even on a multicore CPU. In fact, the bfq functions that invoke blkg_*stats_* functions cannot be executed in parallel with the rest of the code of bfq, because both are executed under the same per-device scheduler lock. To reduce this slowdown, this commit moves, wherever possible, the invocation of these functions (more precisely, of the bfq functions that invoke blkg_*stats_* functions) outside the critical sections protected by the scheduler lock. With this change, and with all blkio.bfq.* statistics enabled, the throughput grows, e.g., from 250 to 310 KIOPS (+25%) on an Intel i7-4850HQ, in the case of 8 threads doing random I/O in parallel on null_blk, with the latter configured with 0 latency. We obtained the same or higher throughput boosts, up to +30%, with other processors (some figures are reported in the documentation). For our tests, we used the script [1], with which our results can be easily reproduced. NOTE. This commit still protects the invocation of blkg_*stats_* functions with the request_queue lock, because the group these functions are invoked on may otherwise disappear before or while these functions are executed. Fortunately, tests without even this lock show, by difference, that the serialization caused by this lock has little impact (at most ~5% of throughput reduction). [1] https://github.com/Algodev-github/IOSpeed Tested-by:
Lee Tibbert <lee.tibbert@gmail.com> Tested-by:
Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by:
Paolo Valente <paolo.valente@linaro.org> Signed-off-by:
Luca Miccio <lucmiccio@gmail.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Paolo Valente authored
We have investigated more deeply the performance of BFQ, in terms of the number of IOPS that can be processed by the CPU when BFQ is used as I/O scheduler. In more detail, using the script [1], we have measured the number of IOPS reached on top of a null block device configured with zero latency, as a function of the workload (sequential read, sequential write, random read, random write) and of the system (we considered desktops, laptops and embedded systems). Based on the resulting figures, with this commit we update the current, conservative IOPS range reported in the BFQ documentation. In particular, the documentation now reports, for each of three different systems, the lowest number of IOPS obtained for that system with the above test (namely, the value obtained with the workload leading to the lowest IOPS). [1] https://github.com/Algodev-github/IOSpeed Reviewed-by:
Lee Tibbert <lee.tibbert@gmail.com> Signed-off-by:
Paolo Valente <paolo.valente@linaro.org> Signed-off-by:
Luca Miccio <lucmiccio@gmail.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- Nov 14, 2017
-
-
Harald Welte authored
According to https://www.mail-archive.com/netdev@vger.kernel.org/msg177411.html there is a status page available at http://vger.kernel.org/~davem/net-next.html to obtain the current status of the net-next tree. Let's add this information to the netdev FAQ. Signed-off-by:
Harald Welte <laforge@gnumonks.org> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Harald Welte authored
* clarify specification references for v0/v1
* add section "APN vs. Network device"
* add section "Local GTP-U entity and tunnel identification"
Signed-off-by:
Andreas Schultz <aschultz@tpip.net> Signed-off-by:
Harald Welte <laforge@gnumonks.org> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Nov 13, 2017
-
-
Rob Herring authored
Consumers of usb-nop-xceiv use the phy binding, but #phy-cells is missing from the binding, probably because this binding predates the common phy binding. So add #phy-cells as a required property. This should not break any users, as a missing property should be treated as 0 cells. Signed-off-by:
Rob Herring <robh@kernel.org>
-
Chris Gorman authored
Fix the reference to the file HD-Audio-Models.rst, which has been moved to hd-audio/models.rst. Signed-off-by:
Chris Gorman <chrisjohgorman@gmail.com> Signed-off-by:
Takashi Iwai <tiwai@suse.de>
-
David Howells authored
The documentation that describes the #-prefix and the %-prefix used when specifying the source to mount has the descriptions the wrong way round. Switch them over. Reported-by:
Marc Dionne <marc.dionne@auristor.com> Signed-off-by:
David Howells <dhowells@redhat.com>
-