  1. Aug 09, 2019
    • arm64: mm: Modify calculation of VMEMMAP_SIZE · ce3aaed8
      Steve Capper authored
      
      
      In a later patch we will need to have a slightly larger VMEMMAP region
      to accommodate boot time selection between 48/52-bit kernel VAs.
      
      This patch modifies the formula for computing VMEMMAP_SIZE to depend
      explicitly on the PAGE_OFFSET and start of kernel addressable memory.
      (This allows for a slightly larger direct linear map in future).
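      As a sketch of the shape of the change (names such as _PAGE_END and
      VA_BITS_MIN are assumed from the surrounding series, not quoted from
      the patch), the size becomes a function of the linear map span: one
      struct page (at most 2^STRUCT_PAGE_MAX_SHIFT bytes) per page of
      directly mapped memory:

          VMEMMAP_SIZE = (linear map span) >> (PAGE_SHIFT - STRUCT_PAGE_MAX_SHIFT)
                       = (_PAGE_END(VA_BITS_MIN) - PAGE_OFFSET)
                             >> (PAGE_SHIFT - STRUCT_PAGE_MAX_SHIFT)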
      
      Signed-off-by: Steve Capper <steve.capper@arm.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Will Deacon <will@kernel.org>
    • arm64: mm: Separate out vmemmap · c8b6d2cc
      Steve Capper authored
      
      
      vmemmap is a preprocessor definition that depends on a variable,
      memstart_addr. In a later patch we will need to expand the size of
      the VMEMMAP region and optionally modify vmemmap depending upon
      whether or not hardware support is available for 52-bit virtual
      addresses.
      
      This patch changes vmemmap to be a variable. As the old definition
      depended on a variable load, this should not affect performance
      noticeably.
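      A minimal sketch of the change (the old macro is as it appeared in
      arch/arm64/include/asm/memory.h at the time):

          /* before: a macro, re-evaluating memstart_addr on each use */
          #define vmemmap ((struct page *)VMEMMAP_START - (memstart_addr >> PAGE_SHIFT))

          /* after: a variable, computed once during early boot */
          struct page *vmemmap;
          /* ... in early init:
           * vmemmap = ((struct page *)VMEMMAP_START) - (memstart_addr >> PAGE_SHIFT);
           */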
      
      Signed-off-by: Steve Capper <steve.capper@arm.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Will Deacon <will@kernel.org>
    • arm64: mm: Logic to make offset_ttbr1 conditional · c812026c
      Steve Capper authored
      
      
      When running with a 52-bit userspace VA and a 48-bit kernel VA we offset
      ttbr1_el1 to allow the kernel pagetables with a 52-bit PTRS_PER_PGD to
      be used for both userspace and kernel.
      
      Moving on to a 52-bit kernel VA we no longer require this offset to
      ttbr1_el1 should we be running on a system with HW support for 52-bit
      VAs.
      
      This patch introduces conditional logic to offset_ttbr1 to query
      SYS_ID_AA64MMFR2_EL1 whenever 52-bit VAs are selected. If there is HW
      support for 52-bit VAs then the ttbr1 offset is skipped.
      
      We choose to read a system register rather than vabits_actual because
      offset_ttbr1 can be called in places where the kernel data is not
      actually mapped.
      
      Calls to offset_ttbr1 appear to be made from rarely called code paths so
      this extra logic is not expected to adversely affect performance.
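      In C terms, the check performed is roughly the following sketch (the
      real logic lives in the offset_ttbr1 assembler macro; the VARange
      field position, bits [19:16], is taken from the ID_AA64MMFR2_EL1
      register layout):

          static inline bool hw_supports_52bit_va(void)
          {
                  u64 mmfr2;

                  /* safe even before kernel data is mapped: no memory access */
                  asm volatile("mrs %0, id_aa64mmfr2_el1" : "=r" (mmfr2));
                  /* VARange != 0 means 52-bit VAs are supported */
                  return ((mmfr2 >> 16) & 0xf) != 0;
          }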
      
      Signed-off-by: Steve Capper <steve.capper@arm.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Will Deacon <will@kernel.org>
    • arm64: mm: Introduce vabits_actual · 5383cc6e
      Steve Capper authored
      
      
      In order to support 52-bit kernel addresses detectable at boot time, one
      needs to know the actual VA_BITS detected. A new variable vabits_actual
      is introduced in this commit and employed for the KVM hypervisor layout,
      KASAN, fault handling and phys-to/from-virt translation where there
      would normally be compile time constants.
      
      In order to maintain performance in phys_to_virt, another variable
      physvirt_offset is introduced.
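      A sketch of the idea, assuming the names introduced by the commit:
      the variable parts of the translation are folded into one boot-time
      offset so the hot path stays a single subtraction:

          s64 physvirt_offset;    /* set once: PHYS_OFFSET - PAGE_OFFSET */

          #define __phys_to_virt(x) ((unsigned long)((x) - physvirt_offset))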
      
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Steve Capper <steve.capper@arm.com>
      Signed-off-by: Will Deacon <will@kernel.org>
    • arm64: mm: Introduce VA_BITS_MIN · 90ec95cd
      Steve Capper authored
      
      
      In order to support 52-bit kernel addresses detectable at boot time, the
      kernel needs to know the most conservative VA_BITS possible should it
      need to fall back to this quantity due to lack of hardware support.
      
      A new compile time constant VA_BITS_MIN is introduced in this patch and
      it is employed in the KASAN end address, KASLR, and EFI stub.
      
      For Arm, if 52-bit VA support is unavailable, the fallback is 48 bits.
      
      In other words: VA_BITS_MIN = min (48, VA_BITS)
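      Expressed as a preprocessor sketch consistent with the formula above:

          #if VA_BITS > 48
          #define VA_BITS_MIN     (48)
          #else
          #define VA_BITS_MIN     (VA_BITS)
          #endif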
      
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Steve Capper <steve.capper@arm.com>
      Signed-off-by: Will Deacon <will@kernel.org>
    • arm64: dump: De-constify VA_START and KASAN_SHADOW_START · 99426e5e
      Steve Capper authored
      
      
      The kernel page table dumper assumes that the placement of VA regions is
      constant and determined at compile time. As we are about to introduce
      variable VA logic, we need to be able to determine certain regions at
      boot time.
      
      Specifically the VA_START and KASAN_SHADOW_START will depend on whether
      or not the system is booted with 52-bit kernel VAs.
      
      This patch adds logic to the kernel page table dumper so that these
      regions can be computed at boot time.
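      A sketch of the pattern (marker and function names are illustrative,
      not quoted from the patch): the marker table loses its const
      qualifier and the variable start addresses are filled in by an
      initcall:

          static struct addr_marker address_markers[] = {
                  { 0 /* set at boot */, "Kasan shadow start" },
                  { 0 /* set at boot */, "Linear mapping" },
                  /* ... */
          };

          static int __init ptdump_init(void)
          {
                  address_markers[KASAN_START_NR].start_address = KASAN_SHADOW_START;
                  /* ... */
                  return 0;
          }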
      
      Signed-off-by: Steve Capper <steve.capper@arm.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Will Deacon <will@kernel.org>
    • arm64: kasan: Switch to using KASAN_SHADOW_OFFSET · 6bd1d0be
      Steve Capper authored
      
      
      KASAN_SHADOW_OFFSET is a constant that is supplied to gcc as a command
      line argument and affects the codegen of the inline address sanitiser.
      
      Essentially, for an example memory access:
          *ptr1 = val;
      The compiler will insert logic similar to the below:
          shadowValue = *(ptr1 >> KASAN_SHADOW_SCALE_SHIFT + KASAN_SHADOW_OFFSET)
          if (somethingWrong(shadowValue))
              flagAnError();
      
      This code sequence is inserted into many places, thus
      KASAN_SHADOW_OFFSET is essentially baked into many places in the kernel
      text.
      
      If we want to run a single kernel binary with multiple address spaces,
      then we need to do this with KASAN_SHADOW_OFFSET fixed.
      
      Thankfully, due to the way the KASAN_SHADOW_OFFSET is used to provide
      shadow addresses we know that the end of the shadow region is constant
      w.r.t. VA space size:
          KASAN_SHADOW_END = ~0 >> KASAN_SHADOW_SCALE_SHIFT + KASAN_SHADOW_OFFSET
      
      This means that if we increase the size of the VA space, the start of
      the KASAN region expands into lower addresses whilst the end of the
      KASAN region is fixed.
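      To spell out why the end is fixed (a worked restatement of the
      formula above, not new code from the patch): the compiler maps an
      address a to shadow via

          shadow(a) = (a >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET

      The highest address is ~0 regardless of VA size, so

          KASAN_SHADOW_END = shadow(~0)

      contains no VA_BITS term, while the start,

          KASAN_SHADOW_START = shadow(~0 << vabits)

      moves to lower addresses as vabits grows.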
      
      Currently the arm64 code computes KASAN_SHADOW_OFFSET at build time via
      build scripts with the VA size used as a parameter. (There are build
      time checks in the C code too to ensure that expected values are being
      derived). It is sufficient, and indeed is a simplification, to remove
      the build scripts (and build time checks) entirely and instead provide
      KASAN_SHADOW_OFFSET values.
      
      This patch removes the logic to compute the KASAN_SHADOW_OFFSET in the
      arm64 Makefile, and instead we adopt the approach used by x86 to supply
      offset values in Kconfig. To help debug/develop future VA space changes,
      the Makefile logic has been preserved in a script file in the arm64
      Documentation folder.
      
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Steve Capper <steve.capper@arm.com>
      Signed-off-by: Will Deacon <will@kernel.org>
    • arm64: mm: Flip kernel VA space · 14c127c9
      Steve Capper authored
      
      
      In order to allow for a KASAN shadow that changes size at boot time, one
      must fix the KASAN_SHADOW_END for both 48 & 52-bit VAs and "grow" the
      start address. Also, it is highly desirable to maintain the same
      function addresses in the kernel .text between VA sizes. Both of these
      requirements necessitate flipping the kernel address space halves
      such that the direct linear map occupies the lower addresses.
      
      This patch puts the direct linear map in the lower addresses of the
      kernel VA range and everything else in the higher ranges.
      
      We need to adjust:
       *) KASAN shadow region placement logic,
       *) KASAN_SHADOW_OFFSET computation logic,
       *) virt_to_phys, phys_to_virt checks,
       *) page table dumper.
      
      These are all small changes, that need to take place atomically, so they
      are bundled into this commit.
      
      As part of the re-arrangement, a guard region of 2MB (to preserve
      alignment for fixed map) is added after the vmemmap. Otherwise the
      vmemmap could intersect with IS_ERR pointers.
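      A sketch of the flipped constants for a 48-bit kernel VA (addresses
      shown for illustration; macro names follow the existing headers and
      may not match the patch exactly):

          /* linear map now starts at the bottom of the kernel VA range */
          #define PAGE_OFFSET  (UL(0xffffffffffffffff) - (UL(1) << VA_BITS) + 1)
          /* everything else (kasan, vmemmap, vmalloc, .text) sits above */
          #define VA_START     (UL(0xffffffffffffffff) - (UL(1) << (VA_BITS - 1)) + 1)

      i.e. for VA_BITS = 48, the linear map covers 0xffff000000000000 up
      to 0xffff800000000000, and the remaining regions occupy the upper
      half.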
      
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Steve Capper <steve.capper@arm.com>
      Signed-off-by: Will Deacon <will@kernel.org>
    • arm64: mm: Remove bit-masking optimisations for PAGE_OFFSET and VMEMMAP_START · 9cb1c5dd
      Steve Capper authored
      
      
      Currently there are assumptions about the alignment of VMEMMAP_START
      and PAGE_OFFSET that won't be valid after this series is applied.
      
      These assumptions are in the form of bitwise operators being used
      instead of addition and subtraction when calculating addresses.
      
      This patch replaces these bitwise operators with addition/subtraction.
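      The kind of substitution made, as a hypothetical example (not a line
      from the patch): constructs such as

          vaddr = paddr_offset | PAGE_OFFSET;   /* assumes PAGE_OFFSET alignment */

      become

          vaddr = paddr_offset + PAGE_OFFSET;   /* correct for any PAGE_OFFSET */

      which is equivalent whenever the old alignment held, and remains
      correct once it no longer does.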
      
      Signed-off-by: Steve Capper <steve.capper@arm.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Will Deacon <will@kernel.org>
  2. Aug 03, 2019
  3. Aug 02, 2019
    • arm64: Make debug exception handlers visible from RCU · d8bb6718
      Masami Hiramatsu authored
      
      
      Make debug exceptions visible to RCU so that synchronize_rcu()
      correctly tracks the debug exception handlers.
      
      This also introduces sanity checks for user-mode exceptions, similar
      to x86's ist_enter()/ist_exit().
      
      A debug exception can interrupt the idle task. For example, a
      warning fires if we put a kprobe on a function called from the idle
      task, as shown below. The warning message blames rcu_read_lock(),
      but what it actually means is that RCU has lost track of a context
      that is already in NMI/IRQ.
      
        /sys/kernel/debug/tracing # echo p default_idle_call >> kprobe_events
        /sys/kernel/debug/tracing # echo 1 > events/kprobes/enable
        /sys/kernel/debug/tracing # [  135.122237]
        [  135.125035] =============================
        [  135.125310] WARNING: suspicious RCU usage
        [  135.125581] 5.2.0-08445-g9187c508bdc7 #20 Not tainted
        [  135.125904] -----------------------------
        [  135.126205] include/linux/rcupdate.h:594 rcu_read_lock() used illegally while idle!
        [  135.126839]
        [  135.126839] other info that might help us debug this:
        [  135.126839]
        [  135.127410]
        [  135.127410] RCU used illegally from idle CPU!
        [  135.127410] rcu_scheduler_active = 2, debug_locks = 1
        [  135.128114] RCU used illegally from extended quiescent state!
        [  135.128555] 1 lock held by swapper/0/0:
        [  135.128944]  #0: (____ptrval____) (rcu_read_lock){....}, at: call_break_hook+0x0/0x178
        [  135.130499]
        [  135.130499] stack backtrace:
        [  135.131192] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.2.0-08445-g9187c508bdc7 #20
        [  135.131841] Hardware name: linux,dummy-virt (DT)
        [  135.132224] Call trace:
        [  135.132491]  dump_backtrace+0x0/0x140
        [  135.132806]  show_stack+0x24/0x30
        [  135.133133]  dump_stack+0xc4/0x10c
        [  135.133726]  lockdep_rcu_suspicious+0xf8/0x108
        [  135.134171]  call_break_hook+0x170/0x178
        [  135.134486]  brk_handler+0x28/0x68
        [  135.134792]  do_debug_exception+0x90/0x150
        [  135.135051]  el1_dbg+0x18/0x8c
        [  135.135260]  default_idle_call+0x0/0x44
        [  135.135516]  cpu_startup_entry+0x2c/0x30
        [  135.135815]  rest_init+0x1b0/0x280
        [  135.136044]  arch_call_rest_init+0x14/0x1c
        [  135.136305]  start_kernel+0x4d4/0x500
        [  135.136597]
      
      Making debug exceptions visible to RCU fixes this warning.
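      A sketch of the enter/exit pairing this adds around the handler
      (modelled on x86's ist_enter()/ist_exit(); the exact arm64 function
      names are assumptions):

          static void debug_exception_enter(struct pt_regs *regs)
          {
                  if (user_mode(regs)) {
                          RCU_LOCKDEP_WARN(!rcu_is_watching(),
                                           "entry code didn't wake RCU");
                  } else {
                          /* may have interrupted idle: tell RCU */
                          rcu_nmi_enter();
                  }
          }

          static void debug_exception_exit(struct pt_regs *regs)
          {
                  if (!user_mode(regs))
                          rcu_nmi_exit();
          }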
      
      Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Acked-by: Paul E. McKenney <paulmck@linux.ibm.com>
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Will Deacon <will@kernel.org>
    • arm64: kprobes: Recover pstate.D in single-step exception handler · b3980e48
      Masami Hiramatsu authored
      
      
      kprobes manipulates the interrupted PSTATE for single-stepping and
      doesn't restore it. Thus, if we put a kprobe where pstate.D (debug)
      is masked, the mask will be cleared after the kprobe hits.
      
      Moreover, in the most complicated case, this can lead to a kernel
      crash with the message below when a nested kprobe hits.
      
      [  152.118921] Unexpected kernel single-step exception at EL1
      
      When the 1st kprobe hits, do_debug_exception() will be called.
      At this point, the debug exception (= pstate.D) must be masked (= 1).
      But if another kprobe hits before the single-step of the first
      kprobe (e.g. inside the user pre_handler), it unmasks the debug
      exception (pstate.D = 0) and returns.
      Then, when the 1st kprobe sets up the single-step, it saves the
      current DAIF, masks DAIF, enables single-step, and restores DAIF.
      However, since the "D" flag in DAIF was cleared by the 2nd kprobe,
      the single-step exception fires as soon as DAIF is restored.
      
      This was introduced by commit 7419333f ("arm64: kprobe:
      Always clear pstate.D in breakpoint exception handler").
      
      To solve this issue, this patch stores all DAIF bits and restores
      them after single-stepping.
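      A sketch of the save/restore change (helper shapes follow the arm64
      kprobes code of the era; treat details as assumptions):

          /* before: only the I bit was preserved around the single-step */
          /* after: preserve the whole D,A,I,F field */
          static void kprobes_save_local_irqflag(struct kprobe_ctlblk *kcb,
                                                 struct pt_regs *regs)
          {
                  kcb->saved_irqflag = regs->pstate & DAIF_MASK;
          }

          static void kprobes_restore_local_irqflag(struct kprobe_ctlblk *kcb,
                                                    struct pt_regs *regs)
          {
                  regs->pstate &= ~DAIF_MASK;
                  regs->pstate |= kcb->saved_irqflag;
          }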
      
      Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Fixes: 7419333f ("arm64: kprobe: Always clear pstate.D in breakpoint exception handler")
      Reviewed-by: James Morse <james.morse@arm.com>
      Tested-by: James Morse <james.morse@arm.com>
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Will Deacon <will@kernel.org>
  4. Aug 01, 2019
  5. Jul 31, 2019
  6. Jul 30, 2019
    • arm64: compat: vdso: Use legacy syscalls as fallback · 33a58980
      Thomas Gleixner authored
      
      
      The generic VDSO implementation uses the Y2038-safe clock_gettime64()
      and clock_getres_time64() syscalls as fallback for the 32-bit VDSO.
      This breaks seccomp setups because these syscalls might not (yet) be
      allowed.
      
      Implement the 32bit variants which use the legacy syscalls and select the
      variant in the core library.
      
      The 64bit time variants are not removed because they are required for the
      time64 based vdso accessors.
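      A sketch of a legacy fallback in the shape the generic vDSO expects
      (register assignments follow the arm compat syscall convention; the
      implementation in the patch may differ):

          static __always_inline
          long clock_gettime32_fallback(clockid_t _clkid,
                                        struct old_timespec32 *_ts)
          {
                  register struct old_timespec32 *ts asm("r1") = _ts;
                  register clockid_t clkid asm("r0") = _clkid;
                  register long ret asm("r0");
                  register long nr asm("r7") = __NR_clock_gettime;

                  asm volatile("swi #0"
                               : "=r" (ret)
                               : "r" (clkid), "r" (ts), "r" (nr)
                               : "memory");
                  return ret;
          }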
      
      Fixes: 00b26474 ("lib/vdso: Provide generic VDSO implementation")
      Reported-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reported-by: Paul Bolle <pebolle@tiscali.nl>
      Suggested-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Link: https://lkml.kernel.org/r/20190728131648.971361611@linutronix.de
    • x86/vdso/32: Use 32bit syscall fallback · d2f5d3fa
      Thomas Gleixner authored
      
      
      The generic VDSO implementation uses the Y2038-safe clock_gettime64()
      and clock_getres_time64() syscalls as fallback for the 32-bit VDSO.
      This breaks seccomp setups because these syscalls might not (yet) be
      allowed.
      
      Implement the 32bit variants which use the legacy syscalls and select the
      variant in the core library.
      
      The 64bit time variants are not removed because they are required for the
      time64 based vdso accessors.
      
      Fixes: 7ac87074 ("x86/vdso: Switch to generic vDSO implementation")
      Reported-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reported-by: Paul Bolle <pebolle@tiscali.nl>
      Suggested-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Reviewed-by: Andy Lutomirski <luto@kernel.org>
      Link: https://lkml.kernel.org/r/20190728131648.879156507@linutronix.de
    • powerpc/spe: Mark expected switch fall-throughs · 7db57e77
      Michael Ellerman authored
      
      
      Mark switch cases where we are expecting to fall through.
      
      Fixes errors such as below, seen with mpc85xx_defconfig:
      
        arch/powerpc/kernel/align.c: In function 'emulate_spe':
        arch/powerpc/kernel/align.c:178:8: error: this statement may fall through
          ret |= __get_user_inatomic(temp.v[3], p++);
              ^~
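      The annotation style of the era was a comment that the compiler's
      -Wimplicit-fallthrough warning recognises (a sketch; the case bodies
      are abridged from the error message above, not the full function):

          switch (nb) {
          case 8:
                  ret |= __get_user_inatomic(temp.v[0], p++);
                  ret |= __get_user_inatomic(temp.v[1], p++);
                  /* fall through */
          case 4:
                  ret |= __get_user_inatomic(temp.v[2], p++);
                  /* fall through */
          case 2:
                  ret |= __get_user_inatomic(temp.v[3], p++);
          }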
      
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190730141917.21817-1-mpe@ellerman.id.au
    • powerpc/nvdimm: Pick nearby online node if the device node is not online · da1115fd
      Aneesh Kumar K.V authored
      
      
      Currently, the nvdimm subsystem expects the device NUMA node for an
      SCM device to be an online node, and it doesn't try to bring the
      device node online. Hence, if we use a non-online NUMA node as the
      device node, we hit crashes like the one below, because uninitialized
      NODE_DATA gets accessed in different code paths.
      
      cpu 0x0: Vector: 300 (Data Access) at [c0000000fac53170]
          pc: c0000000004bbc50: ___slab_alloc+0x120/0xca0
          lr: c0000000004bc834: __slab_alloc+0x64/0xc0
          sp: c0000000fac53400
         msr: 8000000002009033
         dar: 73e8
       dsisr: 80000
        current = 0xc0000000fabb6d80
        paca    = 0xc000000003870000   irqmask: 0x03   irq_happened: 0x01
          pid   = 7, comm = kworker/u16:0
      Linux version 5.2.0-06234-g76bd729b2644 (kvaneesh@ltc-boston123) (gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)) #135 SMP Thu Jul 11 05:36:30 CDT 2019
      enter ? for help
      [link register   ] c0000000004bc834 __slab_alloc+0x64/0xc0
      [c0000000fac53400] c0000000fac53480 (unreliable)
      [c0000000fac53500] c0000000004bc818 __slab_alloc+0x48/0xc0
      [c0000000fac53560] c0000000004c30a0 __kmalloc_node_track_caller+0x3c0/0x6b0
      [c0000000fac535d0] c000000000cfafe4 devm_kmalloc+0x74/0xc0
      [c0000000fac53600] c000000000d69434 nd_region_activate+0x144/0x560
      [c0000000fac536d0] c000000000d6b19c nd_region_probe+0x17c/0x370
      [c0000000fac537b0] c000000000d6349c nvdimm_bus_probe+0x10c/0x230
      [c0000000fac53840] c000000000cf3cc4 really_probe+0x254/0x4e0
      [c0000000fac538d0] c000000000cf429c driver_probe_device+0x16c/0x1e0
      [c0000000fac53950] c000000000cf0b44 bus_for_each_drv+0x94/0x130
      [c0000000fac539b0] c000000000cf392c __device_attach+0xdc/0x200
      [c0000000fac53a50] c000000000cf231c bus_probe_device+0x4c/0xf0
      [c0000000fac53a90] c000000000ced268 device_add+0x528/0x810
      [c0000000fac53b60] c000000000d62a58 nd_async_device_register+0x28/0xa0
      [c0000000fac53bd0] c0000000001ccb8c async_run_entry_fn+0xcc/0x1f0
      [c0000000fac53c50] c0000000001bcd9c process_one_work+0x46c/0x860
      [c0000000fac53d20] c0000000001bd4f4 worker_thread+0x364/0x5f0
      [c0000000fac53db0] c0000000001c7260 kthread+0x1b0/0x1c0
      [c0000000fac53e20] c00000000000b954 ret_from_kernel_thread+0x5c/0x68
      
      The patch tries to fix this by picking the nearest online node as the
      SCM node. This does mean we lose the information that the SCM node is
      equidistant from two other online nodes. If applications need to
      understand such fine-grained details, we should express them the way
      x86 does, via /sys/devices/system/node/nodeX/accessY/initiators/
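      A sketch of the selection logic, using only standard nodemask
      helpers (the helper name itself is hypothetical):

          static int nearest_online_node(int node)
          {
                  int nid, best = NUMA_NO_NODE, min_dist = INT_MAX;

                  if (node_online(node))
                          return node;

                  for_each_online_node(nid) {
                          int dist = node_distance(node, nid);

                          if (dist < min_dist) {
                                  min_dist = dist;
                                  best = nid;
                          }
                  }
                  return best;
          }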
      
      With the patch we get
      
       # numactl -H
      available: 2 nodes (0-1)
      node 0 cpus:
      node 0 size: 0 MB
      node 0 free: 0 MB
      node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
      node 1 size: 130865 MB
      node 1 free: 129130 MB
      node distances:
      node   0   1
        0:  10  20
        1:  20  10
       # cat /sys/bus/nd/devices/region0/numa_node
      0
       # dmesg | grep papr_scm
      [   91.332305] papr_scm ibm,persistent-memory:ibm,pmemory@44104001: Region registered with target node 2 and online node 0
      
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190729095128.23707-1-aneesh.kumar@linux.ibm.com
  7. Jul 29, 2019