  1. Apr 20, 2019
    • powerpc: Add force enable of DAWR on P9 option · c1fe190c
      Michael Neuling authored
      
      
      This adds a flag so that the DAWR can be enabled on P9 via:
        echo Y > /sys/kernel/debug/powerpc/dawr_enable_dangerous
      
      The DAWR was previously force disabled on POWER9 in:
        96541531 powerpc: Disable DAWR in the base POWER9 CPU features
      Also see Documentation/powerpc/DAWR-POWER9.txt
      
      This is a dangerous setting, USE AT YOUR OWN RISK.
      
      Some users may not care about a bad user crashing their box
      (i.e. single-user/desktop systems) and really want the DAWR.  This
      allows them to force enable the DAWR.
      
      This flag can also be used to disable DAWR access. Once it is
      cleared, all DAWR access is removed immediately and your machine is
      once again safe from crashing.
      
      Userspace may get confused by toggling this. If DAWR is force
      enabled/disabled between getting the number of breakpoints (via
      PTRACE_GETHWDBGINFO) and setting the breakpoint, userspace will get an
      inconsistent view of what's available. Similarly for guests.
      
      For the DAWR to be enabled in a KVM guest, the DAWR needs to be force
      enabled in the host AND the guest. For this reason, this won't work on
      POWERVM as it doesn't allow the HCALL to work. Writes of 'Y' to the
      dawr_enable_dangerous file will fail if the hypervisor doesn't support
      writing the DAWR.
      
      To double check the DAWR is working, run this kernel selftest:
        tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
      Any errors/failures/skips mean something is wrong.
      
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      c1fe190c
  2. Apr 11, 2019
  3. Mar 20, 2019
    • powerpc/mm: Only define MAX_PHYSMEM_BITS in SPARSEMEM configurations · 8bc08689
      Ben Hutchings authored
      
      
      MAX_PHYSMEM_BITS only needs to be defined if CONFIG_SPARSEMEM is
      enabled, and that was the case before commit 4ffe713b
      ("powerpc/mm: Increase the max addressable memory to 2PB").
      
      On 32-bit systems, where CONFIG_SPARSEMEM is not enabled, we now
      define it as 46.  That is larger than the real number of physical
      address bits, and breaks calculations in zsmalloc:
      
        mm/zsmalloc.c:130:49: warning: right shift count is negative
          MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
                                                         ^~
        ...
        mm/zsmalloc.c:253:21: error: variably modified 'size_class' at file scope
          struct size_class *size_class[ZS_SIZE_CLASSES];
                             ^~~~~~~~~~
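      The failing arithmetic can be reproduced in isolation. A minimal sketch,
      assuming zsmalloc's definitions at the time (OBJ_TAG_BITS = 1,
      _PFN_BITS = MAX_PHYSMEM_BITS - PAGE_SHIFT) and PAGE_SHIFT = 12 on a
      32-bit build; the variable names only mirror the kernel macros:

      ```c
      #include <assert.h>
      #include <stdio.h>

      int main(void)
      {
              /* Assumed values for a 32-bit build. */
              int bits_per_long = 32;
              int page_shift = 12;
              int obj_tag_bits = 1;
              int max_physmem_bits = 46;  /* the bogus unconditional define */

              int pfn_bits = max_physmem_bits - page_shift;                 /* 34 */
              int obj_index_bits = bits_per_long - pfn_bits - obj_tag_bits; /* -3 */

              printf("OBJ_INDEX_BITS = %d\n", obj_index_bits);
              /* A negative bit count is why gcc reports
               * "right shift count is negative". */
              assert(obj_index_bits < 0);
              return 0;
      }
      ```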
      
      Fixes: 4ffe713b ("powerpc/mm: Increase the max addressable memory to 2PB")
      Cc: stable@vger.kernel.org # v4.20+
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      8bc08689
  4. Mar 18, 2019
    • powerpc/vdso64: Fix CLOCK_MONOTONIC inconsistencies across Y2038 · b5b4453e
      Michael Ellerman authored
      Jakub Drnec reported:
        Setting the realtime clock can sometimes make the monotonic clock go
        back by over a hundred years. Decreasing the realtime clock across
        the y2k38 threshold is one reliable way to reproduce. Allegedly this
        can also happen just by running ntpd, I have not managed to
        reproduce that other than booting with rtc at >2038 and then running
        ntp. When this happens, anything with timers (e.g. openjdk) breaks
        rather badly.
      
      And included a test case (slightly edited for brevity):
        #define _POSIX_C_SOURCE 199309L
        #include <stdio.h>
        #include <time.h>
        #include <stdlib.h>
        #include <unistd.h>
      
        long get_time(void) {
          struct timespec tp;
          clock_gettime(CLOCK_MONOTONIC, &tp);
          return tp.tv_sec + tp.tv_nsec / 1000000000;
        }
      
        int main(void) {
          long last = get_time();
          while(1) {
            long now = get_time();
            if (now < last) {
              printf("clock went backwards by %ld seconds!\n", last - now);
            }
            last = now;
            sleep(1);
          }
          return 0;
        }
      
      Which when run concurrently with:
       # date -s 2040-1-1
       # date -s 2037-1-1
      
      Will detect the clock going backward.
      
      The root cause is that wtom_clock_sec in struct vdso_data is only a
      32-bit signed value, even though we set its value to be equal to
      tk->wall_to_monotonic.tv_sec which is 64-bits.
      
      Because the monotonic clock starts at zero when the system boots, the
      wall_to_monotonic.tv_sec offset is negative for current and future
      dates. Currently on a freshly booted system the offset will be in the
      vicinity of negative 1.5 billion seconds.
      
      However if the wall clock is set past the Y2038 boundary, the offset
      from wall to monotonic becomes less than negative 2^31, and no longer
      fits in 32-bits. When that value is assigned to wtom_clock_sec it is
      truncated and becomes positive, causing the VDSO assembly code to
      calculate CLOCK_MONOTONIC incorrectly.
      
      That causes CLOCK_MONOTONIC to jump ahead by ~4 billion seconds which
      it is not meant to do. Worse, if the time is then set back before the
      Y2038 boundary CLOCK_MONOTONIC will jump backward.
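      The sign flip from the narrowing assignment can be seen in isolation.
      A minimal sketch; the offset value below is illustrative, merely chosen
      below -2^31 as described above, and assumes the usual two's-complement
      truncation on the relevant targets:

      ```c
      #include <assert.h>
      #include <stdint.h>
      #include <stdio.h>

      int main(void)
      {
              /* Illustrative offset: boot-time ~-1.5e9 pushed below -2^31
               * by setting the wall clock past Y2038. */
              int64_t wtom64 = -2200000000LL;
              int32_t wtom32 = (int32_t)wtom64; /* what storing into the
                                                 * 32-bit wtom_clock_sec did */

              printf("64-bit: %lld  truncated 32-bit: %d\n",
                     (long long)wtom64, (int)wtom32);
              assert(wtom64 < INT32_MIN);
              assert(wtom32 > 0); /* sign flipped: CLOCK_MONOTONIC jumps ahead */
              return 0;
      }
      ```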
      
      We can fix it simply by storing the full 64-bit offset in the
      vdso_data, and using that in the VDSO assembly code. We also shuffle
      some of the fields in vdso_data to avoid creating a hole.
      
      The original commit that added the CLOCK_MONOTONIC support to the VDSO
      did actually use a 64-bit value for wtom_clock_sec, see commit
      a7f290da ("[PATCH] powerpc: Merge vdso's and add vdso support to
      32 bits kernel") (Nov 2005). However just 3 days later it was
      converted to 32-bits in commit 0c37ec2a ("[PATCH] powerpc: vdso
      fixes (take #2)"), and the bug has existed since then AFAICS.
      
      Fixes: 0c37ec2a ("[PATCH] powerpc: vdso fixes (take #2)")
      Cc: stable@vger.kernel.org # v2.6.15+
      Link: http://lkml.kernel.org/r/HaC.ZfES.62bwlnvAvMP.1STMMj@seznam.cz
      
      
      Reported-by: Jakub Drnec <jaydee@email.cz>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      b5b4453e
  5. Mar 17, 2019
  6. Mar 06, 2019
  7. Mar 05, 2019
    • powerpc/hugetlb: Don't do runtime allocation of 16G pages in LPAR configuration · 35f2806b
      Aneesh Kumar K.V authored
      
      
      We added runtime allocation of 16G pages in commit 4ae279c2
      ("powerpc/mm/hugetlb: Allow runtime allocation of 16G.") That was done
      to enable 16G allocation on PowerNV and KVM configs. In the KVM
      config, we would mostly have the entire guest RAM backed by 16G
      hugetlb pages for this to work. PAPR does support partial backing of
      guest RAM with hugepages, via the ibm,expected#pages property of the
      memory node in the device tree. This means the rest of the guest RAM
      won't be backed by 16G contiguous pages in the host, and hence a hash
      page table insertion can fail in such a case.
      
      An example error message will look like
      
        hash-mmu: mm: Hashing failure ! EA=0x7efc00000000 access=0x8000000000000006 current=readback
        hash-mmu:     trap=0x300 vsid=0x67af789 ssize=1 base psize=14 psize 14 pte=0xc000000400000386
        readback[12260]: unhandled signal 7 at 00007efc00000000 nip 00000000100012d0 lr 000000001000127c code 2
      
      This patch addresses that by preventing runtime allocation of 16G
      hugepages in the LPAR config. To allocate 16G hugetlb pages, one
      needs to specify on the kernel command line:
        hugepagesz=16G hugepages=<number of 16G pages>
      
      With radix translation mode we don't run into this issue.
      
      This change will prevent runtime allocation of 16G hugetlb pages on
      KVM with hash translation mode. However, with the current upstream it
      was observed that a 16G hugetlbfs-backed guest doesn't boot at all.
      
      We observe boot failure with the below message:
        [131354.647546] KVM: map_vrma at 0 failed, ret=-4
      
      That means this patch is not resulting in an observable regression.
      Once we fix the boot issue with 16G hugetlb backed memory, we need to
      use ibm,expected#pages memory node attribute to indicate 16G page
      reservation to the guest. This will also enable partial backing of
      guest RAM with 16G pages.
      
      Fixes: 4ae279c2 ("powerpc/mm/hugetlb: Allow runtime allocation of 16G.")
      Cc: stable@vger.kernel.org # v4.14+
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      35f2806b
  8. Mar 04, 2019
    • get rid of legacy 'get_ds()' function · 736706be
      Linus Torvalds authored
      
      
      Every in-kernel use of this function defined it to KERNEL_DS (either as
      an actual define, or as an inline function).  It's an entirely
      historical artifact, and long long long ago used to actually read the
      segment selector value of '%ds' on x86.
      
      Which in the kernel is always KERNEL_DS.
      
      Inspired by a patch from Jann Horn that just did this for a very small
      subset of users (the ones in fs/), along with Al who suggested a script.
      I then just took it to the logical extreme and removed all the remaining
      gunk.
      
      Roughly scripted with
      
         git grep -l '(get_ds())' -- :^tools/ | xargs sed -i 's/(get_ds())/(KERNEL_DS)/'
         git grep -lw 'get_ds' -- :^tools/ | xargs sed -i '/^#define get_ds()/d'
      
      plus manual fixups to remove a few unusual usage patterns, the couple of
      inline function cases and to fix up a comment that had become stale.
      
      The 'get_ds()' function remains in an x86 kvm selftest, since in user
      space it actually does something relevant.
      
      Inspired-by: Jann Horn <jannh@google.com>
      Inspired-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      736706be
  9. Mar 01, 2019
  10. Feb 26, 2019
    • KVM: PPC: Fix compilation when KVM is not enabled · e74d53e3
      Paul Mackerras authored
      
      
      Compiling with CONFIG_PPC_POWERNV=y and KVM disabled currently gives
      an error like this:
      
        CC      arch/powerpc/kernel/dbell.o
      In file included from arch/powerpc/kernel/dbell.c:20:0:
      arch/powerpc/include/asm/kvm_ppc.h: In function ‘xics_on_xive’:
      arch/powerpc/include/asm/kvm_ppc.h:625:9: error: implicit declaration of function ‘xive_enabled’ [-Werror=implicit-function-declaration]
        return xive_enabled() && cpu_has_feature(CPU_FTR_HVMODE);
               ^
      cc1: all warnings being treated as errors
      scripts/Makefile.build:276: recipe for target 'arch/powerpc/kernel/dbell.o' failed
      make[3]: *** [arch/powerpc/kernel/dbell.o] Error 1
      
      Fix this by making the xics_on_xive() definition conditional on the
      same symbol (CONFIG_KVM_BOOK3S_64_HANDLER) that determines whether we
      include <asm/xive.h> or not, since that's the header that defines
      xive_enabled().
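      The shape of the fix can be sketched with stand-in names (a toy model,
      not the kernel code: the _stub functions and the hardcoded #define are
      assumptions for illustration). The point is that the helper is only
      defined when the same symbol that pulls in its dependency is set, so
      no configuration ever references an undeclared function:

      ```c
      #include <assert.h>

      #define CONFIG_KVM_BOOK3S_64_HANDLER 1  /* stand-in for the Kconfig symbol */

      #ifdef CONFIG_KVM_BOOK3S_64_HANDLER
      /* stand-in for xive_enabled(), normally supplied by <asm/xive.h> */
      static int xive_enabled_stub(void) { return 1; }

      /* helper guarded by the same symbol as its dependency */
      static int xics_on_xive_stub(void) { return xive_enabled_stub(); }
      #endif

      int main(void)
      {
      #ifdef CONFIG_KVM_BOOK3S_64_HANDLER
              assert(xics_on_xive_stub() == 1);
      #endif
              /* With the symbol unset, neither function exists, so there is
               * no implicit declaration to trip -Werror. */
              return 0;
      }
      ```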
      
      Fixes: 03f95332 ("KVM: PPC: Book3S: Allow XICS emulation to work in nested hosts using XIVE")
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      e74d53e3
    • powerpc/powernv: move OPAL call wrapper tracing and interrupt handling to C · 75d9fc7f
      Nicholas Piggin authored
      
      
      The OPAL call wrapper gets interrupt disabling wrong. It disables
      interrupts just by clearing MSR[EE], which has two problems:
      
      - It doesn't call into the IRQ tracing subsystem, which means tracing
        across OPAL calls does not always notice IRQs have been disabled.
      
      - It doesn't go through the IRQ soft-mask code, which causes a minor
        bug. MSR[EE] can not be restored by saving the MSR then clearing
        MSR[EE], because a racing interrupt while soft-masked could clear
        MSR[EE] between the two steps. This can cause MSR[EE] to be
        incorrectly enabled when the OPAL call returns. Fortunately that
        should only result in another masked interrupt being taken to
        disable MSR[EE] again, but it's a bit sloppy.
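      The race in the second point can be sketched as a toy model (the names
      are stand-ins; the real MSR[EE] handling uses mfmsr/mtmsrd in asm):

      ```c
      #include <assert.h>

      static int msr_ee = 1;  /* models MSR[EE]: interrupts enabled */

      static void racing_masked_interrupt(void)
      {
              msr_ee = 0;  /* a soft-masked interrupt disables EE and returns */
      }

      int main(void)
      {
              int saved = msr_ee;          /* step 1: save MSR (EE was 1)     */
              racing_masked_interrupt();   /* interrupt lands between steps   */
              msr_ee = 0;                  /* step 2: clear EE for the call   */

              msr_ee = saved;              /* restore the stale saved MSR     */
              assert(msr_ee == 1);         /* EE incorrectly enabled while
                                            * still soft-masked: the bug     */
              return 0;
      }
      ```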
      
      The existing code also saves MSR to PACA, which is not re-entrant if
      there is a nested OPAL call from different MSR contexts, which can
      happen these days with SRESET interrupts on bare metal.
      
      To fix these issues, move the tracing and IRQ handling code to C, and
      call into asm just for the low level call when everything is ready to
      go. Save the MSR on stack rather than PACA.
      
      Performance cost is kept to a minimum with a few optimisations:
      
      - The endian switch upon return is combined with the MSR restore,
        which avoids an expensive context synchronizing operation for LE
        kernels. This makes up for the additional mtmsrd to enable
        interrupts with local_irq_enable().
      
      - blr is now used to return from the opal_* functions that are called
        as C functions, to avoid link stack corruption. This requires a
        skiboot fix as well to keep the call stack balanced.
      
      A NULL call is more costly after this (410ns->430ns on POWER9), but
      OPAL calls are generally not performance critical at this scale.
      
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      75d9fc7f
    • powerpc/64s: Fix HV NMI vs HV interrupt recoverability test · ccd47702
      Nicholas Piggin authored
      
      
      HV interrupts that use HSRR registers do not enter with MSR[RI] clear,
      but their entry code is not recoverable vs NMI, due to shared use of
      HSPRG1 as a scratch register to save r13.
      
      This means that a system reset or machine check that hits in HSRR
      interrupt entry can cause r13 to be silently corrupted.
      
      Fix this by marking NMIs non-recoverable if they land in HV interrupt
      ranges.
      
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      ccd47702
  11. Feb 25, 2019
  12. Feb 23, 2019
  13. Feb 21, 2019