  1. Nov 28, 2016
  2. Nov 24, 2016
  3. Nov 23, 2016
    • KVM: PPC: Book3S HV: Update kvmppc_set_arch_compat() for ISA v3.00 · 2ee13be3
      Suraj Jitindar Singh authored
      
      
      The function kvmppc_set_arch_compat() is used to determine the value of the
      processor compatibility register (PCR) for a guest running in a given
      compatibility mode. There is currently no support for v3.00 of the ISA.
      
      Add support for v3.00 of the ISA, which adds an ISA v2.07 compatibility mode
      to the PCR.
      
      We also add a check to ensure the processor we are running on is capable of
      emulating the chosen processor (for example a POWER7 cannot emulate a
      POWER8, similarly with a POWER8 and a POWER9).
      
      Based on work by: Paul Mackerras <paulus@ozlabs.org>
      
      [paulus@ozlabs.org - moved dummy PCR_ARCH_300 definition here; set
       guest_pcr_bit when arch_compat == 0, added comment.]
      
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      2ee13be3
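The capability check described in this commit can be sketched in user-space C. This is a minimal illustration only: the `isa_level` enum and `select_compat_level()` are hypothetical stand-ins for the kernel's PCR/PVR constants and for kvmppc_set_arch_compat(), and it assumes that a requested value of 0 means "no compatibility mode".

```c
/* Hypothetical ordinal levels standing in for the real PCR/PVR constants. */
enum isa_level { ISA_V205 = 1, ISA_V206, ISA_V207, ISA_V300 };

/*
 * Pick the level the guest will run at.
 * requested == 0 means no compatibility mode: run at the host level.
 * Returns 0 on success, -1 if the host cannot emulate the request.
 */
static int select_compat_level(enum isa_level host, int requested,
                               enum isa_level *effective)
{
    enum isa_level guest = host;

    if (requested != 0) {
        if (requested < ISA_V205 || requested > ISA_V300)
            return -1;              /* unknown level */
        guest = (enum isa_level)requested;
    }

    /* A processor can emulate itself and anything older, never newer. */
    if (guest > host)
        return -1;

    *effective = guest;
    return 0;
}
```

With these assumptions, a POWER8-class host (`ISA_V207`) rejects a request for `ISA_V300`, while a POWER9-class host accepts a request for `ISA_V207`.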
    • KVM: PPC: Book3S HV: Treat POWER9 CPU threads as independent subcores · 45c940ba
      Paul Mackerras authored
      
      
      With POWER9, each CPU thread has its own MMU context and can be
      in the host or a guest independently of the other threads; there is
      still however a restriction that all threads must use the same type
      of address translation, either radix tree or hashed page table (HPT).
      
      Since we only support HPT guests on a HPT host at this point, we
      can treat the threads as being independent, and avoid all of the
      work of coordinating the CPU threads.  To make this simpler, we
      introduce a new threads_per_vcore() function that returns 1 on
      POWER9 and threads_per_subcore on POWER7/8, and use that instead
      of threads_per_subcore or threads_per_core in various places.
      
      This also changes the value of the KVM_CAP_PPC_SMT capability on
      POWER9 systems from 4 to 1, so that userspace will not try to
      create VMs with multiple vcpus per vcore.  (If userspace did create
      a VM that thought it was in an SMT mode, the VM might try to use
      the msgsndp instruction, which will not work as expected.  In
      future it may be possible to trap and emulate msgsndp in order to
      allow VMs to think they are in an SMT mode, if only for the purpose
      of allowing migration from POWER8 systems.)
      
      With all this, we can now run guests on POWER9 as long as the host
      is running with HPT translation.  Since userspace currently has no
      way to request radix tree translation for the guest, the guest has
      no choice but to use HPT translation.
      
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      45c940ba
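The threads_per_vcore() helper described above might look roughly like the following. This is a sketch under stated assumptions: `struct host_cpu` and its `is_power9` flag are hypothetical replacements for the kernel's CPU feature test and its threads_per_subcore global.

```c
#include <stdbool.h>

/* Hypothetical description of the host; not a kernel structure. */
struct host_cpu {
    bool is_power9;           /* assumption: true on ISA v3.00 hardware */
    int threads_per_subcore;  /* threads sharing a subcore on POWER7/8 */
};

/* Number of hardware threads KVM must coordinate for one virtual core. */
static int threads_per_vcore(const struct host_cpu *cpu)
{
    if (cpu->is_power9)
        return 1;             /* each thread is treated as independent */
    return cpu->threads_per_subcore;
}
```

Callers that previously sized per-vcore loops by threads_per_subcore would use this value instead, so on POWER9 those loops run exactly once.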
    • KVM: PPC: Book3S HV: Enable hypervisor virtualization interrupts while in guest · 84f7139c
      Paul Mackerras authored
      
      
      The new XIVE interrupt controller on POWER9 can direct external
      interrupts to the hypervisor or the guest.  The interrupts directed to
      the hypervisor are controlled by an LPCR bit called LPCR_HVICE, and
      come in as a "hypervisor virtualization interrupt".  This sets the
      LPCR bit so that hypervisor virtualization interrupts can occur while
      we are in the guest.  We then also need to cope with exiting the guest
      because of a hypervisor virtualization interrupt.
      
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      84f7139c
    • KVM: PPC: Book3S HV: Use stop instruction rather than nap on POWER9 · bf53c88e
      Paul Mackerras authored
      
      
      POWER9 replaces the various power-saving mode instructions on POWER8
      (doze, nap, sleep and rvwinkle) with a single "stop" instruction, plus
      a register, PSSCR, which controls the depth of the power-saving mode.
      This replaces the use of the nap instruction when threads are idle
      during guest execution with the stop instruction, and adds code to
      set PSSCR to a value which will allow an SMT mode switch while the
      thread is idle (given that the core as a whole won't be idle in these
      cases).
      
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      bf53c88e
    • KVM: PPC: Book3S HV: Use OPAL XICS emulation on POWER9 · f725758b
      Paul Mackerras authored
      
      
      POWER9 includes a new interrupt controller, called XIVE, which is
      quite different from the XICS interrupt controller on POWER7 and
      POWER8 machines.  KVM-HV accesses the XICS directly in several places
      in order to send and clear IPIs and handle interrupts from PCI
      devices being passed through to the guest.
      
      In order to make the transition to XIVE easier, OPAL firmware will
      include an emulation of XICS on top of XIVE.  Access to the emulated
      XICS is via OPAL calls.  The one complication is that the EOI
      (end-of-interrupt) function can now return a value indicating that
      another interrupt is pending; in this case, the XIVE will not signal
      an interrupt in hardware to the CPU, and software is supposed to
      acknowledge the new interrupt without waiting for another interrupt
      to be delivered in hardware.
      
      This adapts KVM-HV to use the OPAL calls on machines where there is
      no XICS hardware.  When there is no XICS, we look for a device-tree
      node with "ibm,opal-intc" in its compatible property, which is how
      OPAL indicates that it provides XICS emulation.
      
      In order to handle the EOI return value, kvmppc_read_intr() has
      become kvmppc_read_one_intr(), with a boolean variable passed by
      reference which can be set by the EOI functions to indicate that
      another interrupt is pending.  The new kvmppc_read_intr() keeps
      calling kvmppc_read_one_intr() until there are no more interrupts
      to process.  The return value from kvmppc_read_intr() is the
      largest non-zero value of the returns from kvmppc_read_one_intr().
      
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      f725758b
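The read loop described in this commit (keep reading while the EOI path reports another pending interrupt, and return the largest non-zero result) can be modelled as below. The scripted `intr_script` source and the unprefixed function names are hypothetical; the kernel functions take no such argument.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical scripted interrupt source, used only for this sketch. */
struct intr_script {
    const long *results;  /* value returned by each successive read */
    size_t count;
    size_t next;
};

/* Read one interrupt; set *again if another one is already pending. */
static long read_one_intr(struct intr_script *s, bool *again)
{
    long rc = 0;

    if (s->next < s->count)
        rc = s->results[s->next++];
    *again = s->next < s->count;
    return rc;
}

/* Drain all pending interrupts, keeping the largest non-zero result. */
static long read_intr(struct intr_script *s)
{
    long ret = 0;
    bool again;

    do {
        again = false;
        long rc = read_one_intr(s, &again);
        if (rc != 0 && (ret == 0 || rc > ret))
            ret = rc;
    } while (again);

    return ret;
}
```

Passing the flag by reference lets the EOI path report a pending interrupt without changing the meaning of the return value.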
    • KVM: PPC: Book3S HV: Use msgsnd for IPIs to other cores on POWER9 · 1704a81c
      Paul Mackerras authored
      
      
      On POWER9, the msgsnd instruction is able to send interrupts to
      other cores, as well as other threads on the local core.  Since
      msgsnd is generally simpler and faster than sending an IPI via the
      XICS, we use msgsnd for all IPIs sent by KVM on POWER9.
      
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      1704a81c
    • KVM: PPC: Book3S HV: Adapt TLB invalidations to work on POWER9 · 7c5b06ca
      Paul Mackerras authored
      
      
      POWER9 adds new capabilities to the tlbie (TLB invalidate entry)
      and tlbiel (local tlbie) instructions.  Both instructions get a
      set of new parameters (RIC, PRS and R) which appear as bits in the
      instruction word.  The tlbiel instruction now has a second register
      operand, which contains a PID and/or LPID value if needed, and
      should otherwise contain 0.
      
      This adapts KVM-HV's usage of tlbie and tlbiel to work on POWER9
      as well as older processors.  Since we only handle HPT guests so
      far, we need RIC=0 PRS=0 R=0, which ends up with the same instruction
      word as on previous processors, so we don't need to conditionally
      execute different instructions depending on the processor.
      
      The local flush on first entry to a guest in book3s_hv_rmhandlers.S
      is a loop which depends on the number of TLB sets.  Rather than
      using feature sections to set the number of iterations based on
      which CPU we're on, we now work out this number at VM creation time
      and store it in the kvm_arch struct.  That will make it possible to
      get the number from the device tree in future, which will help with
      compatibility with future processors.
      
      Since mmu_partition_table_set_entry() does a global flush of the
      whole LPID, we don't need to do the TLB flush on first entry to the
      guest on each processor.  Therefore we don't set all bits in the
      tlb_need_flush bitmap on VM startup on POWER9.
      
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      7c5b06ca
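Computing the number of TLB sets once at VM creation, rather than patching it in per CPU type, could be sketched as follows. The `vm_arch` structure, the `cpu_gen` enum and the set counts are illustrative assumptions, not values taken from the commit; the real counts should be checked against the kernel headers.

```c
#include <stddef.h>

enum cpu_gen { CPU_POWER7, CPU_POWER8, CPU_POWER9 };

/* Hypothetical per-VM state; the commit stores this in the kvm_arch struct. */
struct vm_arch {
    unsigned int tlb_sets;
};

/* Decide the loop bound once, when the VM is created. */
static void vm_init_tlb_sets(struct vm_arch *arch, enum cpu_gen gen)
{
    switch (gen) {
    case CPU_POWER9:
        arch->tlb_sets = 256;   /* assumed value for HPT mode */
        break;
    case CPU_POWER8:
        arch->tlb_sets = 512;   /* assumed value */
        break;
    default:
        arch->tlb_sets = 128;   /* assumed value for POWER7 */
        break;
    }
}

/* Model of the flush loop: one invalidation per set, count returned. */
static unsigned int flush_all_sets(const struct vm_arch *arch,
                                   void (*invalidate_set)(unsigned int set))
{
    unsigned int set, issued = 0;

    for (set = 0; set < arch->tlb_sets; set++) {
        if (invalidate_set)
            invalidate_set(set);
        issued++;
    }
    return issued;
}
```

Because the loop reads its bound from per-VM data, a later change could fill `tlb_sets` from the device tree without touching the flush loop.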
    • KVM: PPC: Book3S HV: Add new POWER9 guest-accessible SPRs · e9cf1e08
      Paul Mackerras authored
      
      
      This adds code to handle two new guest-accessible special-purpose
      registers on POWER9: TIDR (thread ID register) and PSSCR (processor
      stop status and control register).  They are context-switched
      between host and guest, and the guest values can be read and set
      via the one_reg interface.
      
      The PSSCR contains some fields which are guest-accessible and some
      which are only accessible in hypervisor mode.  We only allow the
      guest-accessible fields to be read or set by userspace.
      
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      e9cf1e08
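Restricting userspace to the guest-accessible PSSCR fields amounts to masking on both the set and the get path. The sketch below assumes a mask value and uses hypothetical names (`PSSCR_GUEST_VISIBLE`, `vcpu_regs`); neither is taken from the commit.

```c
#include <stdint.h>

/* Assumed mask of guest-accessible PSSCR bits; illustrative only. */
#define PSSCR_GUEST_VISIBLE 0xf0000000000003ffULL

/* Hypothetical per-vcpu register file for this sketch. */
struct vcpu_regs {
    uint64_t psscr;
};

/* Set path: hypervisor-only bits supplied by userspace are dropped. */
static void set_guest_psscr(struct vcpu_regs *regs, uint64_t val)
{
    regs->psscr = val & PSSCR_GUEST_VISIBLE;
}

/* Get path: only the guest-accessible fields are reported. */
static uint64_t get_guest_psscr(const struct vcpu_regs *regs)
{
    return regs->psscr & PSSCR_GUEST_VISIBLE;
}
```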
    • KVM: PPC: Book3S HV: Adjust host/guest context switch for POWER9 · 83677f55
      Paul Mackerras authored
      
      
      Some special-purpose registers that were present and accessible
      by guests on POWER8 no longer exist on POWER9, so this adds
      feature sections to ensure that we don't try to context-switch
      them when going into or out of a guest on POWER9.  These are
      all relatively obscure, rarely-used registers, but we had to
      context-switch them on POWER8 to avoid creating a covert channel.
      They are: SPMC1, SPMC2, MMCRS, CSIGR, TACR, TCSCR, and ACOP.
      
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      83677f55
    • KVM: PPC: Book3S HV: Set partition table rather than SDR1 on POWER9 · 7a84084c
      Paul Mackerras authored
      
      
      On POWER9, the SDR1 register (hashed page table base address) is no
      longer used, and instead the hardware reads the HPT base address
      and size from the partition table.  The partition table entry also
      contains the bits that specify the page size for the VRMA mapping,
      which were previously in the LPCR.  The VPM0 bit of the LPCR is
      now reserved; the processor now always uses the VRMA (virtual
      real-mode area) mechanism for guest real-mode accesses in HPT mode,
      and the RMO (real-mode offset) mechanism has been dropped.
      
      When entering or exiting the guest, we now only have to set the
      LPIDR (logical partition ID register), not the SDR1 register.
      There is also no requirement now to transition via a reserved
      LPID value.
      
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      7a84084c
    • KVM: PPC: Book3S HV: Adapt to new HPTE format on POWER9 · abb7c7dd
      Paul Mackerras authored
      
      
      This adapts the KVM-HV hashed page table (HPT) code to read and write
      HPT entries in the new format defined in Power ISA v3.00 on POWER9
      machines.  The new format moves the B (segment size) field from the
      first doubleword to the second, and trims some bits from the AVA
      (abbreviated virtual address) and ARPN (abbreviated real page number)
      fields.  As far as possible, the conversion is done when reading or
      writing the HPT entries, and the rest of the code continues to use
      the old format.
      
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      abb7c7dd
    • powerpc/powernv: Define and set POWER9 HFSCR doorbell bit · 02ed21ae
      Michael Neuling authored
      
      
      Define and set the POWER9 HFSCR doorbell bit so that guests can use
      msgsndp.
      
      ISA 3.0 calls this MSGP, so name it accordingly in the code.
      
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      02ed21ae
  4. Nov 22, 2016
  5. Nov 21, 2016
  6. Nov 16, 2016
    • powerpc/64: Simplify adaptation to new ISA v3.00 HPTE format · 6b243fcf
      Paul Mackerras authored
      
      
      This changes the way that we support the new ISA v3.00 HPTE format.
      Instead of adapting everything that uses HPTE values to handle either
      the old format or the new format, depending on which CPU we are on,
      we now convert explicitly between old and new formats if necessary
      in the low-level routines that actually access HPTEs in memory.
      This limits the amount of code that needs to know about the new
      format and makes the conversions explicit.  This is OK because the
      old format contains all the information that is in the new format.
      
      This also fixes operation under a hypervisor, because the H_ENTER
      hypercall (and other hypercalls that deal with HPTEs) will continue
      to require the HPTE value to be supplied in the old format.  At
      present the kernel will not boot in HPT mode on POWER9 under a
      hypervisor.
      
      This fixes and partially reverts commit 50de596d
      ("powerpc/mm/hash: Add support for Power9 Hash", 2016-04-29).
      
      Fixes: 50de596d ("powerpc/mm/hash: Add support for Power9 Hash")
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      6b243fcf
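Explicit conversion between the old and new HPTE formats at the point of access might be sketched like this. The helper names, the bit positions of the B (segment size) field and the width of the common bits are assumptions made for illustration; the commit message only states that B moves from the first doubleword to the second and that some high bits are trimmed.

```c
#include <stdint.h>

/* Assumed layout: B in the top two bits of the old first doubleword. */
#define OLD_V_SSIZE_SHIFT   62
/* Assumed layout: B at bits 58-59 of the new second doubleword. */
#define NEW_R_SSIZE_SHIFT   58
#define NEW_R_SSIZE_MASK    (3ULL << NEW_R_SSIZE_SHIFT)
/* Assumed set of first-doubleword bits shared by both formats. */
#define V_COMMON_BITS       0x000fffffffffffffULL

/* Old to new, first doubleword: drop B and the trimmed high bits. */
static uint64_t old_to_new_v(uint64_t v)
{
    return v & V_COMMON_BITS;
}

/* Old to new, second doubleword: insert B taken from the old first one. */
static uint64_t old_to_new_r(uint64_t v, uint64_t r)
{
    return (r & ~NEW_R_SSIZE_MASK) |
           ((v >> OLD_V_SSIZE_SHIFT) << NEW_R_SSIZE_SHIFT);
}

/* New to old, first doubleword: move B back to the top two bits. */
static uint64_t new_to_old_v(uint64_t v, uint64_t r)
{
    return (v & V_COMMON_BITS) |
           ((r & NEW_R_SSIZE_MASK) << (OLD_V_SSIZE_SHIFT - NEW_R_SSIZE_SHIFT));
}

/* New to old, second doubleword: clear the B field. */
static uint64_t new_to_old_r(uint64_t r)
{
    return r & ~NEW_R_SSIZE_MASK;
}
```

Keeping these conversions in the routines that touch HPTEs in memory means the rest of the code, and hypercalls such as H_ENTER, can keep using the old format throughout.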
  7. Oct 29, 2016
  8. Oct 28, 2016
    • perf/powerpc: Don't call perf_event_disable() from atomic context · 5aab90ce
      Jiri Olsa authored
      
      
      The trinity syscall fuzzer triggered the following WARN() on powerpc:
      
        WARNING: CPU: 9 PID: 2998 at arch/powerpc/kernel/hw_breakpoint.c:278
        ...
        NIP [c00000000093aedc] .hw_breakpoint_handler+0x28c/0x2b0
        LR [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0
        Call Trace:
        [c0000002f7933580] [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0 (unreliable)
        [c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0
        [c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0
        [c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0
        [c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100
        [c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48
      
      Followed by a lockdep warning:
      
        ===============================
        [ INFO: suspicious RCU usage. ]
        4.8.0-rc5+ #7 Tainted: G        W
        -------------------------------
        ./include/linux/rcupdate.h:556 Illegal context switch in RCU read-side critical section!
      
        other info that might help us debug this:
      
        rcu_scheduler_active = 1, debug_locks = 0
        2 locks held by ls/2998:
         #0:  (rcu_read_lock){......}, at: [<c0000000000f6a00>] .__atomic_notifier_call_chain+0x0/0x1c0
         #1:  (rcu_read_lock){......}, at: [<c00000000093ac50>] .hw_breakpoint_handler+0x0/0x2b0
      
        stack backtrace:
        CPU: 9 PID: 2998 Comm: ls Tainted: G        W       4.8.0-rc5+ #7
        Call Trace:
        [c0000002f7933150] [c00000000094b1f8] .dump_stack+0xe0/0x14c (unreliable)
        [c0000002f79331e0] [c00000000013c468] .lockdep_rcu_suspicious+0x138/0x180
        [c0000002f7933270] [c0000000001005d8] .___might_sleep+0x278/0x2e0
        [c0000002f7933300] [c000000000935584] .mutex_lock_nested+0x64/0x5a0
        [c0000002f7933410] [c00000000023084c] .perf_event_ctx_lock_nested+0x16c/0x380
        [c0000002f7933500] [c000000000230a80] .perf_event_disable+0x20/0x60
        [c0000002f7933580] [c00000000093aeec] .hw_breakpoint_handler+0x29c/0x2b0
        [c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0
        [c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0
        [c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0
        [c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100
        [c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48
      
      While it looks like the first WARN() is probably valid, the other one is
      triggered by disabling the event via perf_event_disable() from atomic context.
      
      The event is disabled here in case we were not able to emulate
      the instruction that hit the breakpoint. By disabling the event
      we unschedule the event and make sure it's not scheduled back.
      
      But we can't call perf_event_disable() from atomic context; instead
      we need to use the event's pending_disable irq_work method to disable it.
      
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Neuling <mikey@neuling.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20161026094824.GA21397@krava
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5aab90ce
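The deferred-disable pattern this commit relies on can be modelled in plain C: the atomic-context path only records a request, and the work that may sleep runs later. The `event` structure and both function names here are hypothetical; in the kernel the second step would run from the event's irq_work callback.

```c
#include <stdbool.h>

/* Hypothetical, much-reduced model of a perf event. */
struct event {
    bool enabled;
    bool pending_disable;
};

/* Safe in atomic context: takes no locks, only records the request. */
static void event_disable_inatomic(struct event *ev)
{
    ev->pending_disable = true;
}

/* Runs later, in a context where disabling the event is allowed. */
static void event_run_pending(struct event *ev)
{
    if (ev->pending_disable) {
        ev->pending_disable = false;
        ev->enabled = false;
    }
}
```

The breakpoint handler would call only the first function, so it never takes the mutex that triggered the lockdep warning.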
  9. Oct 27, 2016
    • powerpc/64s: relocation, register save fixes for system reset interrupt · fb479e44
      Nicholas Piggin authored
      
      
      This patch does a couple of things. First of all, powernv immediately
      explodes when running a relocated kernel, because the system reset
      exception for handling sleeps does not do correct relocated branches.
      
      Secondly, the sleep handling code trashes the condition and cfar
      registers, which we would like to preserve for debugging purposes (for
      non-sleep case exception).
      
      This patch changes the exception to use the standard format that saves
      registers before any tests or branches are made. It adds the test for
      idle-wakeup as an "extra" to break out of the normal exception path.
      Then it branches to a relocated idle handler that calls the various
      idle handling functions.
      
      After this patch, the POWER8 CPU simulator now boots a powernv kernel that is
      running at a non-zero address.
      
      Fixes: 948cf67c ("powerpc: Add NAP mode support on Power7 in HV mode")
      Cc: stable@vger.kernel.org # v3.0+
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Acked-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      fb479e44
    • powerpc/mm/radix: Use tlbiel only if we ever ran on the current cpu · bd77c449
      Aneesh Kumar K.V authored
      
      
      Before this patch, we used tlbiel if the mm had only ever run on this core.
      That was mostly derived from the nohash usage of the same, but it is
      incorrect. ISA 3.0 clarifies tlbiel such that:
      
      "All TLB entries that have all of the following properties are made
      invalid on the thread executing the tlbiel instruction"
      
      ie. tlbiel only invalidates TLB entries on the current thread. So if the
      mm has been used on any other thread (aka. cpu) then we must broadcast
      the invalidate.
      
      This bug could lead to invalid TLB entries if a program runs on multiple
      threads of a core.
      
      Hence use tlbiel only if the mm has only ever run on the current cpu.
      
      Fixes: 1a472c9d ("powerpc/mm/radix: Add tlbflush routines")
      Cc: stable@vger.kernel.org # v4.7+
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      bd77c449
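The corrected rule (local invalidate only when the mm has run on no cpu other than the current one) can be sketched with a bitmask. The 64-bit mask and the function names are assumptions for illustration, not the kernel's cpumask API; `this_cpu` is assumed to be below 64.

```c
#include <stdbool.h>
#include <stdint.h>

enum flush_kind { FLUSH_LOCAL_TLBIEL, FLUSH_BROADCAST_TLBIE };

/* mm_cpumask has one bit per cpu (hardware thread) the mm has run on. */
static bool mm_is_thread_local(uint64_t mm_cpumask, unsigned int this_cpu)
{
    return mm_cpumask == (UINT64_C(1) << this_cpu);
}

/* tlbiel only reaches the executing thread, so any other user forces tlbie. */
static enum flush_kind choose_flush(uint64_t mm_cpumask, unsigned int this_cpu)
{
    return mm_is_thread_local(mm_cpumask, this_cpu)
               ? FLUSH_LOCAL_TLBIEL
               : FLUSH_BROADCAST_TLBIE;
}
```

Under this rule, an mm that has also run on a sibling thread of the same core gets a broadcast invalidate, which is the case the old core-local test missed.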
    • powerpc/process: Fix CONFIG_ALIVEC typo in restore_tm_state() · 39715bf9
      Valentin Rothberg authored
      
      
      It should be ALTIVEC, not ALIVEC.
      
      Cyril explains: If a thread performs a transaction with altivec and then
      gets preempted for whatever reason, this bug may cause the kernel to not
      re-enable altivec when that thread runs again. This will result in an
      altivec unavailable fault; when that fault happens inside a user
      transaction, the kernel has no choice but to enable altivec and doom the
      transaction.
      
      The result is that transactions using altivec may get aborted more often
      than they should.
      
      The difficulty in catching this with a selftest is my deliberate use of
      the word may above. Optimisations to avoid FPU/altivec/VSX faults mean
      that the kernel will always leave them on for 255 switches. This code
      prevents the kernel turning it off if it got to the 256th switch (and
      userspace was transactional).
      
      Fixes: dc16b553 ("powerpc: Always restore FPU/VEC/VSX if hardware transactional memory in use")
      Reviewed-by: Cyril Bur <cyrilbur@gmail.com>
      Signed-off-by: Valentin Rothberg <valentinrothberg@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      39715bf9
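A misspelled configuration macro fails silently because the preprocessor treats an undefined name in `#ifdef` as false and emits no diagnostic. The two functions below are hypothetical and exist only to show the effect; they are not the kernel's restore_tm_state().

```c
#define CONFIG_ALTIVEC 1   /* the option that is actually set */

/* With the typo, the guarded branch is compiled out without any warning. */
static int altivec_restored_with_typo(void)
{
#ifdef CONFIG_ALIVEC       /* misspelled, so never defined */
    return 1;
#else
    return 0;
#endif
}

/* With the correct spelling, the guarded branch is compiled in. */
static int altivec_restored_fixed(void)
{
#ifdef CONFIG_ALTIVEC
    return 1;
#else
    return 0;
#endif
}
```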
  10. Oct 24, 2016
    • powerpc/64: Fix race condition in setting lock bit in idle/wakeup code · 09b7e37b
      Paul Mackerras authored
      
      
      This fixes a race condition where one thread that is entering or
      leaving a power-saving state can inadvertently ignore the lock bit
      that was set by another thread, and potentially also clear it.
      The core_idle_lock_held function is called when the lock bit is
      seen to be set.  It polls the lock bit until it is clear, then
      does a lwarx to load the word containing the lock bit and thread
      idle bits so it can be updated.  However, it is possible that the
      value loaded with the lwarx has the lock bit set, even though an
      immediately preceding lwz loaded a value with the lock bit clear.
      If this happens then we go ahead and update the word despite the
      lock bit being set, and when called from pnv_enter_arch207_idle_mode,
      we will subsequently clear the lock bit.
      
      No identifiable misbehaviour has been attributed to this race.
      
      This fixes it by checking the lock bit in the value loaded by the
      lwarx.  If it is set then we just go back and keep on polling.
      
      Fixes: b32aadc1 ("powerpc/powernv: Fix race in updating core_idle_state")
      Cc: stable@vger.kernel.org # v4.2+
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      09b7e37b
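The fix checks the lock bit in the same value that the update is based on, rather than in an earlier plain load. A C11 atomics model of that idea follows. The bit layout and function names are assumptions, and a compare-and-swap stands in for the lwarx/stwcx. pair used by the real assembly.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define CORE_IDLE_LOCK_BIT 0x100u   /* assumed position of the lock bit */

/*
 * One attempt to clear this thread's idle bit.
 * Returns false if the lock bit was set in the value the update would be
 * based on, or if another thread changed the word in the meantime.
 */
static bool try_clear_thread_bit(_Atomic uint32_t *state, uint32_t thread_bit)
{
    uint32_t old = atomic_load(state);      /* plays the role of lwarx */

    if (old & CORE_IDLE_LOCK_BIT)           /* the check added by the fix */
        return false;

    /* plays the role of stwcx.: succeeds only if the word is unchanged */
    return atomic_compare_exchange_strong(state, &old, old & ~thread_bit);
}

/* Keep polling until the update is applied with the lock bit clear. */
static void clear_thread_bit(_Atomic uint32_t *state, uint32_t thread_bit)
{
    while (!try_clear_thread_bit(state, thread_bit))
        ;
}
```

Because the lock test and the update use the same loaded value, a lock taken between an earlier poll and the load can no longer be ignored or cleared by accident.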
    • powerpc/64: Re-fix race condition between going idle and entering guest · 56c46222
      Paul Mackerras authored
      
      
      Commit 8117ac6a ("powerpc/powernv: Switch off MMU before entering
      nap/sleep/rvwinkle mode", 2014-12-10) fixed a race condition where one
      thread entering a KVM guest could switch the MMU context to the guest
      while another thread was still in host kernel context with the MMU on.
      That commit moved the point where a thread entering a power-saving
      mode set its kvm_hstate.hwthread_state field in its PACA to
      KVM_HWTHREAD_IN_IDLE from a point where the MMU was on to after the
      MMU had been switched off.  That commit also added a comment
      explaining that we have to switch to real mode before setting
      hwthread_state to avoid this race.
      
      Nevertheless, commit 4eae2c9a ("powerpc/powernv: Make
      pnv_powersave_common more generic", 2016-07-08) subsequently moved
      the setting of hwthread_state back to a point where the MMU is on,
      thus reintroducing the race, despite the comment saying that this
      should not be done being included in full in the context lines of
      the patch that did it.
      
      This fixes the race again and adds a bigger and shoutier comment
      explaining the potential race condition.
      
      Fixes: 4eae2c9a ("powerpc/powernv: Make pnv_powersave_common more generic")
      Cc: stable@vger.kernel.org # v4.8+
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: Shreyas B. Prabhu <shreyasbp@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      56c46222
  11. Oct 21, 2016