- Aug 18, 2017
-
-
Benjamin Herrenschmidt authored
There is no guarantee that the various isync's involved with the context switch will order the update of the CPU mask with the first TLB entry for the new context being loaded by the HW. Be safe here and add a memory barrier to order any subsequent load/store which may bring entries into the TLB. The corresponding barrier on the other side already exists as pte updates use pte_xchg() which uses __cmpxchg_u64 which has a sync after the atomic operation. Cc: stable@vger.kernel.org Signed-off-by:
Benjamin Herrenschmidt <benh@kernel.crashing.org> Reviewed-by:
Nicholas Piggin <npiggin@gmail.com> [mpe: Add comments in the code] Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au>
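The ordering requirement described above (the cpumask update must be visible before any later access can bring TLB entries in for the new context) can be illustrated with a small standalone C11 sketch. This is not the kernel's switch_mm code; the names and types here are hypothetical stand-ins, and the fence plays the role of the smp_mb()-style barrier the commit adds.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-ins for the mm cpumask and a page-table entry. */
static _Atomic uint64_t mm_cpumask_bits;
static uint64_t page_table_entry;

/* Context-switch side: publish "this CPU now uses this mm" before any
 * load/store that could pull translations for the new context into the
 * TLB.  The fence plays the role of the barrier added by the patch. */
uint64_t switch_to_new_mm(unsigned int cpu)
{
        atomic_fetch_or_explicit(&mm_cpumask_bits, 1ull << cpu,
                                 memory_order_relaxed);

        atomic_thread_fence(memory_order_seq_cst);

        /* Anything from here on may fault and bring in TLB entries; the
         * PTE-update side must observe the cpumask bit first. */
        return page_table_entry;
}
```

As the commit message notes, the matching barrier on the PTE-update side already exists via the sync in __cmpxchg_u64 used by pte_xchg().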
-
- Aug 16, 2017
-
-
Benjamin Herrenschmidt authored
VSX uses a combination of the old vector registers, the old FP registers and new "second halves" of the FP registers. Thus when we need to see the VSX state in the thread struct (flush_vsx_to_thread()) or when we'll use the VSX in the kernel (enable_kernel_vsx()) we need to ensure they are all flushed into the thread struct if either of them is individually enabled. Unfortunately we only tested if the whole VSX was enabled, not if they were individually enabled. Fixes: 72cd7b44 ("powerpc: Uncomment and make enable_kernel_vsx() routine available") Cc: stable@vger.kernel.org # v4.3+ Signed-off-by:
Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au>
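The bug above is a too-narrow flag test. A minimal standalone sketch of the shape of the fix, with illustrative (not real) MSR bit values:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative MSR facility bits (placeholders, not real bit positions). */
#define MSR_FP  (1u << 0)
#define MSR_VEC (1u << 1)
#define MSR_VSX (1u << 2)

/* Before: only the combined VSX facility was tested. */
static bool needs_flush_old(uint32_t msr)
{
        return (msr & MSR_VSX) != 0;
}

/* After: flush if *any* of the three facilities is enabled, since VSX
 * state is spread across the FP registers, the vector registers and the
 * "second halves" of the FP registers. */
static bool needs_flush_new(uint32_t msr)
{
        return (msr & (MSR_FP | MSR_VEC | MSR_VSX)) != 0;
}
```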
-
- Aug 09, 2017
-
-
Nicholas Piggin authored
When CPUs start and stop the watchdog, they manipulate shared data that is normally protected by the lock. Other CPUs can be running concurrently at this time, so it's a good idea to use locking here to be on the safe side. Remove the barrier which is undocumented and didn't do anything. Signed-off-by:
Nicholas Piggin <npiggin@gmail.com> Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
When the SMP detector finds other CPUs stuck, it iterates over them and marks them as stuck. This pulls them out of the pending mask and allows the detector to continue with remaining good CPUs (if nmi_watchdog=panic is not enabled). The code to do that was buggy: when setting a CPU stuck, if the pending mask became empty, it was reset to keep the watchdog running. However, the iterator would then continue to run over the new pending mask and mark the remaining good CPUs as stuck. Fix this by doing it with cpumask bitwise operations. Signed-off-by:
Nicholas Piggin <npiggin@gmail.com> Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au>
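The fix described above amounts to updating the pending mask with one bitwise operation instead of clearing bits inside the iteration. A simplified standalone illustration using a plain 64-bit mask in place of a struct cpumask (all names hypothetical):

```c
#include <stdint.h>

/* Simplified stand-ins: a 64-bit mask instead of a struct cpumask. */
static uint64_t wd_cpus_pending;              /* CPUs still expected to check in */
static uint64_t wd_cpus_enabled = ~0ull;      /* CPUs with the watchdog enabled  */

/* Mark a whole set of CPUs as stuck in one go.  Doing this with a single
 * bitwise update (the kernel fix uses cpumask operations for this) means
 * the caller never iterates over a pending mask that was just reset, which
 * is what caused good CPUs to be marked stuck as well. */
static void set_cpus_stuck(uint64_t stuck)
{
        wd_cpus_pending &= ~stuck;

        if (wd_cpus_pending == 0) {
                /* Everyone is accounted for: restart the check period with
                 * all enabled CPUs pending again. */
                wd_cpus_pending = wd_cpus_enabled;
        }
}
```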
-
Nicholas Piggin authored
When the watchdog decides to panic, it takes the lock and double checks everything (to avoid races with the CPU being unstuck or panic()ed by something else). The exit label was misplaced and would result in all-CPUs backtrace and watchdog panic even in the case that the condition was found to be resolved. Signed-off-by:
Nicholas Piggin <npiggin@gmail.com> Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au>
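The pattern being fixed is lock, double-check, and bail out before the panic path. A small standalone sketch of that shape, using a pthread mutex as a stand-in for the watchdog lock (all names hypothetical):

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t wd_lock = PTHREAD_MUTEX_INITIALIZER;
static bool cpu_still_stuck = true;

static void watchdog_maybe_panic(void)
{
        pthread_mutex_lock(&wd_lock);

        /* Double check under the lock: the CPU may have been unstuck, or
         * something else may have panicked, in the meantime. */
        if (!cpu_still_stuck)
                goto out;

        printf("triggering all-CPU backtrace and panic\n");
        /* backtrace + panic would go here */

out:
        /* With the label here, the "already resolved" case skips the
         * backtrace/panic path entirely, which is what the fix restores. */
        pthread_mutex_unlock(&wd_lock);
}
```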
-
Nicholas Piggin authored
Some code can go into a tight loop calling touch_nmi_watchdog (e.g., stop_machine CPU hotplug code). This can cause contention on watchdog locks particularly if all CPUs with watchdog enabled are spinning in the loops. Avoid this storm of activity by running the watchdog timer callback from this path if we have exceeded the timer period since it was last run. Signed-off-by:
Nicholas Piggin <npiggin@gmail.com> Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au>
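The mechanism is plain rate limiting: the touch path only runs the heavier timer-callback work once per timer period. A standalone sketch of the idea (names and time units are illustrative, not the kernel's):

```c
#include <stdint.h>
#include <time.h>

static uint64_t wd_period_ns = 1000000000ull;   /* illustrative period */
static uint64_t wd_last_run_ns;

static uint64_t now_ns(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* Stand-in for the periodic heartbeat work done by the watchdog timer. */
static void watchdog_timer_fn(void)
{
        /* touch per-CPU heartbeat, check other CPUs, etc. */
}

/* Called from tight loops (e.g. the stop_machine CPU hotplug path): only
 * run the timer callback if a full period has elapsed, instead of taking
 * the watchdog locks on every call. */
void touch_nmi_watchdog_sketch(void)
{
        uint64_t now = now_ns();

        if (now - wd_last_run_ns >= wd_period_ns) {
                wd_last_run_ns = now;
                watchdog_timer_fn();
        }
}
```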
-
Nicholas Piggin authored
- Hard-disable interrupts before taking the lock, which prevents soft-NMI re-entrancy and therefore can prevent deadlocks.
- Use raw_ variants of local_irq_disable to avoid irq debugging.
- When the lock is contended, spin at low SMT priority, using loads only, and with interrupts enabled (where possible).
Some stalls have been noticed at high loads that go away with improved locking. There should not be so much locking contention in the first place (which is addressed in a subsequent patch), but locking should still be improved. Signed-off-by:
Nicholas Piggin <npiggin@gmail.com> Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au>
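The contended-lock behaviour described in the last bullet can be sketched with C11 atomics: take the lock with interrupts hard-disabled, and while contended, spin with plain loads and interrupts enabled, only retrying the atomic when the lock looks free. The low-SMT-priority hint (HMT_low() on powerpc) has no portable equivalent, so it appears only as a comment; the irq helpers are hypothetical stubs:

```c
#include <stdatomic.h>

static atomic_int wd_lock;   /* 0 = free, 1 = held */

/* Hypothetical stand-ins for raw_local_irq_disable()/enable(). */
static void local_irq_disable_raw(void) { }
static void local_irq_enable_raw(void)  { }

void wd_lock_acquire(void)
{
        /* Hard-disable before taking the lock: prevents soft-NMI
         * re-entrancy and the deadlocks that come with it. */
        local_irq_disable_raw();

        while (atomic_exchange_explicit(&wd_lock, 1, memory_order_acquire)) {
                /* Contended: spin with loads only, at low SMT priority
                 * (HMT_low() on powerpc), with interrupts enabled where
                 * possible, and only retry the atomic when it looks free. */
                local_irq_enable_raw();
                while (atomic_load_explicit(&wd_lock, memory_order_relaxed))
                        /* cpu_relax() / low-priority hint would go here */;
                local_irq_disable_raw();
        }
        /* Lock held, interrupts hard-disabled. */
}

void wd_lock_release(void)
{
        atomic_store_explicit(&wd_lock, 0, memory_order_release);
        local_irq_enable_raw();
}
```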
-
Nicholas Piggin authored
When the NMI IPI lock is contended, spin at low SMT priority, using loads only, and with interrupts enabled (where possible). This improves behaviour under high contention (e.g., a system crash when a number of CPUs are trying to enter the debugger). Signed-off-by:
Nicholas Piggin <npiggin@gmail.com> Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au>
-
Michael Ellerman authored
In commit 05a4a952 ("kernel/watchdog: split up config options"), CONFIG_LOCKUP_DETECTOR was split into two separate config options, CONFIG_HARDLOCKUP_DETECTOR and CONFIG_SOFTLOCKUP_DETECTOR. Our defconfigs still have CONFIG_LOCKUP_DETECTOR=y, but that is no longer user selectable, and we don't mention the new options, so we end up with none of them enabled. So update the defconfigs to turn on the new SOFT and HARD options, the end result being the same as what we had previously. Fixes: 05a4a952 ("kernel/watchdog: split up config options") Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au>
-
- Aug 08, 2017
-
-
Gautham R. Shenoy authored
Currently, we use the opal call opal_slw_set_reg() to inform the Sleep-Winkle Engine (SLW) to restore the contents of some of the Hypervisor state on wakeup from deep idle states that lose full hypervisor context (characterized by the flag OPAL_PM_LOSE_FULL_CONTEXT). However, the current code has a bug in that if opal_slw_set_reg() fails, we don't disable the use of these deep states (winkle on POWER8, stop4 onwards on POWER9). This patch fixes this bug by ensuring that if programming the sleep-winkle engine to restore the hypervisor states in pnv_save_sprs_for_deep_states() fails, then we exclude such states by clearing the OPAL_PM_LOSE_FULL_CONTEXT flag from supported_cpuidle_states. As a result POWER8 will be prevented from using winkle for CPU-Hotplug, and POWER9 will put the offlined CPUs to the default stop state when available. Further, we ensure that the initialization of the cpuidle-powernv driver only includes those states whose flags are present in supported_cpuidle_states, thereby skipping OPAL_PM_LOSE_FULL_CONTEXT states when they have been disabled due to stop-api failure. Fixes: 1e1601b3 ("powerpc/powernv/idle: Restore SPRs for deep idle states via stop API.") Signed-off-by:
Gautham R. Shenoy <ego@linux.vnet.ibm.com> Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au>
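The core of the fix is clearing one capability flag when the stop-api programming fails. A hedged standalone sketch of that shape (the flag value and helper names are illustrative, not the real OPAL definitions):

```c
#include <stdint.h>

/* Illustrative placeholder; the real flag comes from the OPAL API headers. */
#define OPAL_PM_LOSE_FULL_CONTEXT  (1u << 4)

static uint32_t supported_cpuidle_states;

/* Hypothetical stand-in for pnv_save_sprs_for_deep_states() /
 * opal_slw_set_reg(): returns non-zero on failure. */
static int save_sprs_for_deep_states(void)
{
        return -1;
}

void init_deep_idle_states(void)
{
        if (save_sprs_for_deep_states() != 0) {
                /* The SLW engine can't restore hypervisor state on wakeup,
                 * so drop every state that loses full hypervisor context
                 * (winkle on POWER8, stop4 onwards on POWER9). */
                supported_cpuidle_states &= ~OPAL_PM_LOSE_FULL_CONTEXT;
        }
}
```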
-
- Aug 07, 2017
-
-
Michael Ellerman authored
This reverts commit bc4f65e4. As reported by Andreas, this commit is causing unrecoverable SLB misses in the system call exit path: Unrecoverable exception 4100 at c00000000000a1ec Oops: Unrecoverable exception, sig: 6 [#1] SMP NR_CPUS=2 PowerMac ... CPU: 0 PID: 18626 Comm: rm Not tainted 4.13.0-rc3 #1 task: c00000018335e080 task.stack: c000000139e50000 NIP: c00000000000a1ec LR: c00000000000a118 CTR: 0000000000000000 REGS: c000000139e53bb0 TRAP: 4100 Not tainted (4.13.0-rc3) MSR: 9000000000001030 <SF,HV,ME,IR,DR> CR: 24000044 XER: 20000000 SOFTE: 1 GPR00: 0000000000000000 c000000139e53e30 c000000000abb500 fffffffffffffffe GPR04: c0000001eb866298 0000000000000000 0000000000000000 c00000018335e080 GPR08: 900000000000d032 0000000000000000 0000000000000002 fffffffffffff001 GPR12: c000000139e50000 c00000000ffff000 00003fffa8c0dca0 00003fffa8c0dc88 GPR16: 0000000010000000 0000000000000001 00003fffa8c0eaa0 0000000000000000 GPR20: 00003fffa8c27528 00003fffa8c27b00 0000000000000000 0000000000000000 GPR24: 00003fffa8c0d918 00003ffff1b3efa0 00003fffa8c26d68 0000000000000000 GPR28: 00003fffa8c249e8 00003fffa8c263d0 00003fffa8c27550 00003ffff1b3ef10 NIP [c00000000000a1ec] system_call_exit+0xc0/0x21c LR [c00000000000a118] system_call+0x58/0x6c Call Trace: [c000000139e53e30] [c00000000000a118] system_call+0x58/0x6c (unreliable) Instruction dump: 64a51000 7c6300d0 f8a101a0 4bffff9c 3c000000 60000006 780007c6 64000000 60000000 7c004039 4082001c e8ed0170 <88070b78> 88c70b79 7c003214 2c200000 This is caused by us trying to load THREAD_LOAD_FP with MSR_RI=0, and taking an SLB miss on the thread struct. Reported-by:
Andreas Schwab <schwab@linux-m68k.org> Diagnosed-by:
Nicholas Piggin <npiggin@gmail.com> Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au>
-
- Aug 04, 2017
-
-
Nicholas Piggin authored
If the decrementer wraps again and de-asserts the decrementer exception while hard-disabled, __check_irq_replay() has a test to notice the wrap when interrupts are re-enabled. The decrementer check must be done when clearing the PACA_IRQ_HARD_DIS flag, not when the PACA_IRQ_DEC flag is tested. Previously this worked because the decrementer interrupt was always the first one checked after clearing the hard disable flag, but HMI check was moved ahead of that, which introduced this bug. This can cause a missed decrementer interrupt if we soft-disable interrupts then take an HMI which is recorded in irq_happened, then hard-disable interrupts for > 4s to wrap the decrementer. Fixes: e0e0d6b7 ("powerpc/64: Replay hypervisor maintenance interrupt first") Cc: stable@vger.kernel.org # v4.9+ Signed-off-by:
Nicholas Piggin <npiggin@gmail.com> Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
POWER9 DD2 PMU can stop after a state-loss idle in some conditions. A solution is to set then clear MMCRA[60] after wake from state-loss idle. MMCRA[60] is a non-architected bit, see the user manual for details. Signed-off-by:
Nicholas Piggin <npiggin@gmail.com> Acked-by:
Madhavan Srinivasan <maddy@linux.vnet.ibm.com> Reviewed-by:
Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Acked-by:
Anton Blanchard <anton@samba.org> Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au>
-
- Aug 02, 2017
-
-
Sergei Shtylyov authored
of_irq_to_resource() has recently been fixed to return negative error numbers as well as 0 in case of failure; however, the Freescale MPC832x RDB board code still only regards 0 as a failure indication -- fix it up. Fixes: 7a4228bb ("of: irq: use of_irq_get() in of_irq_to_resource()") Signed-off-by:
Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Acked-by:
Scott Wood <oss@buserror.net> Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au>
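The change is only in how the return value is tested: with of_irq_to_resource() now returning negative errnos as well as 0 on failure, only a positive value is a usable IRQ. A hedged sketch of the corrected call-site pattern (not the literal board-file diff; the wrapper function is hypothetical):

```c
#include <linux/errno.h>
#include <linux/of_irq.h>

/* Only a positive return value from of_irq_to_resource() is a usable
 * IRQ number; 0 or a negative errno means failure. */
static int probe_board_irq(struct device_node *np)
{
        int irq = of_irq_to_resource(np, 0, NULL);

        if (irq <= 0)                   /* was: if (!irq) -- missed errors */
                return irq ? irq : -EINVAL;

        return irq;
}
```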
-
- Jul 31, 2017
-
-
Nicholas Piggin authored
The watchdog soft-NMI exception stack setup loads a stack pointer twice, which is an obvious error. It ends up using the system reset interrupt (true-NMI) stack, which is also a bug because the watchdog could be preempted by a system reset interrupt that overwrites the NMI stack. Change the soft-NMI to use the "emergency stack". The current kernel stack is not used, because of the longer-term goal to prevent asynchronous stack access using soft-disable. Fixes: 2104180a ("powerpc/64s: implement arch-specific hardlockup watchdog") Signed-off-by:
Nicholas Piggin <npiggin@gmail.com> Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au>
-
- Jul 28, 2017
-
-
Alistair Popple authored
Commit 8e3f1b1d ("powerpc/powernv/pci: Enable 64-bit devices to access >4GB DMA space") introduced the ability for PCI device drivers to request a DMA mask between 64 and 32 bits and actually get a mask greater than 32 bits. However, currently if certain machine-configuration-dependent conditions are not met, the code silently falls back to a 32-bit mask. This makes it hard for device drivers to detect which mask they actually got. Instead we should return an error when the request could not be fulfilled, which allows drivers to either fall back or implement other workarounds as documented in DMA-API-HOWTO.txt. Signed-off-by:
Alistair Popple <alistair@popple.id.au> Acked-by:
Russell Currey <ruscur@russell.cc> Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au>
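With an error returned instead of a silent 32-bit fallback, drivers can implement the fallback themselves, as DMA-API-HOWTO.txt describes. A hedged sketch of that driver-side pattern using the generic DMA API (not the powernv internals; the helper name is hypothetical):

```c
#include <linux/dma-mapping.h>

/* Try for a 64-bit DMA mask and fall back to 32-bit explicitly if the
 * platform cannot grant it, instead of relying on a silent downgrade. */
static int setup_dma_mask(struct device *dev)
{
        int rc = dma_set_mask_and_coherent(dev, DMA_BIT_MASK(64));

        if (rc) {
                /* >32-bit DMA not available on this configuration:
                 * fall back, or arrange bounce buffers, etc. */
                rc = dma_set_mask_and_coherent(dev, DMA_BIT_MASK(32));
        }
        return rc;
}
```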
-
Michael Ellerman authored
Historically the boot wrapper was always built 32-bit big endian, even for 64-bit kernels. That was because old firmwares didn't necessarily support booting a 64-bit image. Because of that arch/powerpc/boot/Makefile uses CROSS32CC for compilation. However when we added 64-bit little endian support, we also added support for building the boot wrapper 64-bit. However we kept using CROSS32CC, because in most cases it is just CC and everything works. However if the user doesn't specify CROSS32_COMPILE (which no one ever does AFAIK), and CC is *not* biarch (32/64-bit capable), then CROSS32CC becomes just "gcc". On native systems that is probably OK, but if we're cross building it definitely isn't, leading to eg: gcc ... -m64 -mlittle-endian -mabi=elfv2 ... arch/powerpc/boot/cpm-serial.c gcc: error: unrecognized argument in option ‘-mabi=elfv2’ gcc: error: unrecognized command line option ‘-mlittle-endian’ make: *** [zImage] Error 2 To fix it, stop using CROSS32CC, because we may or may not be building 32-bit. Instead setup a BOOTCC, which defaults to CC, and only use CROSS32_COMPILE if it's set and we're building for 32-bit. Fixes: 147c0516 ("powerpc/boot: Add support for 64bit little endian wrapper") Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au> Reviewed-by:
Cyril Bur <cyrilbur@gmail.com>
-
Michael Ellerman authored
In smp_cpus_done() we need to call smp_ops->setup_cpu() for the boot CPU, which means it has to run *on* the boot CPU. In the past we ensured it ran on the boot CPU by changing the CPU affinity mask of current directly. That was removed in commit 6d11b87d ("powerpc/smp: Replace open coded task affinity logic"), and replaced with a work queue call. Unfortunately using a work queue leads to a lockdep warning, now that the CPU hotplug lock is a regular semaphore: ====================================================== WARNING: possible circular locking dependency detected ... kworker/0:1/971 is trying to acquire lock: (cpu_hotplug_lock.rw_sem){++++++}, at: [<c000000000100974>] apply_workqueue_attrs+0x34/0xa0 but task is already holding lock: ((&wfc.work)){+.+.+.}, at: [<c0000000000fdb2c>] process_one_work+0x25c/0x800 ... CPU0 CPU1 ---- ---- lock((&wfc.work)); lock(cpu_hotplug_lock.rw_sem); lock((&wfc.work)); lock(cpu_hotplug_lock.rw_sem); Although the deadlock can't happen in practice, because smp_cpus_done() only runs in early boot before CPU hotplug is allowed, lockdep can't tell that. Luckily in commit 8fb12156 ("init: Pin init task to the boot CPU, initially") tglx changed the generic code to pin init to the boot CPU to begin with. The unpinning of init from the boot CPU happens in sched_init_smp(), which is called after smp_cpus_done(). So smp_cpus_done() is always called on the boot CPU, which means we don't need the work queue call at all - and the lockdep warning goes away. Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au> Reviewed-by:
Thomas Gleixner <tglx@linutronix.de>
-
Gustavo Romero authored
Currently flush_tmregs_to_thread() does not save the TM SPRs (TFHAR, TFIAR, TEXASR) to the thread struct, unless the process is currently inside a suspended transaction. If the process is core dumping, and the TM SPRs have changed since the last time the process was context switched, then we will save stale values of the TM SPRs to the core dump. Fix it by saving the live register state to the thread struct in that case. Fixes: 08e1c01d ("powerpc/ptrace: Enable support for TM SPR state") Cc: stable@vger.kernel.org # v4.8+ Signed-off-by:
Gustavo Romero <gromero@linux.vnet.ibm.com> Reviewed-by:
Cyril Bur <cyrilbur@gmail.com> Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au>
-
Oliver O'Halloran authored
The Radix MMU translation tree as defined in ISA v3.0 contains two different types of entry, directories and leaves. Leaves are identified by _PAGE_PTE being set. The formats of the two entries are different, with the directory entries containing no spare bits for use by software. In particular the bit we use for _PAGE_DEVMAP is not reserved for software, and is part of the NLB (Next Level Base) field, essentially the address of the next level in the tree. Note that the Linux pte_t is not == _PAGE_PTE. A huge page pmd entry (or devmap!) is also a leaf and so has _PAGE_PTE set, even though we use a pmd_t for it in Linux. The fix is to ensure that the pmd/pte_devmap() confirm they are looking at a leaf entry (_PAGE_PTE) as well as checking _PAGE_DEVMAP. Fixes: ebd31197 ("powerpc/mm: Add devmap support for ppc64") Signed-off-by:
Oliver O'Halloran <oohall@gmail.com> Tested-by:
Laurent Vivier <lvivier@redhat.com> Tested-by:
Jose Ricardo Ziviani <joserz@linux.vnet.ibm.com> Reviewed-by:
Suraj Jitindar Singh <sjitindarsingh@gmail.com> Reviewed-by:
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [mpe: Add a comment in the code and flesh out change log] Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au>
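The fix requires both bits to be set before an entry is treated as devmap. A standalone sketch of that check with illustrative bit values (the real masks, and the big-endian raw-PTE handling, live in the powerpc book3s64 page table headers):

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions; the real values are defined by the
 * powerpc book3s64 page table headers. */
#define _PAGE_PTE     (1ull << 62)
#define _PAGE_DEVMAP  (1ull << 61)

/* Only a leaf entry (_PAGE_PTE set) can be a devmap mapping; in a
 * directory entry the same bit position is part of the next-level
 * address, so testing _PAGE_DEVMAP alone gives false positives. */
static bool pte_devmap_sketch(uint64_t pte_val)
{
        const uint64_t mask = _PAGE_DEVMAP | _PAGE_PTE;

        return (pte_val & mask) == mask;
}
```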
-
- Jul 27, 2017
-
-
Aneesh Kumar K.V authored
Fixes: dad6f37c ("powerpc: subpage_protect: Increase the array size to take care of 64TB") Signed-off-by:
Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Tested-by:
Ram Pai <linuxram@us.ibm.com> Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au>
-
- Jul 26, 2017
-
-
Michael Ellerman authored
In commit efe0160c ("powerpc/64: Linker on-demand sfpr functions for modules"), we added an ld version check early in the powerpc top-level Makefile. Because the Makefile runs before the kernel config is setup, the checks for CONFIG_CPU_LITTLE_ENDIAN etc. all take the default case. So we end up configuring ld for 32-bit big endian. That would be OK, except that for historical (or perhaps no) reason, we use 'override LD' to add the endian flags to the LD variable itself, rather than the normal approach of adding them to LDFLAGS. The end result is that when we check the ld version we run it as: $(CROSS_COMPILE)ld -EB -m elf32ppc --version This often works, unless you are using a 64-bit only and/or little endian only, toolchain. In which case you see something like: $ make defconfig powerpc64le-linux-ld: unrecognised emulation mode: elf32ppc Supported emulations: elf64lppc elf32lppc elf32lppclinux elf32lppcsim /bin/sh: 1: [: -ge: unexpected operator The proper fix is to stop using 'override LD', but that will require a fair bit of testing. Instead we can fix it for now just by reordering the Makefile to do the version check earlier. Fixes: efe0160c ("powerpc/64: Linker on-demand sfpr functions for modules") Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au>
-
Laurent Vivier authored
As for commit 68baf692 ("powerpc/pseries: Fix of_node_put() underflow during DLPAR remove"), the call to of_node_put() must be removed from pSeries_reconfig_remove_node(). dlpar_detach_node() and pSeries_reconfig_remove_node() both call of_detach_node(), and thus the node should not be released in both cases. Fixes: 0829f6d1 ("of: device_node kobject lifecycle fixes") Cc: stable@vger.kernel.org # v3.15+ Signed-off-by:
Laurent Vivier <lvivier@redhat.com> Reviewed-by:
David Gibson <david@gibson.dropbear.id.au> Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au>
-
Benjamin Herrenschmidt authored
There's a somewhat architectural issue with Radix MMU and KVM. When coming out of a guest with AIL (Alternate Interrupt Location, ie, MMU enabled), we start executing hypervisor code with the PID register still containing whatever the guest has been using. The problem is that the CPU can (and will) then start prefetching or speculatively loading from whatever host context has that same PID (if any), thus bringing translations for that context into the TLB, which Linux doesn't know about. This can cause stale translations and subsequent crashes. Fixing this in a way that is neither racy nor a huge performance impact is difficult. We could just make the host invalidations always use broadcast forms, but that would hurt single-threaded programs, for example. We chose to fix it instead by partitioning the PID space between guest and host. This is possible because today Linux only uses 19 out of the 20 bits of PID space, so existing guests will work if we make the host use the top half of the 20-bit space. We additionally add support for a property to indicate to Linux the size of the PID register, which will be useful if we eventually have processors with a larger PID space available. There is still an issue with malicious guests purposefully setting the PID register to a value in the host's PID range. Hopefully future HW can prevent that, but in the meantime, we handle it with a pair of kludges:
- On the way out of a guest, before we clear the current VCPU in the PACA, we check the PID and if it's outside of the permitted range we flush the TLB for that PID.
- When context switching, if the mm is "new" on that CPU (the corresponding bit was set for the first time in the mm cpumask), we check if any sibling thread is in KVM (has a non-NULL VCPU pointer in the PACA). If that is the case, we also flush the PID for that CPU (core).
This second part is needed to handle the case where a process is migrated (or starts a new pthread) on a sibling thread of the CPU coming out of KVM, as there's a window where stale translations can exist before we detect it and flush them out. A future optimization could be added by keeping track of whether the PID has ever been used and avoiding doing that for completely fresh PIDs. We could similarly mark PIDs that have been the subject of a global invalidation as "fresh". But for now this will do. Signed-off-by:
Benjamin Herrenschmidt <benh@kernel.crashing.org> [mpe: Rework the asm to build with CONFIG_PPC_RADIX_MMU=n, drop unneeded include of kvm_book3s_asm.h] Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au>
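The partitioning itself is simple arithmetic over the PID space: with a 20-bit PID register of which Linux uses 19 bits, the host takes the top half and guests keep the bottom half. A standalone sketch of that split, following the numbers in the message (function and variable names are hypothetical, not the kernel's):

```c
#include <stdint.h>

/* 20-bit PID register, per the commit message; Linux itself only needs
 * 19 bits of that space. */
static unsigned int mmu_pid_bits = 20;

/* Host allocations start at the top half of the PID space, leaving the
 * bottom half to guests so existing guests keep working unchanged. */
static inline uint32_t host_pid_base(void)
{
        return 1u << (mmu_pid_bits - 1);      /* 0x80000 for 20 bits */
}

static inline uint32_t host_pid_max(void)
{
        return (1u << mmu_pid_bits) - 1;      /* 0xFFFFF for 20 bits */
}

/* Anything a guest leaves in the PID register that falls into the host
 * range is treated as suspect and flushed on the way out of the guest,
 * as described in the commit message. */
static inline int pid_is_host_range(uint32_t pid)
{
        return pid >= host_pid_base();
}
```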
-
- Jul 18, 2017
-
-
Michael Ellerman authored
Currently even with STRICT_KERNEL_RWX we leave the __init text marked executable after init, which is bad. Add a hook to mark it NX (no-execute) before we free it, and implement it for radix and hash. Note that we use __init_end as the end address, not _einittext, because overlaps_kernel_text() uses __init_end, because there are additional executable sections other than .init.text between __init_begin and __init_end. Tested on radix and hash with: 0:mon> p $__init_begin *** 400 exception occurred Fixes: 1e0fc9d1 ("powerpc/Kconfig: Enable STRICT_KERNEL_RWX for some configs") Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au>
-
Michael Ellerman authored
Move the core logic into a helper, so we can use it for changing other permissions. We also change the logic to align start down, and end up. This means calling the function with a range will expand that range to be at least 1 mmu_linear_psize page in size. We need that so we can use it on __init_begin ... __init_end which is not a full page in size. This should always work for _stext/__init_begin, because we align __init_begin to _stext + 16M in the linker script. Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au> Reviewed-by:
Balbir Singh <bsingharora@gmail.com> Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au>
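"Align start down, and end up" is the usual round-to-boundary trick; with the linear-mapping page size as the step, even a sub-page range such as __init_begin..__init_end expands to at least one full page. A minimal standalone sketch:

```c
#include <stdint.h>

/* Round addr down / up to a power-of-two boundary (step must be a power
 * of two, e.g. the linear-mapping page size). */
static inline uint64_t align_down(uint64_t addr, uint64_t step)
{
        return addr & ~(step - 1);
}

static inline uint64_t align_up(uint64_t addr, uint64_t step)
{
        return (addr + step - 1) & ~(step - 1);
}

/* A range smaller than one linear-mapping page still covers at least one
 * full page after alignment, which is what lets the helper be used on a
 * range like __init_begin .. __init_end. */
static void expand_range(uint64_t *start, uint64_t *end, uint64_t step)
{
        *start = align_down(*start, step);
        *end   = align_up(*end, step);
}
```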
-
Michael Ellerman authored
Move the core logic into a helper, so we can use it for changing permissions other than _PAGE_WRITE. Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au> Reviewed-by:
Balbir Singh <bsingharora@gmail.com> Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
A previous optimisation incorrectly assumed the PAPR hcall does not use r12, and clobbers it upon entry. In fact it is used as an input. This can result in KVM guests crashing (observed with PR KVM). Instead of using r12 to save r13, this patch saves r13 in CTR. This is more costly, but not as slow as using the SPRG. Fixes: acd7d8ce ("powerpc/64s: Optimize hypercall/syscall entry") Signed-off-by:
Nicholas Piggin <npiggin@gmail.com> Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au>
-
Nicholas Piggin authored
POWER9 DD2 can see spurious PMU interrupts after state-loss idle in some conditions. A solution is to save and reload MMCR0 over state-loss idle. Signed-off-by:
Nicholas Piggin <npiggin@gmail.com> Acked-by:
Madhavan Srinivasan <maddy@linux.vnet.ibm.com> Tested-by:
Anton Blanchard <anton@samba.org> Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au>
-
- Jul 17, 2017
-
-
Michael Ellerman authored
In commit 1c0eaf0f ("powerpc/powernv: Tell OPAL about our MMU mode on POWER9"), we added additional flags to the OPAL call to configure CPUs at boot. These flags only work on Power9 firmwares, and worse can cause boot failures on Power8 machines, so we check for CPU_FTR_ARCH_300 (aka POWER9) before adding the extra flags. Unfortunately we forgot that opal_configure_cores() is called before the CPU feature checks are dynamically patched, meaning the check always returns true. We definitely need to do something to make the CPU feature checks less prone to bugs like this, but for now the minimal fix is to use early_cpu_has_feature(). Reported-and-tested-by:
Abdul Haleem <abdhalee@linux.vnet.ibm.com> Fixes: 1c0eaf0f ("powerpc/powernv: Tell OPAL about our MMU mode on POWER9") Signed-off-by:
Michael Ellerman <mpe@ellerman.id.au>
-
- Jul 14, 2017
-
-
Tobias Klauser authored
The arch uses a verbatim copy of the asm-generic version and does not add any own implementations to the header, so use asm-generic/fb.h instead of duplicating code. Link: http://lkml.kernel.org/r/20170517083545.2115-1-tklauser@distanz.ch Signed-off-by:
Tobias Klauser <tklauser@distanz.ch> Acked-by:
Max Filippov <jcmvbkbc@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Roman Kagan authored
Hyper-V identifies vCPUs by Virtual Processor Index, which can be queried via HV_X64_MSR_VP_INDEX msr. It is defined by the spec as a sequential number which can't exceed the maximum number of vCPUs per VM. APIC ids can be sparse and thus aren't a valid replacement for VP indices. Current KVM uses its internal vcpu index as VP_INDEX. However, to make it predictable and persistent across VM migrations, the userspace has to control the value of VP_INDEX. This patch achieves that, by storing vp_index explicitly on vcpu, and allowing HV_X64_MSR_VP_INDEX to be set from the host side. For compatibility it's initialized to KVM vcpu index. Also a few variables are renamed to make clear distinction between this Hyper-V vp_index and KVM vcpu_id (== APIC id). Besides, a new capability, KVM_CAP_HYPERV_VP_INDEX, is added to allow the userspace to skip attempting msr writes where unsupported, to avoid spamming error logs. Signed-off-by:
Roman Kagan <rkagan@virtuozzo.com> Signed-off-by:
Radim Krčmář <rkrcmar@redhat.com>
-
Wanpeng Li authored
Adds another flag bit (bit 2) to MSR_KVM_ASYNC_PF_EN. If bit 2 is 1, async page faults are delivered to L1 as #PF vmexits; if bit 2 is 0, kvm_can_do_async_pf returns 0 if in guest mode. This is similar to what svm.c wanted to do all along, but it is only enabled for Linux as L1 hypervisor. Foreign hypervisors must never receive async page faults as vmexits, because they'd probably be very confused about that. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by:
Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by:
Radim Krčmář <rkrcmar@redhat.com>
-
Wanpeng Li authored
Add a nested_apf field to vcpu->arch.exception to identify an async page fault, and construct the expected vm-exit information fields. Force a nested VM exit from nested_vmx_check_exception() if the injected #PF is an async page fault. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by:
Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by:
Radim Krčmář <rkrcmar@redhat.com>
-
Wanpeng Li authored
This patch adds the L1 guest async page fault #PF vmexit handler; such a #PF is converted into a vmexit from L2 to L1, which L1 then handles similarly to an ordinary async page fault. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by:
Wanpeng Li <wanpeng.li@hotmail.com> [Passed insn parameters to kvm_mmu_page_fault().] Signed-off-by:
Radim Krčmář <rkrcmar@redhat.com>
-
Wanpeng Li authored
This patch removes all arguments except the first in kvm_x86_ops->queue_exception since they can extract the arguments from vcpu->arch.exception themselves. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by:
Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by:
Radim Krčmář <rkrcmar@redhat.com>
-
- Jul 13, 2017
-
-
Roman Kagan authored
There is a flaw in the Hyper-V SynIC implementation in KVM: when message page or event flags page is enabled by setting the corresponding msr, KVM zeroes it out. This is problematic because on migration the corresponding MSRs are loaded on the destination, so the content of those pages is lost. This went unnoticed so far because the only user of those pages was in-KVM hyperv synic timers, which could continue working despite that zeroing. Newer QEMU uses those pages for Hyper-V VMBus implementation, and zeroing them breaks the migration. Besides, in newer QEMU the content of those pages is fully managed by QEMU, so zeroing them is undesirable even when writing the MSRs from the guest side. To support this new scheme, introduce a new capability, KVM_CAP_HYPERV_SYNIC2, which, when enabled, makes sure that the synic pages aren't zeroed out in KVM. Signed-off-by:
Roman Kagan <rkagan@virtuozzo.com> Signed-off-by:
Radim Krčmář <rkrcmar@redhat.com>
-
Ladi Prosek authored
The backwards_tsc_observed global introduced in commit 16a96021 is never reset to false. If a VM happens to be running while the host is suspended (a common source of the TSC jumping backwards), master clock will never be enabled again for any VM. In contrast, if no VM is running while the host is suspended, master clock is unaffected. This is inconsistent and unnecessarily strict. Let's track the backwards_tsc_observed variable separately and let each VM start with a clean slate. Real world impact: My Windows VMs get slower after my laptop undergoes a suspend/resume cycle. The only way to get the perf back is unloading and reloading the kvm module. Signed-off-by:
Ladi Prosek <lprosek@redhat.com> Signed-off-by:
Radim Krčmář <rkrcmar@redhat.com>
-
- Jul 12, 2017
-
-
Joe Perches authored
Make the code like the rest of the kernel. Link: http://lkml.kernel.org/r/1cd3d401626e51ea0e2333a860e76e80bc560a4c.1499284835.git.joe@perches.com Signed-off-by:
Joe Perches <joe@perches.com> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Joe Perches authored
Make the code like the rest of the kernel. Link: http://lkml.kernel.org/r/f81bb2a67a97b1fd8b6ea99bd350d8a0f6864fb1.1499284835.git.joe@perches.com Signed-off-by:
Joe Perches <joe@perches.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-