Skip to content
  1. Jul 18, 2015
    • Dave Hansen's avatar
      x86/fpu, sched: Dynamically allocate 'struct fpu' · 0c8c0f03
      Dave Hansen authored
      
      
      The FPU rewrite removed the dynamic allocations of 'struct fpu'.
      But, this potentially wastes massive amounts of memory (2k per
      task on systems that do not have AVX-512 for instance).
      
      Instead of having a separate slab, this patch just appends the
      space that we need to the 'task_struct' which we dynamically
      allocate already.  This saves from doing an extra slab
      allocation at fork().
      
      The only real downside here is that we have to stick everything
      and the end of the task_struct.  But, I think the
      BUILD_BUG_ON()s I stuck in there should keep that from being too
      fragile.
      
      Signed-off-by: default avatarDave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1437128892-9831-2-git-send-email-mingo@kernel.org
      
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      0c8c0f03
  2. Jul 17, 2015
    • Andy Lutomirski's avatar
      x86/entry/64, x86/nmi/64: Add CONFIG_DEBUG_ENTRY NMI testing code · a97439aa
      Andy Lutomirski authored
      
      
      It turns out to be rather tedious to test the NMI nesting code.
      Make it easier: add a new CONFIG_DEBUG_ENTRY option that causes
      the NMI handler to pre-emptively unmask NMIs.
      
      With this option set, errors in the repeat_nmi logic or failures
      to detect that we're in a nested NMI will result in quick panics
      under perf (especially if multiple counters are running at high
      frequency) instead of requiring an unusual workload that
      generates page faults or breakpoints inside NMIs.
      
      I called it CONFIG_DEBUG_ENTRY instead of CONFIG_DEBUG_NMI_ENTRY
      because I want to add new non-NMI checks elsewhere in the entry
      code in the future, and I'd rather not add too many new config
      options or add this option and then immediately rename it.
      
      Signed-off-by: default avatarAndy Lutomirski <luto@kernel.org>
      Reviewed-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      a97439aa
    • Andy Lutomirski's avatar
      x86/nmi/64: Make the "NMI executing" variable more consistent · 36f1a77b
      Andy Lutomirski authored
      
      
      Currently, "NMI executing" is one the first time an outermost
      NMI hits repeat_nmi and zero thereafter.  Change it to be zero
      each time for consistency.
      
      This is intended to help NMI handling fail harder if it's buggy.
      
      Signed-off-by: default avatarAndy Lutomirski <luto@kernel.org>
      Reviewed-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      36f1a77b
    • Andy Lutomirski's avatar
      x86/nmi/64: Minor asm simplification · 23a781e9
      Andy Lutomirski authored
      
      
      Replace LEA; MOV with an equivalent SUB.  This saves one
      instruction.
      
      Signed-off-by: default avatarAndy Lutomirski <luto@kernel.org>
      Reviewed-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      23a781e9
    • Andy Lutomirski's avatar
      x86/nmi/64: Use DF to avoid userspace RSP confusing nested NMI detection · 810bc075
      Andy Lutomirski authored
      
      
      We have a tricky bug in the nested NMI code: if we see RSP
      pointing to the NMI stack on NMI entry from kernel mode, we
      assume that we are executing a nested NMI.
      
      This isn't quite true.  A malicious userspace program can point
      RSP at the NMI stack, issue SYSCALL, and arrange for an NMI to
      happen while RSP is still pointing at the NMI stack.
      
      Fix it with a sneaky trick.  Set DF in the region of code that
      the RSP check is intended to detect.  IRET will clear DF
      atomically.
      
      ( Note: other than paravirt, there's little need for all this
        complexity. We could check RIP instead of RSP. )
      
      Signed-off-by: default avatarAndy Lutomirski <luto@kernel.org>
      Reviewed-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      810bc075
    • Andy Lutomirski's avatar
      x86/nmi/64: Reorder nested NMI checks · a27507ca
      Andy Lutomirski authored
      
      
      Check the repeat_nmi .. end_repeat_nmi special case first.  The
      next patch will rework the RSP check and, as a side effect, the
      RSP check will no longer detect repeat_nmi .. end_repeat_nmi, so
      we'll need this ordering of the checks.
      
      Note: this is more subtle than it appears.  The check for
      repeat_nmi .. end_repeat_nmi jumps straight out of the NMI code
      instead of adjusting the "iret" frame to force a repeat.  This
      is necessary, because the code between repeat_nmi and
      end_repeat_nmi sets "NMI executing" and then writes to the
      "iret" frame itself.  If a nested NMI comes in and modifies the
      "iret" frame while repeat_nmi is also modifying it, we'll end up
      with garbage.  The old code got this right, as does the new
      code, but the new code is a bit more explicit.
      
      If we were to move the check right after the "NMI executing"
      check, then we'd get it wrong and have random crashes.
      
      ( Because the "NMI executing" check would jump to the code that would
        modify the "iret" frame without checking if the interrupted NMI was
        currently modifying it. )
      
      Signed-off-by: default avatarAndy Lutomirski <luto@kernel.org>
      Reviewed-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      a27507ca
    • Andy Lutomirski's avatar
      x86/nmi/64: Improve nested NMI comments · 0b22930e
      Andy Lutomirski authored
      
      
      I found the nested NMI documentation to be difficult to follow.
      Improve the comments.
      
      Signed-off-by: default avatarAndy Lutomirski <luto@kernel.org>
      Reviewed-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      0b22930e
    • Andy Lutomirski's avatar
      x86/nmi/64: Switch stacks on userspace NMI entry · 9b6e6a83
      Andy Lutomirski authored
      
      
      Returning to userspace is tricky: IRET can fail, and ESPFIX can
      rearrange the stack prior to IRET.
      
      The NMI nesting fixup relies on a precise stack layout and
      atomic IRET.  Rather than trying to teach the NMI nesting fixup
      to handle ESPFIX and failed IRET, punt: run NMIs that came from
      user mode on the normal kernel stack.
      
      This will make some nested NMIs visible to C code, but the C
      code is okay with that.
      
      As a side effect, this should speed up perf: it eliminates an
      RDMSR when NMIs come from user mode.
      
      Signed-off-by: default avatarAndy Lutomirski <luto@kernel.org>
      Reviewed-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Reviewed-by: default avatarBorislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      9b6e6a83
    • Andy Lutomirski's avatar
      x86/nmi/64: Remove asm code that saves CR2 · 0e181bb5
      Andy Lutomirski authored
      
      
      Now that do_nmi saves CR2, we don't need to save it in asm.
      
      Signed-off-by: default avatarAndy Lutomirski <luto@kernel.org>
      Reviewed-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Acked-by: default avatarBorislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      0e181bb5
    • Andy Lutomirski's avatar
      x86/nmi: Enable nested do_nmi() handling for 64-bit kernels · 9d050416
      Andy Lutomirski authored
      
      
      32-bit kernels handle nested NMIs in C.  Enable the exact same
      handling on 64-bit kernels as well.  This isn't currently
      necessary, but it will become necessary once the asm code starts
      allowing limited nesting.
      
      Signed-off-by: default avatarAndy Lutomirski <luto@kernel.org>
      Reviewed-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      9d050416
  3. Jul 13, 2015
    • Heiko Carstens's avatar
      s390/nmi: fix vector register corruption · cad49cfc
      Heiko Carstens authored
      
      
      If a machine check happens, the machine has the vector facility installed
      and the extended save area exists, the cpu will save vector register
      contents into the extended save area. This is regardless of control
      register 0 contents, which enables and disables the vector facility during
      runtime.
      
      On each machine check we should validate the vector registers. The current
      code however tries to validate the registers only if the running task is
      using vector registers in user space.
      
      However even the current code is broken and causes vector register
      corruption on machine checks, if user space uses them:
      the prefix area contains a pointer (absolute address) to the machine check
      extended save area. In order to save some space the save area was put into
      an unused area of the second prefix page.
      When validating vector register contents the code uses the absolute address
      of the extended save area, which is wrong. Due to prefixing the vector
      instructions will then access contents using absolute addresses instead
      of real addresses, where the machine stored the contents.
      
      If the above would work there is still the problem that register validition
      would only happen if user space uses vector registers. If kernel space uses
      them also, this may also lead to vector register content corruption:
      if the kernel makes use of vector instructions, but the current running
      user space context does not, the machine check handler will validate
      floating point registers instead of vector registers.
      Given the fact that writing to a floating point register may change the
      upper halve of the corresponding vector register, we also experience vector
      register corruption in this case.
      
      Fix all of these issues, and always validate vector registers on each
      machine check, if the machine has the vector facility installed and the
      extended save area is defined.
      
      Cc: <stable@vger.kernel.org> # 4.1+
      Signed-off-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: default avatarMartin Schwidefsky <schwidefsky@de.ibm.com>
      cad49cfc
    • Heiko Carstens's avatar
      s390/process: fix sfpc inline assembly · e47994dd
      Heiko Carstens authored
      
      
      The sfpc inline assembly within execve_tail() may incorrectly set bits
      28-31 of the sfpc instruction to a value which is not zero.
      These bits however are currently unused and therefore should be zero
      so we won't get surprised if these bits will be used in the future.
      
      Therefore remove the second operand from the inline assembly.
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarHeiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: default avatarMartin Schwidefsky <schwidefsky@de.ibm.com>
      e47994dd
    • Vineet Gupta's avatar
      ARCv2: support HS38 releases · 624b71ee
      Vineet Gupta authored
      
      
      Signed-off-by: default avatarVineet Gupta <vgupta@synopsys.com>
      624b71ee
    • Alexey Brodkin's avatar
      ARC: make sure instruction_pointer() returns unsigned value · f51e2f19
      Alexey Brodkin authored
      
      
      Currently instruction_pointer() returns pt_regs->ret and so return value
      is of type "long", which implicitly stands for "signed long".
      
      While that's perfectly fine when dealing with 32-bit values if return
      value of instruction_pointer() gets assigned to 64-bit variable sign
      extension may happen.
      
      And at least in one real use-case it happens already.
      In perf_prepare_sample() return value of perf_instruction_pointer()
      (which is an alias to instruction_pointer() in case of ARC) is assigned
      to (struct perf_sample_data)->ip (which type is "u64").
      
      And what we see if instuction pointer points to user-space application
      that in case of ARC lays below 0x8000_0000 "ip" gets set properly with
      leading 32 zeros. But if instruction pointer points to kernel address
      space that starts from 0x8000_0000 then "ip" is set with 32 leadig
      "f"-s. I.e. id instruction_pointer() returns 0x8100_0000, "ip" will be
      assigned with 0xffff_ffff__8100_0000. Which is obviously wrong.
      
      In particular that issuse broke output of perf, because perf was unable
      to associate addresses like 0xffff_ffff__8100_0000 with anything from
      /proc/kallsyms.
      
      That's what we used to see:
       ----------->8----------
        6.27%  ls       [unknown]                [k] 0xffffffff8046c5cc
        2.96%  ls       libuClibc-0.9.34-git.so  [.] memcpy
        2.25%  ls       libuClibc-0.9.34-git.so  [.] memset
        1.66%  ls       [unknown]                [k] 0xffffffff80666536
        1.54%  ls       libuClibc-0.9.34-git.so  [.] 0x000224d6
        1.18%  ls       libuClibc-0.9.34-git.so  [.] 0x00022472
       ----------->8----------
      
      With that change perf output looks much better now:
       ----------->8----------
        8.21%  ls       [kernel.kallsyms]        [k] memset
        3.52%  ls       libuClibc-0.9.34-git.so  [.] memcpy
        2.11%  ls       libuClibc-0.9.34-git.so  [.] malloc
        1.88%  ls       libuClibc-0.9.34-git.so  [.] memset
        1.64%  ls       [kernel.kallsyms]        [k] _raw_spin_unlock_irqrestore
        1.41%  ls       [kernel.kallsyms]        [k] __d_lookup_rcu
       ----------->8----------
      
      Signed-off-by: default avatarAlexey Brodkin <abrodkin@synopsys.com>
      Cc: arc-linux-dev@synopsys.com
      Cc: stable@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarVineet Gupta <vgupta@synopsys.com>
      f51e2f19
  4. Jul 12, 2015
  5. Jul 10, 2015
    • John David Anglin's avatar
      parisc: Fix some PTE/TLB race conditions and optimize __flush_tlb_range based on timing results · 01ab6057
      John David Anglin authored
      
      
      The increased use of pdtlb/pitlb instructions seemed to increase the
      frequency of random segmentation faults building packages. Further, we
      had a number of cases where TLB inserts would repeatedly fail and all
      forward progress would stop. The Haskell ghc package caused a lot of
      trouble in this area. The final indication of a race in pte handling was
      this syslog entry on sibaris (C8000):
      
       swap_free: Unused swap offset entry 00000004
       BUG: Bad page map in process mysqld  pte:00000100 pmd:019bbec5
       addr:00000000ec464000 vm_flags:00100073 anon_vma:0000000221023828 mapping: (null) index:ec464
       CPU: 1 PID: 9176 Comm: mysqld Not tainted 4.0.0-2-parisc64-smp #1 Debian 4.0.5-1
       Backtrace:
        [<0000000040173eb0>] show_stack+0x20/0x38
        [<0000000040444424>] dump_stack+0x9c/0x110
        [<00000000402a0d38>] print_bad_pte+0x1a8/0x278
        [<00000000402a28b8>] unmap_single_vma+0x3d8/0x770
        [<00000000402a4090>] zap_page_range+0xf0/0x198
        [<00000000402ba2a4>] SyS_madvise+0x404/0x8c0
      
      Note that the pte value is 0 except for the accessed bit 0x100. This bit
      shouldn't be set without the present bit.
      
      It should be noted that the madvise system call is probably a trigger for many
      of the random segmentation faults.
      
      In looking at the kernel code, I found the following problems:
      
      1) The pte_clear define didn't take TLB lock when clearing a pte.
      2) We didn't test pte present bit inside lock in exception support.
      3) The pte and tlb locks needed to merged in order to ensure consistency
      between page table and TLB. This also has the effect of serializing TLB
      broadcasts on SMP systems.
      
      The attached change implements the above and a few other tweaks to try
      to improve performance. Based on the timing code, TLB purges are very
      slow (e.g., ~ 209 cycles per page on rp3440). Thus, I think it
      beneficial to test the split_tlb variable to avoid duplicate purges.
      Probably, all PA 2.0 machines have combined TLBs.
      
      I dropped using __flush_tlb_range in flush_tlb_mm as I realized all
      applications and most threads have a stack size that is too large to
      make this useful. I added some comments to this effect.
      
      Since implementing 1 through 3, I haven't had any random segmentation
      faults on mx3210 (rp3440) in about one week of building code and running
      as a Debian buildd.
      
      Signed-off-by: default avatarJohn David Anglin <dave.anglin@bell.net>
      Cc: stable@vger.kernel.org # v3.18+
      Signed-off-by: default avatarHelge Deller <deller@gmx.de>
      01ab6057
    • Mark Rutland's avatar
      arm64: entry32: remove pointless register assignment · ad2daa85
      Mark Rutland authored
      
      
      We currently set x27 in compat_sys_sigreturn_wrapper and
      compat_sys_rt_sigreturn_wrapper, similarly to what we do with r8/why on
      32-bit ARM, in an attempt to prevent sigreturns from being restarted.
      
      However, on arm64 we have always used pt_regs::syscallno for syscall
      restarting (for both native and compat tasks), and x27 is never
      inspected again before being overwritten in kernel_exit.
      
      This patch removes the pointless register assignments.
      
      Signed-off-by: default avatarMark Rutland <mark.rutland@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      ad2daa85
    • Wanpeng Li's avatar
      kvm: x86: fix load xsave feature warning · ee4100da
      Wanpeng Li authored
      
      
      [   68.196974] WARNING: CPU: 1 PID: 2140 at arch/x86/kvm/x86.c:3161 kvm_arch_vcpu_ioctl+0xe88/0x1340 [kvm]()
      [   68.196975] Modules linked in: snd_hda_codec_hdmi i915 rfcomm bnep bluetooth i2c_algo_bit rfkill nfsd drm_kms_helper nfs_acl nfs drm lockd grace sunrpc fscache snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_seq_dummy snd_seq_oss x86_pkg_temp_thermal snd_seq_midi kvm_intel snd_seq_midi_event snd_rawmidi kvm snd_seq ghash_clmulni_intel fuse snd_timer aesni_intel parport_pc ablk_helper snd_seq_device cryptd ppdev snd lp parport lrw dcdbas gf128mul i2c_core glue_helper lpc_ich video shpchp mfd_core soundcore serio_raw acpi_cpufreq ext4 mbcache jbd2 sd_mod crc32c_intel ahci libahci libata e1000e ptp pps_core
      [   68.197005] CPU: 1 PID: 2140 Comm: qemu-system-x86 Not tainted 4.2.0-rc1+ #2
      [   68.197006] Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015
      [   68.197007]  ffffffffa03b0657 ffff8800d984bca8 ffffffff815915a2 0000000000000000
      [   68.197009]  0000000000000000 ffff8800d984bce8 ffffffff81057c0a 00007ff6d0001000
      [   68.197010]  0000000000000002 ffff880211c1a000 0000000000000004 ffff8800ce0288c0
      [   68.197012] Call Trace:
      [   68.197017]  [<ffffffff815915a2>] dump_stack+0x45/0x57
      [   68.197020]  [<ffffffff81057c0a>] warn_slowpath_common+0x8a/0xc0
      [   68.197022]  [<ffffffff81057cfa>] warn_slowpath_null+0x1a/0x20
      [   68.197029]  [<ffffffffa037bed8>] kvm_arch_vcpu_ioctl+0xe88/0x1340 [kvm]
      [   68.197035]  [<ffffffffa037aede>] ? kvm_arch_vcpu_load+0x4e/0x1c0 [kvm]
      [   68.197040]  [<ffffffffa03696a6>] kvm_vcpu_ioctl+0xc6/0x5c0 [kvm]
      [   68.197043]  [<ffffffff811252d2>] ? perf_pmu_enable+0x22/0x30
      [   68.197044]  [<ffffffff8112663e>] ? perf_event_context_sched_in+0x7e/0xb0
      [   68.197048]  [<ffffffff811a6882>] do_vfs_ioctl+0x2c2/0x4a0
      [   68.197050]  [<ffffffff8107bf33>] ? finish_task_switch+0x173/0x220
      [   68.197053]  [<ffffffff8123307f>] ? selinux_file_ioctl+0x4f/0xd0
      [   68.197055]  [<ffffffff8122cac3>] ? security_file_ioctl+0x43/0x60
      [   68.197057]  [<ffffffff811a6ad9>] SyS_ioctl+0x79/0x90
      [   68.197060]  [<ffffffff81597e57>] entry_SYSCALL_64_fastpath+0x12/0x6a
      [   68.197061] ---[ end trace 558a5ebf9445fc80 ]---
      
      After commit (0c4109be 'x86/fpu/xstate: Fix up bad get_xsave_addr()
      assumptions'), there is no assumption an xsave bit is present in the
      hardware (pcntxt_mask) that it is always present in a given xsave buffer.
      An enabled state to be present on 'pcntxt_mask', but *not* in 'xstate_bv'
      could happen when the last 'xsave' did not request that this feature be
      saved (unlikely) or because the "init optimization" caused it to not be
      saved. This patch kill the assumption.
      
      Signed-off-by: default avatarWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      ee4100da
    • Paolo Bonzini's avatar
      KVM: x86: apply guest MTRR virtualization on host reserved pages · fd717f11
      Paolo Bonzini authored
      
      
      Currently guest MTRR is avoided if kvm_is_reserved_pfn returns true.
      However, the guest could prefer a different page type than UC for
      such pages. A good example is that pass-throughed VGA frame buffer is
      not always UC as host expected.
      
      This patch enables full use of virtual guest MTRRs.
      
      Suggested-by: default avatarXiao Guangrong <guangrong.xiao@linux.intel.com>
      Tested-by: Joerg Roedel <jroedel@suse.de> (on AMD)
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      fd717f11
    • Jan Kiszka's avatar
      KVM: SVM: Sync g_pat with guest-written PAT value · e098223b
      Jan Kiszka authored
      
      
      When hardware supports the g_pat VMCB field, we can use it for emulating
      the PAT configuration that the guest configures by writing to the
      corresponding MSR.
      
      Signed-off-by: default avatarJan Kiszka <jan.kiszka@siemens.com>
      Tested-by: default avatarJoerg Roedel <jroedel@suse.de>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      e098223b
    • Paolo Bonzini's avatar
      KVM: SVM: use NPT page attributes · 3c2e7f7d
      Paolo Bonzini authored
      
      
      Right now, NPT page attributes are not used, and the final page
      attribute depends solely on gPAT (which however is not synced
      correctly), the guest MTRRs and the guest page attributes.
      
      However, we can do better by mimicking what is done for VMX.
      In the absence of PCI passthrough, the guest PAT can be ignored
      and the page attributes can be just WB.  If passthrough is being
      used, instead, keep respecting the guest PAT, and emulate the guest
      MTRRs through the PAT field of the nested page tables.
      
      The only snag is that WP memory cannot be emulated correctly,
      because Linux's default PAT setting only includes the other types.
      
      Tested-by: default avatarJoerg Roedel <jroedel@suse.de>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      3c2e7f7d
    • Paolo Bonzini's avatar
      KVM: count number of assigned devices · 5544eb9b
      Paolo Bonzini authored
      
      
      If there are no assigned devices, the guest PAT are not providing
      any useful information and can be overridden to writeback; VMX
      always does this because it has the "IPAT" bit in its extended
      page table entries, but SVM does not have anything similar.
      Hook into VFIO and legacy device assignment so that they
      provide this information to KVM.
      
      Reviewed-by: default avatarAlex Williamson <alex.williamson@redhat.com>
      Tested-by: default avatarJoerg Roedel <jroedel@suse.de>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      5544eb9b
    • Radim Krčmář's avatar
      KVM: VMX: fix vmwrite to invalid VMCS · 370777da
      Radim Krčmář authored
      
      
      fpu_activate is called outside of vcpu_load(), which means it should not
      touch VMCS, but fpu_activate needs to.  Avoid the call by moving it to a
      point where we know that the guest needs eager FPU and VMCS is loaded.
      
      This will get rid of the following trace
      
       vmwrite error: reg 6800 value 0 (err 1)
        [<ffffffff8162035b>] dump_stack+0x19/0x1b
        [<ffffffffa046c701>] vmwrite_error+0x2c/0x2e [kvm_intel]
        [<ffffffffa045f26f>] vmcs_writel+0x1f/0x30 [kvm_intel]
        [<ffffffffa04617e5>] vmx_fpu_activate.part.61+0x45/0xb0 [kvm_intel]
        [<ffffffffa0461865>] vmx_fpu_activate+0x15/0x20 [kvm_intel]
        [<ffffffffa0560b91>] kvm_arch_vcpu_create+0x51/0x70 [kvm]
        [<ffffffffa0548011>] kvm_vm_ioctl+0x1c1/0x760 [kvm]
        [<ffffffff8118b55a>] ? handle_mm_fault+0x49a/0xec0
        [<ffffffff811e47d5>] do_vfs_ioctl+0x2e5/0x4c0
        [<ffffffff8127abbe>] ? file_has_perm+0xae/0xc0
        [<ffffffff811e4a51>] SyS_ioctl+0xa1/0xc0
        [<ffffffff81630949>] system_call_fastpath+0x16/0x1b
      
      (Note: we also unconditionally activate FPU in vmx_vcpu_reset(), so the
       removed code added nothing.)
      
      Fixes: c447e76b ("kvm/fpu: Enable eager restore kvm FPU for MPX")
      Cc: <stable@vger.kernel.org>
      Reported-by: default avatarVlastimil Holer <vlastimil.holer@gmail.com>
      Signed-off-by: default avatarRadim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      370777da
    • Paolo Bonzini's avatar
      KVM: x86: reintroduce kvm_is_mmio_pfn · d1fe9219
      Paolo Bonzini authored
      
      
      The call to get_mt_mask was really using kvm_is_reserved_pfn to
      detect an MMIO-backed page.  In this case, we want "false" to be
      returned for the zero page.
      
      Reintroduce a separate kvm_is_mmio_pfn predicate for this use
      only.
      
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      d1fe9219
    • Paolo Bonzini's avatar
      5d75a747
    • Ralf Baechle's avatar
      MIPS: O32: Use compat_sys_getsockopt. · 51d53674
      Ralf Baechle authored
      
      
      We were using the native syscall and that results in subtle breakage.
      
      This is the same issue as fixed in 077d0e65
      (MIPS: N32: Use compat getsockopt syscall) but that commit did fix it only
      for N32.
      
      Signed-off-by: default avatarRalf Baechle <ralf@linux-mips.org>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=100291
      51d53674
    • Paul Burton's avatar
      MIPS: c-r4k: Extend way_string array · 1e18ac7a
      Paul Burton authored
      
      
      The L2 cache in the I6400 core has 16 ways, so extend the way_string
      array to take such caches into account.
      
      [ralf@linux-mips.org: Other already supported CPUs are free to support
      more than 8 ways of cache as well.]
      
      Signed-off-by: default avatarPaul Burton <paul.burton@imgtec.com>
      Signed-off-by: default avatarMarkos Chandras <markos.chandras@imgtec.com>
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/10640/
      
      
      Signed-off-by: default avatarRalf Baechle <ralf@linux-mips.org>
      1e18ac7a
    • James Hogan's avatar
      MIPS: Pistachio: Support CDMM & Fast Debug Channel · 6b5e741e
      James Hogan authored
      
      
      Implement the mips_cdmm_phys_base() platform callback to provide a
      default Common Device Memory Map (CDMM) physical base address for the
      Pistachio SoC. This allows the CDMM in each VPE to be configured and
      probed for devices, such as the Fast Debug Channel (FDC).
      
      The physical address chosen is just below the default CPC address, which
      appears to also be unallocated.
      
      The FDC IRQ is also usable on Pistachio, and is routed through the GIC,
      so implement the get_c0_fdc_int() platform callback using
      gic_get_c0_fdc_int(), so the FDC driver doesn't have to fall back to
      polling.
      
      Signed-off-by: default avatarJames Hogan <james.hogan@imgtec.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Andrew Bresticker <abrestic@chromium.org>
      Cc: James Hartley <james.hartley@imgtec.com>
      Cc: linux-mips@linux-mips.org
      Reviewed-by: default avatarAndrew Bresticker <abrestic@chromium.org>
      Patchwork: http://patchwork.linux-mips.org/patch/9749/
      
      
      Signed-off-by: default avatarRalf Baechle <ralf@linux-mips.org>
      6b5e741e
Loading