  1. Aug 23, 2018
  2. Aug 22, 2018
    • module: use relative references for __ksymtab entries · 7290d580
      Ard Biesheuvel authored
      An ordinary arm64 defconfig build has ~64 KB worth of __ksymtab entries,
      each consisting of two 64-bit fields containing absolute references, to
      the symbol itself and to a char array containing its name, respectively.
      
      When we build the same configuration with KASLR enabled, we end up with an
      additional ~192 KB of relocations in the .init section, i.e., one 24 byte
      entry for each absolute reference, which all need to be processed at boot
      time.
      
      Given how the struct kernel_symbol that describes each entry is completely
      local to module.c (except for the references emitted by EXPORT_SYMBOL()
      itself), we can easily modify it to contain two 32-bit relative references
      instead.  This reduces the size of the __ksymtab section by 50% for all
      64-bit architectures, and gets rid of the runtime relocations entirely for
      architectures implementing KASLR, either via standard PIE linking (arm64)
      or using custom host tools (x86).
      
      Note that the binary search involving __ksymtab contents relies on each
      section being sorted by symbol name.  This is implemented based on the
      input section names, not the names in the ksymtab entries, so this patch
      does not interfere with that.
      
      Given that the use of place-relative relocations requires support both in
      the toolchain and in the module loader, we cannot enable this feature for
      all architectures.  So make it dependent on whether
      CONFIG_HAVE_ARCH_PREL32_RELOCATIONS is defined.
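
      As a rough sketch of the resulting layout (the helper below mirrors the
      offset_to_ptr() idea used upstream; treat the details as illustrative,
      not the literal patch):

        struct kernel_symbol {
            int value_offset;   /* 32-bit place-relative reference to the symbol */
            int name_offset;    /* 32-bit place-relative reference to its name */
        };

        static inline void *offset_to_ptr(const int *off)
        {
            /* convert a signed 32-bit place-relative offset back into a pointer */
            return (void *)((unsigned long)off + *off);
        }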
      
      Link: http://lkml.kernel.org/r/20180704083651.24360-4-ard.biesheuvel@linaro.org
      
      
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: Jessica Yu <jeyu@kernel.org>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: James Morris <james.morris@microsoft.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Nicolas Pitre <nico@linaro.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7290d580
    • KVM: vmx: Add defines for SGX ENCLS exiting · 802ec461
      Sean Christopherson authored
      
      
      Hardware support for basic SGX virtualization adds a new execution
      control (ENCLS_EXITING), VMCS field (ENCLS_EXITING_BITMAP) and exit
      reason (ENCLS), that enables a VMM to intercept specific ENCLS leaf
      functions, e.g. to inject faults when the VMM isn't exposing SGX to
      a VM.  When ENCLS_EXITING is enabled, the VMM can set/clear bits in
      the bitmap to intercept/allow ENCLS leaf functions in non-root, e.g.
      setting bit 2 in the ENCLS_EXITING_BITMAP will cause ENCLS[EINIT]
      to VMExit(ENCLS).
      
      Note: EXIT_REASON_ENCLS was previously added by commit 1f519992
      ("KVM: VMX: add missing exit reasons").
      
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20180814163334.25724-2-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      802ec461
  3. Aug 17, 2018
    • x86/speculation/l1tf: Exempt zeroed PTEs from inversion · f19f5c49
      Sean Christopherson authored
      
      
      It turns out that we should *not* invert all not-present mappings,
      because the all zeroes case is obviously special.
      
      clear_page() does not undergo the XOR logic to invert the address bits,
      i.e. PTE, PMD and PUD entries that have not been individually written
      will have val=0 and so will trigger __pte_needs_invert(). As a result,
      {pte,pmd,pud}_pfn() will return the wrong PFN value, i.e. all ones
      (adjusted by the max PFN mask) instead of zero. A zeroed entry is ok
      because the page at physical address 0 is reserved early in boot
      specifically to mitigate L1TF, so explicitly exempt them from the
      inversion when reading the PFN.
      
      Manifested as an unexpected mprotect(..., PROT_NONE) failure when called
      on a VMA that has VM_PFNMAP and was mmap'd to as something other than
      PROT_NONE but never used. mprotect() sends the PROT_NONE request down
      prot_none_walk(), which walks the PTEs to check the PFNs.
      prot_none_pte_entry() gets the bogus PFN from pte_pfn() and returns
      -EACCES because it thinks mprotect() is trying to adjust a high MMIO
      address.
      
      [ This is a very modified version of Sean's original patch, but all
        credit goes to Sean for doing this and also pointing out that
        sometimes the __pte_needs_invert() function only gets the protection
        bits, not the full eventual pte.  But zero remains special even in
        just protection bits, so that's ok.   - Linus ]
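
      The resulting check is essentially the following (a sketch of the fix,
      not necessarily the verbatim diff):

        static inline bool __pte_needs_invert(u64 val)
        {
            /* an all-zero entry is special: never invert it */
            return val && !(val & _PAGE_PRESENT);
        }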
      
      Fixes: f22cc87f ("x86/speculation/l1tf: Invert all not present mappings")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f19f5c49
  4. Aug 15, 2018
  5. Aug 10, 2018
    • x86/mm/pti: Move user W+X check into pti_finalize() · d878efce
      Joerg Roedel authored
      
      
      The user page-table gets the updated kernel mappings in pti_finalize(),
      which runs after the RO+X permissions got applied to the kernel page-table
      in mark_readonly().
      
      But with CONFIG_DEBUG_WX enabled, the user page-table is already checked in
      mark_readonly() for insecure mappings.  This causes false-positive
      warnings, because the user page-table did not get the updated mappings yet.
      
      Move the W+X check for the user page-table into pti_finalize() after it
      updated all required mappings.
      
      [ tglx: Folded !NX supported fix ]
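
      Roughly, the tail of pti_finalize() becomes (sketch only; guards and
      details omitted):

        void pti_finalize(void)
        {
            /* ... clone the final kernel mappings into the user page-table ... */
            pti_clone_entry_text();
            pti_clone_kernel_text();

            /* check user W+X only once the user page-table is complete */
            debug_checkwx_user();
        }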
      
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: linux-mm@kvack.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Waiman Long <llong@redhat.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
      Cc: joro@8bytes.org
      Link: https://lkml.kernel.org/r/1533727000-9172-1-git-send-email-joro@8bytes.org
      d878efce
  6. Aug 08, 2018
  7. Aug 07, 2018
  8. Aug 06, 2018
    • x86/mm/init: Remove freed kernel image areas from alias mapping · c40a56a7
      Dave Hansen authored
      
      
      The kernel image is mapped into two places in the virtual address space
      (addresses without KASLR, of course):
      
      	1. The kernel direct map (0xffff880000000000)
      	2. The "high kernel map" (0xffffffff81000000)
      
      We actually execute out of #2.  If we get the address of a kernel symbol,
      it points to #2, but almost all physical-to-virtual translations point to
      #1, the kernel direct map.

      Parts of the "high kernel map" alias are mapped in the userspace page
      tables with the Global bit for performance reasons.  The parts that we map
      to userspace do not (er, should not) have secrets. When PTI is enabled then
      the global bit is usually not set in the high mapping and just used to
      compensate for poor performance on systems which lack PCID.
      
      This is fine, except that some areas in the kernel image that are adjacent
      to the non-secret-containing areas are unused holes.  We free these holes
      back into the normal page allocator and reuse them as normal kernel memory.
      The memory will, of course, get *used* via the normal map, but the alias
      mapping is kept.
      
      This otherwise unused alias mapping of the holes will, by default, keep the
      Global bit, be mapped out to userspace, and be vulnerable to Meltdown.
      
      Remove the alias mapping of these pages entirely.  This is likely to
      fracture the 2M page mapping the kernel image near these areas, but this
      should affect a minority of the area.
      
      The pageattr code changes *all* aliases mapping the physical pages that it
      operates on (by default).  We only want to modify a single alias, so we
      need to tweak its behavior.
      
      This unmapping behavior is currently dependent on PTI being in place.
      Going forward, we should at least consider doing this for all
      configurations.  Having an extra read-write alias for memory is not exactly
      ideal for debugging things like random memory corruption and this does
      undercut features like DEBUG_PAGEALLOC or future work like eXclusive Page
      Frame Ownership (XPFO).
      
      Before this patch:
      
      current_kernel:---[ High Kernel Mapping ]---
      current_kernel-0xffffffff80000000-0xffffffff81000000          16M                               pmd
      current_kernel-0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd
      current_kernel-0xffffffff81e00000-0xffffffff81e11000          68K     ro                 GLB x  pte
      current_kernel-0xffffffff81e11000-0xffffffff82000000        1980K     RW                     NX pte
      current_kernel-0xffffffff82000000-0xffffffff82600000           6M     ro         PSE     GLB NX pmd
      current_kernel-0xffffffff82600000-0xffffffff82c00000           6M     RW         PSE         NX pmd
      current_kernel-0xffffffff82c00000-0xffffffff82e00000           2M     RW                     NX pte
      current_kernel-0xffffffff82e00000-0xffffffff83200000           4M     RW         PSE         NX pmd
      current_kernel-0xffffffff83200000-0xffffffffa0000000         462M                               pmd
      
        current_user:---[ High Kernel Mapping ]---
        current_user-0xffffffff80000000-0xffffffff81000000          16M                               pmd
        current_user-0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd
        current_user-0xffffffff81e00000-0xffffffff81e11000          68K     ro                 GLB x  pte
        current_user-0xffffffff81e11000-0xffffffff82000000        1980K     RW                     NX pte
        current_user-0xffffffff82000000-0xffffffff82600000           6M     ro         PSE     GLB NX pmd
        current_user-0xffffffff82600000-0xffffffffa0000000         474M                               pmd
      
      After this patch:
      
      current_kernel:---[ High Kernel Mapping ]---
      current_kernel-0xffffffff80000000-0xffffffff81000000          16M                               pmd
      current_kernel-0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd
      current_kernel-0xffffffff81e00000-0xffffffff81e11000          68K     ro                 GLB x  pte
      current_kernel-0xffffffff81e11000-0xffffffff82000000        1980K                               pte
      current_kernel-0xffffffff82000000-0xffffffff82400000           4M     ro         PSE     GLB NX pmd
      current_kernel-0xffffffff82400000-0xffffffff82488000         544K     ro                     NX pte
      current_kernel-0xffffffff82488000-0xffffffff82600000        1504K                               pte
      current_kernel-0xffffffff82600000-0xffffffff82c00000           6M     RW         PSE         NX pmd
      current_kernel-0xffffffff82c00000-0xffffffff82c0d000          52K     RW                     NX pte
      current_kernel-0xffffffff82c0d000-0xffffffff82dc0000        1740K                               pte
      
        current_user:---[ High Kernel Mapping ]---
        current_user-0xffffffff80000000-0xffffffff81000000          16M                               pmd
        current_user-0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd
        current_user-0xffffffff81e00000-0xffffffff81e11000          68K     ro                 GLB x  pte
        current_user-0xffffffff81e11000-0xffffffff82000000        1980K                               pte
        current_user-0xffffffff82000000-0xffffffff82400000           4M     ro         PSE     GLB NX pmd
        current_user-0xffffffff82400000-0xffffffff82488000         544K     ro                     NX pte
        current_user-0xffffffff82488000-0xffffffff82600000        1504K                               pte
        current_user-0xffffffff82600000-0xffffffffa0000000         474M                               pmd
      
      [ tglx: Do not unmap on 32bit as there is only one mapping ]
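
      The core of the change, as a sketch (guards and details simplified):

        void free_kernel_image_pages(void *begin, void *end)
        {
            unsigned long begin_ul  = (unsigned long)begin;
            unsigned long end_ul    = (unsigned long)end;
            unsigned long len_pages = (end_ul - begin_ul) >> PAGE_SHIFT;

            free_init_pages("unused kernel image", begin_ul, end_ul);

            /*
             * Also remove the high-kernel-map alias of the freed pages so no
             * stale (and possibly Global) mapping of them remains.
             */
            if (IS_ENABLED(CONFIG_X86_64) && cpu_feature_enabled(X86_FEATURE_PTI))
                set_memory_np_noalias(begin_ul, len_pages);
        }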
      
      Fixes: 0f561fce ("x86/pti: Enable global pages for shared areas")
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Kees Cook <keescook@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Link: https://lkml.kernel.org/r/20180802225831.5F6A2BFC@viggo.jf.intel.com
      c40a56a7
    • KVM: X86: Implement "send IPI" hypercall · 4180bf1b
      Wanpeng Li authored
      Use a hypercall to send IPIs with a single vmexit, instead of one vmexit
      per IPI for xAPIC/x2APIC physical mode and one vmexit per cluster for
      x2APIC cluster mode. An Intel guest can enter x2APIC cluster mode when
      interrupt remapping is enabled in QEMU; the latest AMD EPYC, however,
      still supports only xAPIC mode and therefore benefits greatly from
      exit-less IPIs. This patchset lets a guest send multicast IPIs, with at
      most 128 destinations per hypercall in 64-bit mode and 64 vCPUs per
      hypercall in 32-bit mode.
      
      Hardware: Xeon Skylake 2.5GHz, 2 sockets, 40 cores, 80 threads; the VM
      has 80 vCPUs. IPI microbenchmark (https://lkml.org/lkml/2017/12/19/141):
      
      x2apic cluster mode, vanilla
      
       Dry-run:                         0,            2392199 ns
       Self-IPI:                  6907514,           15027589 ns
       Normal IPI:              223910476,          251301666 ns
       Broadcast IPI:                   0,         9282161150 ns
       Broadcast lock:                  0,         8812934104 ns
      
      x2apic cluster mode, pv-ipi
      
       Dry-run:                         0,            2449341 ns
       Self-IPI:                  6720360,           15028732 ns
       Normal IPI:              228643307,          255708477 ns
       Broadcast IPI:                   0,         7572293590 ns  => 22% performance boost
       Broadcast lock:                  0,         8316124651 ns
      
      x2apic physical mode, vanilla
      
       Dry-run:                         0,            3135933 ns
       Self-IPI:                  8572670,           17901757 ns
       Normal IPI:              226444334,          255421709 ns
       Broadcast IPI:                   0,        19845070887 ns
       Broadcast lock:                  0,        19827383656 ns
      
      x2apic physical mode, pv-ipi
      
       Dry-run:                         0,            2446381 ns
       Self-IPI:                  6788217,           15021056 ns
       Normal IPI:              219454441,          249583458 ns
       Broadcast IPI:                   0,         7806540019 ns  => 154% performance boost
       Broadcast lock:                  0,         9143618799 ns
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4180bf1b
    • KVM: x86: Add tlb remote flush callback in kvm_x86_ops. · b08660e5
      Tianyu Lan authored
      
      
      Provide a way for platforms to register a Hyper-V TLB remote-flush
      callback; this helps optimize TLB flushing among vCPUs in the nested
      virtualization case.
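
      Concretely, kvm_x86_ops grows an optional hook along these lines (sketch):

        struct kvm_x86_ops {
            /* ... */
            /* set by the platform (e.g. Hyper-V); NULL when unavailable */
            int (*tlb_remote_flush)(struct kvm *kvm);
            /* ... */
        };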
      
      Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b08660e5
    • X86/Hyper-V: Add hyperv_nested_flush_guest_mapping ftrace support · 60cfce4c
      Tianyu Lan authored
      
      
      Add hyperv_nested_flush_guest_mapping ftrace support to trace the
      HvFlushGuestPhysicalAddressSpace hypercall.
      
      Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
      Acked-by: K. Y. Srinivasan <kys@microsoft.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      60cfce4c
    • X86/Hyper-V: Add flush HvFlushGuestPhysicalAddressSpace hypercall support · eb914cfe
      Tianyu Lan authored
      
      
      Hyper-V provides the paravirtual hypercall HvFlushGuestPhysicalAddressSpace
      to flush nested VM address space mappings in the L1 hypervisor, which
      reduces the overhead of flushing the EPT TLB across vCPUs. Implement
      support for it.
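
      The hypercall input is essentially the EPT pointer of the nested address
      space to flush; a sketch of the wrapper (layout and error handling
      simplified):

        struct hv_guest_mapping_flush {
            u64 address_space;  /* EPT pointer identifying the space */
            u64 flags;
        };

        int hyperv_flush_guest_mapping(u64 address_space)
        {
            struct hv_guest_mapping_flush flush = {
                .address_space = address_space,
                .flags = 0,
            };
            u64 status;

            /* the real code passes a per-cpu hypercall argument page,
             * not a stack variable; simplified here */
            status = hv_do_hypercall(HVCALL_FLUSH_GUEST_PHYSICAL_ADDRESS_SPACE,
                                     &flush, NULL);
            return status == HV_STATUS_SUCCESS ? 0 : -EFAULT;
        }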
      
      Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
      Acked-by: K. Y. Srinivasan <kys@microsoft.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      eb914cfe
    • kvm: x86: Remove CR3_PCID_INVD flag · 208320ba
      Junaid Shahid authored
      
      
      It is a duplicate of X86_CR3_PCID_NOFLUSH. So just use that instead.
      
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      208320ba
    • kvm: x86: Add multi-entry LRU cache for previous CR3s · b94742c9
      Junaid Shahid authored
      
      
      Adds support for storing multiple previous CR3/root_hpa pairs maintained
      as an LRU cache, so that the lockless CR3 switch path can be used when
      switching back to any of them.
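
      The cache is, roughly, a small fixed-size array in struct kvm_mmu
      (sketch; the exact size is an implementation detail):

        #define KVM_MMU_NUM_PREV_ROOTS  3

        struct kvm_mmu_root_info {
            gpa_t cr3;
            hpa_t hpa;
        };

        /* in struct kvm_mmu, kept in LRU order: */
        struct kvm_mmu_root_info prev_roots[KVM_MMU_NUM_PREV_ROOTS];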
      
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b94742c9
    • kvm: x86: Flush only affected TLB entries in kvm_mmu_invlpg* · faff8758
      Junaid Shahid authored
      
      
      kvm_mmu_invlpg() and kvm_mmu_invpcid_gva() only need to flush the TLB
      entries for the specific guest virtual address, instead of flushing all
      TLB entries associated with the VM.
      
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      faff8758
    • kvm: x86: Support selectively freeing either current or previous MMU root · 08fb59d8
      Junaid Shahid authored
      
      
      kvm_mmu_free_roots() now takes a mask specifying which roots to free, so
      that either one of the roots (active/previous) can be individually freed
      when needed.
      
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      08fb59d8
    • kvm: x86: Add a root_hpa parameter to kvm_mmu->invlpg() · 7eb77e9f
      Junaid Shahid authored
      
      
      This allows invlpg() to be called using either the active root_hpa
      or the prev_root_hpa.
      
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7eb77e9f
    • kvm: x86: Skip TLB flush on fast CR3 switch when indicated by guest · ade61e28
      Junaid Shahid authored
      
      
      When PCIDs are enabled, the MSb of the source operand for a MOV-to-CR3
      instruction indicates that the TLB doesn't need to be flushed.
      
      This change enables this optimization for MOV-to-CR3s in the guest
      that have been intercepted by KVM for shadow paging and are handled
      within the fast CR3 switch path.
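
      The relevant bit is CR3[63], i.e. X86_CR3_PCID_NOFLUSH; its handling in
      the emulated CR3 write looks roughly like this (sketch):

        bool skip_tlb_flush = false;

        if (kvm_read_cr4_bits(vcpu, X86_CR4_PCIDE)) {
            skip_tlb_flush = new_cr3 & X86_CR3_PCID_NOFLUSH;
            /* bit 63 is a hint, not part of the address */
            new_cr3 &= ~X86_CR3_PCID_NOFLUSH;
        }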
      
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ade61e28
    • kvm: vmx: Support INVPCID in shadow paging mode · eb4b248e
      Junaid Shahid authored
      
      
      Implement support for INVPCID in shadow paging mode as well.
      
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      eb4b248e
    • kvm: x86: Introduce KVM_REQ_LOAD_CR3 · 6e42782f
      Junaid Shahid authored
      
      
      The KVM_REQ_LOAD_CR3 request loads the hardware CR3 using the
      current root_hpa.
      
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6e42782f
    • kvm: x86: Add fast CR3 switch code path · 7c390d35
      Junaid Shahid authored
      
      
      When using shadow paging, a CR3 switch in the guest results in a VM Exit.
      In the common case, that VM exit doesn't require much processing by KVM.
      However, it does acquire the MMU lock, which can start showing signs of
      contention under some workloads even on a 2 VCPU VM when the guest is
      using KPTI. Therefore, we add a fast path that avoids acquiring the MMU
      lock in the most common cases e.g. when switching back and forth between
      the kernel and user mode CR3s used by KPTI with no guest page table
      changes in between.
      
      For now, this fast path is implemented only for 64-bit guests and hosts
      to avoid the handling of PDPTEs, but it can be extended later to 32-bit
      guests and/or hosts as well.
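
      In outline, the fast path looks something like the following simplified
      sketch (the cached-root helper name is hypothetical, not the actual diff):

        static bool fast_cr3_switch(struct kvm_vcpu *vcpu, gpa_t new_cr3)
        {
            /* 64-bit roots only, so no PDPTE handling is needed */
            if (vcpu->arch.mmu.root_level < PT64_ROOT_4LEVEL)
                return false;

            /* reuse a cached root for new_cr3 without taking the MMU lock */
            if (mmu_try_cached_root(vcpu, new_cr3))   /* hypothetical helper */
                return true;

            return false;   /* fall back to the slow path */
        }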
      
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7c390d35
    • kvm: nVMX: Introduce KVM_CAP_NESTED_STATE · 8fcc4b59
      Jim Mattson authored
      
      
      For nested virtualization, L0 KVM manages a bit of state for L2 guests
      that cannot be captured through the currently available IOCTLs. In fact,
      the state captured through all of these IOCTLs is usually a mix of L1
      and L2 state. It also depends on whether the L2 guest was running at the
      moment the process was interrupted to save its state.
      
      With this capability, there are two new vcpu ioctls: KVM_GET_NESTED_STATE
      and KVM_SET_NESTED_STATE. These can be used for saving and restoring a VM
      that is in VMX operation.
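
      From userspace, the flow is roughly the following sketch (state_size
      stands for the buffer size reported when checking the
      KVM_CAP_NESTED_STATE capability; details are illustrative):

        struct kvm_nested_state *state = calloc(1, state_size);

        state->size = state_size;
        if (ioctl(vcpu_fd, KVM_GET_NESTED_STATE, state) == 0) {
            /* save/migrate the blob, then restore it on the target vCPU: */
            ioctl(vcpu_fd, KVM_SET_NESTED_STATE, state);
        }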
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Jim Mattson <jmattson@google.com>
      [karahmed@ - rename structs and functions and make them ready for AMD and
                   address previous comments.
                 - handle nested.smm state.
                 - rebase & a bit of refactoring.
                 - Merge 7/8 and 8/8 into one patch. ]
      Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8fcc4b59
    • KVM: x86: do not load vmcs12 pages while still in SMM · 7f7f1ba3
      Paolo Bonzini authored
      
      
      If the vCPU enters system management mode while running a nested guest,
      RSM starts processing the vmentry while still in SMM.  In that case,
      however, the pages pointed to by the vmcs12 might be incorrectly
      loaded from SMRAM.  To avoid this, delay the handling of the pages
      until just before the next vmentry.  This is done with a new request
      and a new entry in kvm_x86_ops, which we will be able to reuse for
      nested VMX state migration.
      
      Extracted from a patch by Jim Mattson and KarimAllah Ahmed.
      
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7f7f1ba3
  9. Aug 05, 2018
    • x86/irqflags: Provide a declaration for native_save_fl · 208cbb32
      Nick Desaulniers authored
      
      
      It was reported that commit d0a8d937 causes users of gcc < 4.9
      to observe -Werror=missing-prototypes errors.
      
      Indeed, it seems that:
      extern inline unsigned long native_save_fl(void) { return 0; }
      
      compiled with -Werror=missing-prototypes produces this warning in gcc <
      4.9, but not gcc >= 4.9.
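
      The fix boils down to providing a declaration ahead of the extern inline
      definition, roughly (the body below reflects the current header; sketch):

        extern unsigned long native_save_fl(void);

        extern inline unsigned long native_save_fl(void)
        {
            unsigned long flags;

            asm volatile("# __raw_save_flags\n\t"
                         "pushf ; pop %0"
                         : "=rm" (flags)
                         : /* no input */
                         : "memory");
            return flags;
        }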
      
      Fixes: d0a8d937 ("x86/paravirt: Make native_save_fl() extern inline").
      Reported-by: David Laight <david.laight@aculab.com>
      Reported-by: Jean Delvare <jdelvare@suse.de>
      Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: hpa@zytor.com
      Cc: jgross@suse.com
      Cc: kstewart@linuxfoundation.org
      Cc: gregkh@linuxfoundation.org
      Cc: boris.ostrovsky@oracle.com
      Cc: astrachan@google.com
      Cc: mka@chromium.org
      Cc: arnd@arndb.de
      Cc: tstellar@redhat.com
      Cc: sedat.dilek@gmail.com
      Cc: David.Laight@aculab.com
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20180803170550.164688-1-ndesaulniers@google.com
      208cbb32
    • x86/mm/init: Add helper for freeing kernel image pages · 6ea2738e
      Dave Hansen authored
      
      
      When chunks of the kernel image are freed, free_init_pages() is used
      directly.  Consolidate the three sites that do this.  Also update the
      string to give an incrementally better description of that memory versus
      what was there before.
      
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: keescook@google.com
      Cc: aarcange@redhat.com
      Cc: jgross@suse.com
      Cc: jpoimboe@redhat.com
      Cc: gregkh@linuxfoundation.org
      Cc: peterz@infradead.org
      Cc: hughd@google.com
      Cc: torvalds@linux-foundation.org
      Cc: bp@alien8.de
      Cc: luto@kernel.org
      Cc: ak@linux.intel.com
      Cc: Kees Cook <keescook@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Link: https://lkml.kernel.org/r/20180802225829.FE0E32EA@viggo.jf.intel.com
      6ea2738e
    • KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry · 5b76a3cf
      Paolo Bonzini authored
      
      
      When nested virtualization is in use, VMENTER operations from the nested
      hypervisor into the nested guest will always be processed by the bare metal
      hypervisor, and KVM's "conditional cache flushes" mode in particular does a
      flush on nested vmentry.  Therefore, include the "skip L1D flush on
      vmentry" bit in KVM's suggested ARCH_CAPABILITIES setting.
      
      Add the relevant Documentation.
      
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      5b76a3cf
    • x86/speculation: Use ARCH_CAPABILITIES to skip L1D flush on vmentry · 8e0b2b91
      Paolo Bonzini authored
      
      
      Bit 3 of ARCH_CAPABILITIES tells a hypervisor that L1D flush on vmentry is
      not needed.  Add a new value to enum vmx_l1d_flush_state, which is used
      either if there is no L1TF bug at all, or if bit 3 is set in ARCH_CAPABILITIES.
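
      In essence (a sketch; names as used upstream, error handling omitted):

        #define ARCH_CAP_SKIP_VMENTRY_L1DFLUSH  BIT(3)

        rdmsrl(MSR_IA32_ARCH_CAPABILITIES, msr);
        if (msr & ARCH_CAP_SKIP_VMENTRY_L1DFLUSH) {
            l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_NOT_REQUIRED;
            return 0;   /* no flush needed on this host */
        }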
      
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      8e0b2b91
    • x86/irq: Let interrupt handlers set kvm_cpu_l1tf_flush_l1d · ffcba43f
      Nicolai Stange authored
      
      
      The last missing piece to having vmx_l1d_flush() take interrupts after
      VMEXIT into account is to set the kvm_cpu_l1tf_flush_l1d per-cpu flag on
      irq entry.
      
      Issue calls to kvm_set_cpu_l1tf_flush_l1d() from entering_irq(),
      ipi_entering_ack_irq(), smp_reschedule_interrupt() and
      uv_bau_message_interrupt().
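
      For example, entering_irq() becomes roughly:

        static inline void entering_irq(void)
        {
            irq_enter();
            /* an interrupt taken in root mode may touch sensitive data */
            kvm_set_cpu_l1tf_flush_l1d();
        }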
      
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Nicolai Stange <nstange@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      ffcba43f
    • x86: Don't include linux/irq.h from asm/hardirq.h · 447ae316
      Nicolai Stange authored
      
      
      The next patch in this series will have to make the definition of
      irq_cpustat_t available to entering_irq().
      
      Inclusion of asm/hardirq.h into asm/apic.h would cause circular header
      dependencies like
      
        asm/smp.h
          asm/apic.h
            asm/hardirq.h
              linux/irq.h
                linux/topology.h
                  linux/smp.h
                    asm/smp.h
      
      or
      
        linux/gfp.h
          linux/mmzone.h
            asm/mmzone.h
              asm/mmzone_64.h
                asm/smp.h
                  asm/apic.h
                    asm/hardirq.h
                      linux/irq.h
                        linux/irqdesc.h
                          linux/kobject.h
                            linux/sysfs.h
                              linux/kernfs.h
                                linux/idr.h
                                  linux/gfp.h
      
      and others.
      
      This causes compilation errors because of the header guards becoming
      effective in the second inclusion: symbols/macros that had been defined
      before wouldn't be available to intermediate headers in the #include chain
      anymore.
      
      A possible workaround would be to move the definition of irq_cpustat_t
      into its own header and include that from both, asm/hardirq.h and
      asm/apic.h.
      
      However, this wouldn't solve the real problem, namely asm/hardirq.h
      unnecessarily pulling in all the linux/irq.h cruft: nothing in
      asm/hardirq.h itself requires it. Also, note that there are some other
      archs, like e.g. arm64, which don't have that #include in their
      asm/hardirq.h.
      
      Remove the linux/irq.h #include from x86' asm/hardirq.h.
      
      Fix resulting compilation errors by adding appropriate #includes to *.c
      files as needed.
      
      Note that some of these *.c files could be cleaned up a bit with respect
      to their set of #includes, but that is better done in separate patches,
      if at all.
      
      Signed-off-by: Nicolai Stange <nstange@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      447ae316
    • x86/KVM/VMX: Introduce per-host-cpu analogue of l1tf_flush_l1d · 45b575c0
      Nicolai Stange authored
      
      
      Part of the L1TF mitigation for vmx includes flushing the L1D cache upon
      VMENTRY.
      
      L1D flushes are costly and two modes of operations are provided to users:
      "always" and the more selective "conditional" mode.
      
      If operating in the latter, the cache would get flushed only if a host side
      code path considered unconfined had been traversed. "Unconfined" in this
      context means that it might have pulled in sensitive data like user data
      or kernel crypto keys.
      
      The need for L1D flushes is tracked by means of the per-vcpu flag
      l1tf_flush_l1d. KVM exit handlers considered unconfined set it. A
      vmx_l1d_flush() subsequently invoked before the next VMENTER will conduct a
      L1d flush based on its value and reset that flag again.
      
      Currently, interrupts delivered "normally" while in root operation between
      VMEXIT and VMENTER are not taken into account. Part of the reason is that
      these don't leave any traces and thus, the vmx code is unable to tell if
      any such has happened.
      
      As proposed by Paolo Bonzini, prepare for tracking all interrupts by
      introducing a new per-cpu flag, "kvm_cpu_l1tf_flush_l1d". It will be in
      strong analogy to the per-vcpu ->l1tf_flush_l1d.
      
      A later patch will make interrupt handlers set it.
      
      For the sake of cache locality, group kvm_cpu_l1tf_flush_l1d into x86'
      per-cpu irq_cpustat_t as suggested by Peter Zijlstra.
      
      Provide the helpers kvm_set_cpu_l1tf_flush_l1d(),
      kvm_clear_cpu_l1tf_flush_l1d() and kvm_get_cpu_l1tf_flush_l1d(). Make them
      trivial or non-existent, respectively, for !CONFIG_KVM_INTEL as appropriate.
      
      Let vmx_l1d_flush() handle kvm_cpu_l1tf_flush_l1d in the same way as
      l1tf_flush_l1d.
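
      The per-cpu accessors amount to something like the following sketch:

        static inline void kvm_set_cpu_l1tf_flush_l1d(void)
        {
            __this_cpu_write(irq_stat.kvm_cpu_l1tf_flush_l1d, 1);
        }

        static inline bool kvm_get_cpu_l1tf_flush_l1d(void)
        {
            return __this_cpu_read(irq_stat.kvm_cpu_l1tf_flush_l1d);
        }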
      
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Nicolai Stange <nstange@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      45b575c0
    • x86/irq: Demote irq_cpustat_t::__softirq_pending to u16 · 9aee5f8a
      Nicolai Stange authored
      
      
      An upcoming patch will extend KVM's L1TF mitigation in conditional mode
      to also cover interrupts after VMEXITs. For tracking those, stores to a
      new per-cpu flag from interrupt handlers will become necessary.
      
      In order to improve cache locality, this new flag will be added to x86's
      irq_cpustat_t.
      
      Make some space available there by shrinking the ->softirq_pending bitfield
      from 32 to 16 bits: the number of bits actually used is only NR_SOFTIRQS,
      i.e. 10.
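
      x86's irq_cpustat_t thus ends up looking roughly like this (sketch; only
      the first fields shown, later-added members included for context):

        typedef struct {
            u16 __softirq_pending;          /* was: unsigned int */
            u8  kvm_cpu_l1tf_flush_l1d;     /* added by the follow-up patch */
            unsigned int __nmi_count;
            /* ... */
        } ____cacheline_aligned irq_cpustat_t;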
      
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Nicolai Stange <nstange@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      9aee5f8a
  10. Aug 03, 2018
    • x86/speculation: Support Enhanced IBRS on future CPUs · 706d5168
      Sai Praneeth authored
      Future Intel processors will support "Enhanced IBRS", which is an "always
      on" mode, i.e. the IBRS bit in the SPEC_CTRL MSR is set once and never
      cleared.
      
      From the specification [1]:
      
       "With enhanced IBRS, the predicted targets of indirect branches
        executed cannot be controlled by software that was executed in a less
        privileged predictor mode or on another logical processor. As a
        result, software operating on a processor with enhanced IBRS need not
        use WRMSR to set IA32_SPEC_CTRL.IBRS after every transition to a more
        privileged predictor mode. Software can isolate predictor modes
        effectively simply by setting the bit once. Software need not disable
        enhanced IBRS prior to entering a sleep state such as MWAIT or HLT."
      
      If Enhanced IBRS is supported by the processor then use it as the
      preferred spectre v2 mitigation mechanism instead of Retpoline. Intel's
      Retpoline white paper [2] states:
      
       "Retpoline is known to be an effective branch target injection (Spectre
        variant 2) mitigation on Intel processors belonging to family 6
        (enumerated by the CPUID instruction) that do not have support for
        enhanced IBRS. On processors that support enhanced IBRS, it should be
        used for mitigation instead of retpoline."
      
      The reason why Enhanced IBRS is the recommended mitigation on processors
      which support it is that these processors also support CET which
      provides a defense against ROP attacks. Retpoline is very similar to ROP
      techniques and might trigger false positives in the CET defense.
      
      If Enhanced IBRS is selected as the mitigation technique for spectre v2,
      the IBRS bit in SPEC_CTRL MSR is set once at boot time and never
      cleared. The kernel also has to make sure that the IBRS bit remains set after
      VMEXIT because the guest might have cleared the bit. This is already
      covered by the existing x86_spec_ctrl_set_guest() and
      x86_spec_ctrl_restore_host() speculation control functions.
      
      Enhanced IBRS still requires IBPB for full mitigation.
      
      [1] Speculative-Execution-Side-Channel-Mitigations.pdf
      [2] Retpoline-A-Branch-Target-Injection-Mitigation.pdf
      Both documents are available at:
      https://bugzilla.kernel.org/show_bug.cgi?id=199511
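
      Mitigation selection then boils down to something like this sketch
      (names as in the x86 speculation code; simplified):

        if (boot_cpu_has(X86_FEATURE_IBRS_ENHANCED)) {
            mode = SPECTRE_V2_IBRS_ENHANCED;
            /* set IBRS once; it is never cleared at runtime */
            x86_spec_ctrl_base |= SPEC_CTRL_IBRS;
            wrmsrl(MSR_IA32_SPEC_CTRL, x86_spec_ctrl_base);
        } else {
            /* fall back to retpoline */
        }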
      
      
      
      Originally-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim C Chen <tim.c.chen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Ravi Shankar <ravi.v.shankar@intel.com>
      Link: https://lkml.kernel.org/r/1533148945-24095-1-git-send-email-sai.praneeth.prakhya@intel.com
      706d5168
    • x86/cpufeatures: Add EPT_AD feature bit · 301d328a
      Peter Feiner authored
      
      
      Some Intel processors have an EPT feature whereby the accessed & dirty bits
      in EPT entries can be updated by HW. MSR IA32_VMX_EPT_VPID_CAP exposes the
      presence of this capability.
      
      There is no point in trying to use that new feature bit in the VMX code as
      VMX needs to read the MSR anyway to access other bits, but having the
      feature bit for EPT_AD in place helps virtualization management as it
      exposes "ept_ad" in /proc/cpuinfo/$proc/flags if the feature is present.
      
      [ tglx: Amended changelog ]
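
      The new bit and its detection are roughly the following (the word/bit
      position is recalled from the upstream header, so treat it as
      illustrative):

        #define X86_FEATURE_EPT_AD  ( 8*32+17) /* Intel EPT Accessed/Dirty bits */

        /* during VMX capability detection: */
        rdmsr(MSR_IA32_VMX_EPT_VPID_CAP, ept_cap, dummy);
        if (ept_cap & VMX_EPT_AD_BIT)
            set_cpu_cap(c, X86_FEATURE_EPT_AD);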
      
      Signed-off-by: Peter Feiner <pfeiner@google.com>
      Signed-off-by: Peter Shier <pshier@google.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Link: https://lkml.kernel.org/r/20180801180657.138051-1-pshier@google.com
      301d328a
  11. Jul 29, 2018
  12. Jul 25, 2018