  1. Feb 10, 2019
    • x86/mm: Make set_pmd_at() paravirt aware · 20e55bc1
      Juergen Gross authored
      
      
      set_pmd_at() calls native_set_pmd() unconditionally on x86. This was
      fine as long as only huge page entries were written via set_pmd_at(),
      as Xen pv guests don't support those.
      
      Commit 2c91bd4a ("mm: speed up mremap by 20x on large regions")
      introduced a usage of set_pmd_at() that is reachable on pv guests,
      leading to failures like:
      
      BUG: unable to handle kernel paging request at ffff888023e26778
      #PF error: [PROT] [WRITE]
      RIP: e030:move_page_tables+0x7c1/0xae0
      move_vma.isra.3+0xd1/0x2d0
      __se_sys_mremap+0x3c6/0x5b0
       do_syscall_64+0x49/0x100
      entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Make set_pmd_at() paravirt aware by just letting it use set_pmd().
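      
      For reference, a minimal sketch of the resulting helper (the actual
      change is a one-liner in arch/x86/include/asm/pgtable.h):
      
        static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
                                      pmd_t *pmdp, pmd_t pmd)
        {
                set_pmd(pmdp, pmd);     /* paravirt aware; was native_set_pmd() */
        }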
      
      Fixes: 2c91bd4a ("mm: speed up mremap by 20x on large regions")
      Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
      Signed-off-by: Juergen Gross <jgross@suse.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: xen-devel@lists.xenproject.org
      Cc: boris.ostrovsky@oracle.com
      Cc: sstabellini@kernel.org
      Cc: hpa@zytor.com
      Cc: bp@alien8.de
      Cc: torvalds@linux-foundation.org
      Link: https://lkml.kernel.org/r/20190210074056.11842-1-jgross@suse.com
  2. Feb 02, 2019
    • x86/resctrl: Avoid confusion over the new X86_RESCTRL config · e6d42931
      Johannes Weiner authored
      
      
      "Resource Control" is a very broad term for this CPU feature, and a term
      that is also associated with containers, cgroups etc. This can easily
      cause confusion.
      
      Make the user prompt more specific. Match the config symbol name.
      
       [ bp: In the future, the corresponding ARM arch-specific code will be
         under ARM_CPU_RESCTRL and the arch-agnostic bits will be carved out
         under the CPU_RESCTRL umbrella symbol. ]
      
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Babu Moger <Babu.Moger@amd.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: linux-doc@vger.kernel.org
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Pu Wen <puwen@hygon.cn>
      Cc: Reinette Chatre <reinette.chatre@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190130195621.GA30653@cmpxchg.org
  3. Jan 29, 2019
  4. Jan 15, 2019
    • x86/pkeys: Properly copy pkey state at fork() · a31e184e
      Dave Hansen authored
      
      
      Memory protection key behavior should be the same in a child as it was
      in the parent before a fork.  But, there is a bug that resets the
      state in the child at fork instead of preserving it.
      
      The creation of new mm's is a bit convoluted.  At fork(), the code
      does:
      
        1. memcpy() the parent mm to initialize the child
        2. mm_init() to initialize selected fields
        3. dup_mmap() to create true copies of what memcpy() did not copy right
      
      For pkeys, two pieces of state need to be preserved across a fork:
      'execute_only_pkey' and 'pkey_allocation_map'.
      
      Those are preserved by the memcpy(), but mm_init() invokes
      init_new_context() which overwrites 'execute_only_pkey' and
      'pkey_allocation_map' with "new" values.
      
      The author of the code erroneously believed that init_new_context() is
      *only* called at execve()-time.  But, alas, init_new_context() is used
      at both execve() and fork().
      
      The result is that, after a fork(), the child's pkey state ends up looking
      like it does after an execve(), which is totally wrong.  pkeys that are
      already allocated can be allocated again, for instance.
      
      To fix this, add code called by dup_mmap() to copy the pkey state from
      parent to child explicitly.  Also add a comment above init_new_context() to
      make it more clear to the next poor sod what this code is used for.
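      
      Roughly, the added helper looks like this (a sketch of the x86 side,
      hooked from dup_mmap(), with names following the pkeys code):
      
        static inline void arch_dup_pkeys(struct mm_struct *oldmm,
                                          struct mm_struct *mm)
        {
        #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
                if (!cpu_feature_enabled(X86_FEATURE_OSPKE))
                        return;
      
                /* Duplicate the parent's pkey state in the child: */
                mm->context.pkey_allocation_map = oldmm->context.pkey_allocation_map;
                mm->context.execute_only_pkey   = oldmm->context.execute_only_pkey;
        #endif
        }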
      
      Fixes: e8c24d3a ("x86/pkeys: Allocation/free syscalls")
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: bp@alien8.de
      Cc: hpa@zytor.com
      Cc: peterz@infradead.org
      Cc: mpe@ellerman.id.au
      Cc: will.deacon@arm.com
      Cc: luto@kernel.org
      Cc: jroedel@suse.de
      Cc: stable@vger.kernel.org
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Joerg Roedel <jroedel@suse.de>
      Link: https://lkml.kernel.org/r/20190102215655.7A69518C@viggo.jf.intel.com
  5. Jan 09, 2019
  6. Jan 06, 2019
  7. Jan 05, 2019
    • x86: re-introduce non-generic memcpy_{to,from}io · 170d13ca
      Linus Torvalds authored
      
      
      This has been broken forever, and nobody ever really noticed because
      it's purely a performance issue.
      
      Long long ago, in commit 6175ddf0 ("x86: Clean up mem*io functions")
      Brian Gerst simplified the memory copies to and from iomem, since on
      x86, the instructions to access iomem are exactly the same as the
      regular instructions.
      
      That is technically true, and things worked, and nobody said anything.
      Besides, back then the regular memcpy was pretty simple and worked fine.
      
      Nobody noticed, that is, except for David Laight.  David was testing a
      TLP monitor he was writing for an FPGA, and has been occasionally
      complaining about how memcpy_toio() writes things one byte at a time.
      
      Which is completely unacceptable from a performance standpoint, even if
      it happens to technically work.
      
      The reason it's writing one byte at a time is because while it's
      technically true that accesses to iomem are the same as accesses to
      regular memory on x86, the _granularity_ (and ordering) of accesses
      matter to iomem in ways that they don't matter to regular cached memory.
      
      In particular, when ERMS is set, we default to using "rep movsb" for
      larger memory copies.  That is indeed perfectly fine for real memory,
      since the whole point is that the CPU is going to do cacheline
      optimizations and executes the memory copy efficiently for cached
      memory.
      
      With iomem? Not so much.  With iomem, "rep movsb" will indeed work, but
      it will copy things one byte at a time. Slowly and ponderously.
      
      Now, originally, back in 2010 when commit 6175ddf0 was done, we
      didn't use ERMS, and this was much less noticeable.
      
      Our normal memcpy() was simpler in other ways too.
      
      Because in fact, it's not just about using the string instructions.  Our
      memcpy() these days does things like "read and write overlapping values"
      to handle the last bytes of the copy.  Again, for normal memory,
      overlapping accesses aren't an issue.  For iomem? They can be.
      
      So this re-introduces the specialized memcpy_toio(), memcpy_fromio() and
      memset_io() functions.  It doesn't particularly optimize them, but it
      tries to at least not be horrid, or do overlapping accesses.  In fact,
      this uses the existing __inline_memcpy() function that we still had
      lying around that uses our very traditional "rep movsl" loop followed by
      movsw/movsb for the final bytes.
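      
      Schematically, the iomem copy does something like this (an illustrative
      sketch, not the exact kernel code; sketch_memcpy_toio is a made-up name):
      
        static void sketch_memcpy_toio(volatile void __iomem *dst,
                                       const void *src, size_t n)
        {
                /* 4-byte chunks first (the "rep movsl" part)... */
                while (n >= 4) {
                        writel(*(const u32 *)src, dst);
                        dst += 4; src += 4; n -= 4;
                }
                /* ...then the movsw/movsb tail for the final bytes. */
                if (n >= 2) {
                        writew(*(const u16 *)src, dst);
                        dst += 2; src += 2; n -= 2;
                }
                if (n)
                        writeb(*(const u8 *)src, dst);
        }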
      
      Somebody may decide to try to improve on it, but if we've gone almost a
      decade with only one person really ever noticing and complaining, maybe
      it's not worth worrying about further, once it's not _completely_ broken?
      
      Reported-by: David Laight <David.Laight@aculab.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Use __put_user_goto in __put_user_size() and unsafe_put_user() · a959dc88
      Linus Torvalds authored
      
      
      This actually enables the __put_user_goto() functionality in
      unsafe_put_user().
      
      For an example of the effect of this, this is the code generated for the
      
              unsafe_put_user(signo, &infop->si_signo, Efault);
      
      in the waitid() system call:
      
      	movl %ecx,(%rbx)        # signo, MEM[(struct __large_struct *)_2]
      
      It's just one single store instruction, along with generating an
      exception table entry pointing to the Efault label case in case that
      instruction faults.
      
      Before, we would generate this:
      
      	xorl    %edx, %edx
      	movl %ecx,(%rbx)        # signo, MEM[(struct __large_struct *)_3]
              testl   %edx, %edx
              jne     .L309
      
      with the exception table generated for that 'mov' instruction causing us
      to jump to a stub that set %edx to -EFAULT and then jumped back to the
      'testl' instruction.
      
      So not only do we now get rid of the extra code in the normal sequence,
      we also avoid unnecessarily keeping that extra error register live
      across it all.
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86 uaccess: Introduce __put_user_goto · 4a789213
      Linus Torvalds authored
      
      
      This is finally the actual reason for the odd error handling in the
      "unsafe_get/put_user()" functions, introduced over three years ago.
      
      Using a "jump to error label" interface is somewhat odd, but very
      convenient as a programming interface, and more importantly, it fits
      very well with simply making the target be the exception handler address
      directly from the inline asm.
      
      The reason it took over three years to actually do this? We need "asm
      goto" support for it, which only became the default on x86 last year.
      It's now been a year since we started forcing asm goto support (see
      commit e501ce95 "x86: Force asm-goto"), so let's just do it here too.
      
      [ Side note: this commit was originally done back in 2016. The above
        commentary about timing is obviously about it only now getting merged
        into my real upstream tree     - Linus ]
      
      Sadly, gcc still only supports "asm goto" with asms that do not have any
      outputs, so we are limited to only the put_user case for this.  Maybe in
      several more years we can do the get_user case too.
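      
      The core pattern looks roughly like this (simplified from the real
      __put_user_goto macro, which also parameterizes the operand size):
      
        #define put_user_goto_sketch(x, ptr, label)                     \
                asm_volatile_goto("1:   mov %1, %0\n"                   \
                                  _ASM_EXTABLE(1b, %l2)                 \
                                  : /* asm goto allows no outputs */    \
                                  : "m" (*(ptr)), "r" (x)               \
                                  : : label)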
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. Jan 04, 2019
    • mm: treewide: remove unused address argument from pte_alloc functions · 4cf58924
      Joel Fernandes (Google) authored
      Patch series "Add support for fast mremap".
      
      This series speeds up the mremap(2) syscall by copying page tables at
      the PMD level even for non-THP systems.  There is concern that the extra
      'address' argument that mremap passes to pte_alloc may do something
      subtle and architecture-related in the future that may make the scheme
      not work.  Also, we find that there is no point in passing the 'address'
      to pte_alloc since it's unused.  This patch therefore removes this
      argument tree-wide, resulting in a nice negative diff as well.  Along
      the way, it ensures that the enabled architectures do not do anything
      funky with the 'address' argument that would go unnoticed by the
      optimization.
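      
      For example, for __pte_alloc() the conversion just drops the last
      parameter:
      
        /* before */
        int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address);
        /* after */
        int __pte_alloc(struct mm_struct *mm, pmd_t *pmd);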
      
      Build and boot tested on x86-64.  Build tested on arm64.  The config
      enablement patch for arm64 will be posted in the future after more
      testing.
      
      The changes were obtained by applying the following Coccinelle script
      (thanks to Julia for answering all the Coccinelle questions!).
      The following fix-ups were done manually:
      * Removal of the address argument from pte_fragment_alloc
      * Removal of the pte_alloc_one_fast definitions from m68k and microblaze.
      
      // Options: --include-headers --no-includes
      // Note: I split the 'identifier fn' line, so if you are manually
      // running it, please unsplit it so it runs for you.
      
      virtual patch
      
      @pte_alloc_func_def depends on patch exists@
      identifier E2;
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      type T2;
      @@
      
       fn(...
      - , T2 E2
       )
       { ... }
      
      @pte_alloc_func_proto_noarg depends on patch exists@
      type T1, T2, T3, T4;
      identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      @@
      
      (
      - T3 fn(T1, T2);
      + T3 fn(T1);
      |
      - T3 fn(T1, T2, T4);
      + T3 fn(T1, T2);
      )
      
      @pte_alloc_func_proto depends on patch exists@
      identifier E1, E2, E4;
      type T1, T2, T3, T4;
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      @@
      
      (
      - T3 fn(T1 E1, T2 E2);
      + T3 fn(T1 E1);
      |
      - T3 fn(T1 E1, T2 E2, T4 E4);
      + T3 fn(T1 E1, T2 E2);
      )
      
      @pte_alloc_func_call depends on patch exists@
      expression E2;
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      @@
      
       fn(...
      -,  E2
       )
      
      @pte_alloc_macro depends on patch exists@
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      identifier a, b, c;
      expression e;
      position p;
      @@
      
      (
      - #define fn(a, b, c) e
      + #define fn(a, b) e
      |
      - #define fn(a, b) e
      + #define fn(a) e
      )
      
      Link: http://lkml.kernel.org/r/20181108181201.88826-2-joelaf@google.com
      
      
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
      Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Julia Lawall <Julia.Lawall@lip6.fr>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fls: change parameter to unsigned int · 3fc2579e
      Matthew Wilcox authored
      When testing in userspace, UBSAN pointed out that shifting into the sign
      bit is undefined behaviour.  It doesn't really make sense to ask for the
      highest set bit of a negative value, so just turn the argument type into
      an unsigned int.
      
      Some architectures (e.g. ppc) already had it declared as an unsigned
      int, so I don't expect too many problems.
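      
      As an illustration, a generic definition along these lines keeps the
      shift well-defined (a sketch; the kernel's asm-generic version uses an
      explicit shift cascade instead of the builtin):
      
        static inline int fls(unsigned int x)   /* was: int x */
        {
                return x ? 32 - __builtin_clz(x) : 0;
        }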
      
      Link: http://lkml.kernel.org/r/20181105221117.31828-1-willy@infradead.org
      
      
      Signed-off-by: Matthew Wilcox <willy@infradead.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • make 'user_access_begin()' do 'access_ok()' · 594cc251
      Linus Torvalds authored
      
      
      Originally, the rule used to be that you'd have to do access_ok()
      separately, and then user_access_begin() before actually doing the
      direct (optimized) user access.
      
      But experience has shown that people then decide not to do access_ok()
      at all, and instead rely on it being implied by other operations or
      similar.  Which makes it very hard to verify that the access has
      actually been range-checked.
      
      If you use the unsafe direct user accesses, hardware features (either
      SMAP - Supervisor Mode Access Protection - on x86, or PAN - Privileged
      Access Never - on ARM) do force you to use user_access_begin().  But
      nothing really forces the range check.
      
      By putting the range check into user_access_begin(), we actually force
      people to do the right thing (tm), and the range check will be visible
      near the actual accesses.  We have way too long a history of people
      trying to avoid them.
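      
      On x86 the combined helper is then essentially (a sketch of the new
      contract, not the verbatim implementation):
      
        static inline bool user_access_begin(const void __user *ptr, size_t len)
        {
                if (unlikely(!access_ok(ptr, len)))
                        return false;
                __uaccess_begin();      /* stac(): lift SMAP protection */
                return true;
        }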
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Remove 'type' argument from access_ok() function · 96d4f267
      Linus Torvalds authored
      
      
      Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
      of the user address range verification function since we got rid of the
      old racy i386-only code to walk page tables by hand.
      
      It existed because the original 80386 would not honor the write protect
      bit when in kernel mode, so you had to do COW by hand before doing any
      user access.  But we haven't supported that in a long time, and these
      days the 'type' argument is a purely historical artifact.
      
      A discussion about extending 'user_access_begin()' to do the range
      checking resulted in this patch, because there is no way we're going to
      move the old VERIFY_xyz interface to that model.  And it's best done at
      the end of the merge window when I've done most of my merges, so let's
      just get this done once and for all.
      
      This patch was mostly done with a sed-script, with manual fix-ups for
      the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
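      
      The trivial form of the conversion is just:
      
        - if (!access_ok(VERIFY_WRITE, buf, len))
        + if (!access_ok(buf, len))
                  return -EFAULT;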
      
      There were a couple of notable cases:
      
       - csky still had the old "verify_area()" name as an alias.
      
       - the iter_iov code had magical hardcoded knowledge of the actual
         values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
         really used it)
      
       - microblaze used the type argument for a debug printout
      
      but other than those oddities this should be a total no-op patch.
      
      I tried to fix up all architectures, did fairly extensive grepping for
      access_ok() uses, and the changes are trivial, but I may have missed
      something.  Any missed conversion should be trivially fixable, though.
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. Dec 28, 2018
  10. Dec 21, 2018
    • KVM: x86: Use jmp to invoke kvm_spurious_fault() from .fixup · e8143499
      Sean Christopherson authored
      
      
      ____kvm_handle_fault_on_reboot() provides a generic exception fixup
      handler that is used to cleanly handle faults on VMX/SVM instructions
      during reboot (or at least try to).  If there isn't a reboot in
      progress, ____kvm_handle_fault_on_reboot() treats any exception as
      fatal to KVM and invokes kvm_spurious_fault(), which in turn generates
      a BUG() to get a stack trace and die.
      
      When it was originally added by commit 4ecac3fd ("KVM: Handle
      virtualization instruction #UD faults during reboot"), the "call" to
      kvm_spurious_fault() was handcoded as PUSH+JMP, where the PUSH'd value
      is the RIP of the faulting instruction.
      
      The PUSH+JMP trickery is necessary because the exception fixup handler
      code lies outside of its associated function, e.g. right after the
      function.  An actual CALL from the .fixup code would show a slightly
      bogus stack trace, e.g. an extra "random" function would be inserted
      into the trace, as the return RIP on the stack would point to no known
      function (and the unwinder will likely try to guess who owns the RIP).
      
      Unfortunately, the JMP was replaced with a CALL when the macro was
      reworked to not spin indefinitely during reboot (commit b7c4145b
      "KVM: Don't spin on virt instruction faults during reboot").  This
      causes the aforementioned behavior where a bogus function is inserted
      into the stack trace, e.g. my builds like to blame free_kvm_area().
      
      Revert the CALL back to a JMP.  The changelog for commit b7c4145b
      ("KVM: Don't spin on virt instruction faults during reboot") contains
      nothing that indicates the switch to CALL was deliberate.  This is
      backed up by the fact that the PUSH <insn RIP> was left intact.
      
      Note that an alternative to the PUSH+JMP magic would be to JMP back
      to the "real" code and CALL from there, but that would require adding
      a JMP in the non-faulting path to avoid calling kvm_spurious_fault()
      and would add no value, i.e. the stack trace would be the same.
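      
      Schematically, the fixup then looks like this (a simplified sketch of
      the asm emitted by ____kvm_handle_fault_on_reboot(); labels are
      illustrative):
      
        666:    <vmx/svm insn>          /* may fault */
        667:    ...                     /* normal, non-faulting path */
      
        .fixup:
        668:    cmpb $0, kvm_rebooting
                jne  667b               /* reboot in progress: eat the fault */
                push $666b              /* fake "return RIP" = faulting insn */
                jmp  kvm_spurious_fault /* JMP, not CALL */
        (extable: fault at 666 -> jump to 668)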
      
      Using CALL:
      
      ------------[ cut here ]------------
      kernel BUG at /home/sean/go/src/kernel.org/linux/arch/x86/kvm/x86.c:356!
      invalid opcode: 0000 [#1] SMP
      CPU: 4 PID: 1057 Comm: qemu-system-x86 Not tainted 4.20.0-rc6+ #75
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
      RIP: 0010:kvm_spurious_fault+0x5/0x10 [kvm]
      Code: <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 55 49 89 fd 41
      RSP: 0018:ffffc900004bbcc8 EFLAGS: 00010046
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffffffffffff
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: ffff888273fd8000 R08: 00000000000003e8 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000784 R12: ffffc90000371fb0
      R13: 0000000000000000 R14: 000000026d763cf4 R15: ffff888273fd8000
      FS:  00007f3d69691700(0000) GS:ffff888277800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000055f89bc56fe0 CR3: 0000000271a5a001 CR4: 0000000000362ee0
      Call Trace:
       free_kvm_area+0x1044/0x43ea [kvm_intel]
       ? vmx_vcpu_run+0x156/0x630 [kvm_intel]
       ? kvm_arch_vcpu_ioctl_run+0x447/0x1a40 [kvm]
       ? kvm_vcpu_ioctl+0x368/0x5c0 [kvm]
       ? kvm_vcpu_ioctl+0x368/0x5c0 [kvm]
       ? __set_task_blocked+0x38/0x90
       ? __set_current_blocked+0x50/0x60
       ? __fpu__restore_sig+0x97/0x490
       ? do_vfs_ioctl+0xa1/0x620
       ? __x64_sys_futex+0x89/0x180
       ? ksys_ioctl+0x66/0x70
       ? __x64_sys_ioctl+0x16/0x20
       ? do_syscall_64+0x4f/0x100
       ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
      Modules linked in: vhost_net vhost tap kvm_intel kvm irqbypass bridge stp llc
      ---[ end trace 9775b14b123b1713 ]---
      
      Using JMP:
      
      ------------[ cut here ]------------
      kernel BUG at /home/sean/go/src/kernel.org/linux/arch/x86/kvm/x86.c:356!
      invalid opcode: 0000 [#1] SMP
      CPU: 6 PID: 1067 Comm: qemu-system-x86 Not tainted 4.20.0-rc6+ #75
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
      RIP: 0010:kvm_spurious_fault+0x5/0x10 [kvm]
      Code: <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 55 49 89 fd 41
      RSP: 0018:ffffc90000497cd0 EFLAGS: 00010046
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffffffffffff
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: ffff88827058bd40 R08: 00000000000003e8 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000784 R12: ffffc90000369fb0
      R13: 0000000000000000 R14: 00000003c8fc6642 R15: ffff88827058bd40
      FS:  00007f3d7219e700(0000) GS:ffff888277900000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f3d64001000 CR3: 0000000271c6b004 CR4: 0000000000362ee0
      Call Trace:
       vmx_vcpu_run+0x156/0x630 [kvm_intel]
       ? kvm_arch_vcpu_ioctl_run+0x447/0x1a40 [kvm]
       ? kvm_vcpu_ioctl+0x368/0x5c0 [kvm]
       ? kvm_vcpu_ioctl+0x368/0x5c0 [kvm]
       ? __set_task_blocked+0x38/0x90
       ? __set_current_blocked+0x50/0x60
       ? __fpu__restore_sig+0x97/0x490
       ? do_vfs_ioctl+0xa1/0x620
       ? __x64_sys_futex+0x89/0x180
       ? ksys_ioctl+0x66/0x70
       ? __x64_sys_ioctl+0x16/0x20
       ? do_syscall_64+0x4f/0x100
       ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
      Modules linked in: vhost_net vhost tap kvm_intel kvm irqbypass bridge stp llc
      ---[ end trace f9daedb85ab3ddba ]---
      
      Fixes: b7c4145b ("KVM: Don't spin on virt instruction faults during reboot")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM/x86: Use SVM assembly instruction mnemonics instead of .byte streams · ac5ffda2
      Uros Bizjak authored
      
      
      Recently the minimum required version of binutils was changed to 2.20,
      which supports all SVM instruction mnemonics. The patch removes
      all .byte #defines and uses real instruction mnemonics instead.
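      
      Illustratively, for VMLOAD (opcode bytes as in the removed #define;
      64-bit operand form shown):
      
        /* before */
        asm volatile (".byte 0x0f, 0x01, 0xda" : : "a"(pa));    /* SVM_VMLOAD */
        /* after */
        asm volatile ("vmload %%rax" : : "a"(pa));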
      
      Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Make kvm_set_spte_hva() return int · 748c0e31
      Lan Tianyu authored
      
      
      Make kvm_set_spte_hva() return int, so that the caller can check the
      return value to determine whether to flush the TLB.
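      
      The intended call pattern in the generic MMU-notifier path is then
      (a sketch):
      
        if (kvm_set_spte_hva(kvm, address, pte))
                kvm_flush_remote_tlbs(kvm);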
      
      Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
      Acked-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/hyper-v: Add HvFlushGuestAddressList hypercall support · cc4edae4
      Lan Tianyu authored
      
      
      Hyper-V provides the HvFlushGuestAddressList() hypercall to flush the
      EPT TLB for specified ranges.  Add support for this hypercall.
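      
      The helper exported for this has roughly the following shape (signature
      assumed from the series; the fill callback builds the GPA list passed
      to the hypercall):
      
        int hyperv_flush_guest_mapping_range(u64 as,
                        hyperv_fill_flush_list_func fill_flush_list_func,
                        void *data);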
      
      Reviewed-by: Michael Kelley <mikelley@microsoft.com>
      Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Add tlb_remote_flush_with_range callback in kvm_x86_ops · a49b9635
      Lan Tianyu authored
      
      
      Add a range-flush callback to kvm_x86_ops so that a platform can use it
      to register its associated function.  The parameter "kvm_tlb_range"
      accepts either a single range or a flush list containing a list of
      ranges.
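      
      A sketch of the new pieces, matching the description above:
      
        struct kvm_tlb_range {
                u64 start_gfn;
                u64 pages;
        };
      
        /* in struct kvm_x86_ops: */
        int (*tlb_remote_flush_with_range)(struct kvm *kvm,
                                           struct kvm_tlb_range *range);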
      
      Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Add Intel Processor Trace cpuid emulation · 86f5201d
      Chao Peng authored
      
      
      Expose Intel Processor Trace to the guest only when PT works in
      Host-Guest mode.
      
      Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
      Signed-off-by: Luwei Kang <luwei.kang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Add Intel PT virtualization work mode · f99e3daf
      Chao Peng authored
      
      
      Intel Processor Trace virtualization can work in one of two possible
      modes:
      
      a. System-Wide mode (default):
         When the host configures Intel PT to collect trace packets
         of the entire system, it can leave the relevant VMX controls
         clear to allow VMX-specific packets to provide information
         across VMX transitions.
         The KVM guest will not be aware of this feature in this mode,
         and both host and KVM guest traces are output to the host buffer.
      
      b. Host-Guest mode:
         The host can configure trace-packet generation separately for
         VMX non-root operation (guests) and for VMX root operation
         (normal native execution).
         Intel PT will be exposed to the KVM guest in this mode, and
         the trace output goes to the respective buffers of host and
         guest.  In this mode, the PT state will be saved and PT
         disabled before VM-entry, and restored after VM-exit, when
         tracing a virtual machine.
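      
      In code, the two modes boil down to an enum along these lines (names
      assumed from the VMX side of the series):
      
        enum pt_mode {
                PT_MODE_SYSTEM = 0,     /* host traces everything;
                                         * guest unaware of PT */
                PT_MODE_HOST_GUEST,     /* PT exposed to the guest; PT state
                                         * swapped at VM-entry/VM-exit */
        };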
      
      Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
      Signed-off-by: Luwei Kang <luwei.kang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • perf/x86/intel/pt: add new capability for Intel PT · e0018afe
      Luwei Kang authored
      
      
      This adds support for the "output to Trace Transport subsystem"
      capability of Intel PT.  It means that PT can output its trace to an
      MMIO address range rather than a system memory buffer.
      
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Luwei Kang <luwei.kang@intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • perf/x86/intel/pt: Add new bit definitions for PT MSRs · 69843a91
      Luwei Kang authored
      
      
      Add bit definitions for Intel PT MSRs to support trace output directed
      to the memory subsystem, and a count of packet bytes that have been
      sent out.
      
      These are required by the upcoming PT support in KVM guests
      for MSRs read/write emulation.
      
      Signed-off-by: Luwei Kang <luwei.kang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • perf/x86/intel/pt: Introduce intel_pt_validate_cap() · 61be2998
      Luwei Kang authored
      
      
      intel_pt_validate_hw_cap() validates whether a given PT capability is
      supported by the hardware. It checks the PT capability array which
      reflects the capabilities of the hardware on which the code is executed.
      
      For setting up PT for KVM guests this is not correct as the capability
      array for the guest can be different from the host array.
      
      Provide a new function to check against a given capability array.
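      
      The new function is essentially the same table lookup, parameterized by
      the capability array (a sketch; helper and constant names follow the PT
      driver):
      
        u32 intel_pt_validate_cap(u32 *caps, enum pt_capabilities capability)
        {
                struct pt_cap_desc *cd = &pt_caps[capability];
                u32 c = caps[cd->leaf * PT_CPUID_REGS_NUM + cd->reg];
                unsigned int shift = __ffs(cd->mask);
      
                return (c & cd->mask) >> shift;
        }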
      
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Luwei Kang <luwei.kang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • perf/x86/intel/pt: Export pt_cap_get() · f6d079ce
      Chao Peng authored
      
      
      pt_cap_get() is required by the upcoming PT support in KVM guests.
      
      Export it and move the capabilities enum to a global header.
      
      As a global function prefix, "pt_*" is already used for ptrace and
      other things, so it makes sense to use "intel_pt_*" as a prefix.
      
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
      Signed-off-by: Luwei Kang <luwei.kang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • perf/x86/intel/pt: Move Intel PT MSRs bit defines to global header · 887eda13
      Chao Peng authored
      
      
      The Intel Processor Trace (PT) MSR bit defines are in a private
      header. The upcoming support for PT virtualization requires these defines
      to be accessible from KVM code.
      
      Move them to the global MSR header file.
      
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
      Signed-off-by: Luwei Kang <luwei.kang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  11. Dec 20, 2018
  12. Dec 19, 2018
  13. Dec 18, 2018
    • kvm: x86: Add AMD's EX_CFG to the list of ignored MSRs · 0e1b869f
      Eduardo Habkost authored
      
      
      Some guest OSes (including Windows 10) write to MSR 0xc001102c
      in some cases (possibly while trying to apply a CPU erratum).
      Make KVM ignore reads and writes to that MSR, so the guest won't
      crash.
      
      The MSR is documented as "Execution Unit Configuration (EX_CFG)"
      in AMD's "BIOS and Kernel Developer's Guide (BKDG) for AMD Family
      15h Models 00h-0Fh Processors".
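      
      The change amounts to adding the MSR to the ignored cases in the
      get/set paths (a sketch; the MSR #define name is per the patch):
      
        /* kvm_get_msr_common(): */
        case MSR_AMD64_EX_CFG:
                msr_info->data = 0;     /* reads return 0 */
                break;
      
        /* kvm_set_msr_common(): */
        case MSR_AMD64_EX_CFG:
                break;                  /* writes are silently ignored */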
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/fsgsbase/64: Fix the base write helper functions · 87ab4689
      Chang S. Bae authored
      
      
      Andy spotted a regression in the fs/gs base helpers after the patch
      series was committed.  The helper functions which write the fs/gs base
      are not just writing the base, they are also changing the index.  That's
      wrong and needs to be separated, because writing the base must not
      modify the index.
      
      While the regression is not causing any harm right now because the only
      caller depends on that behaviour, it's a guarantee for subtle breakage down
      the road.
      
      Make the caller change the index explicitly, instead of including that
      code in the helpers.
      
      Subsequently, the task write helpers no longer handle the current task.
      The range check for a base value is also factored out, to minimize code
      redundancy in the callers.
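      
      After the change, the CPU write helper touches only the base (a sketch;
      changing the index is now the caller's job):
      
        void x86_fsbase_write_cpu(unsigned long fsbase)
        {
                wrmsrl(MSR_FS_BASE, fsbase);    /* base only, index untouched */
        }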
      
      Fixes: b1378a56 ("x86/fsgsbase/64: Introduce FS/GS base helper functions")
      Suggested-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Andy Lutomirski <luto@kernel.org>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Ravi Shankar <ravi.v.shankar@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Link: https://lkml.kernel.org/r/20181126195524.32179-1-chang.seok.bae@intel.com
    • x86/speculation: Add support for STIBP always-on preferred mode · 20c3a2c3
      Thomas Lendacky authored
      
      
      Different AMD processors may have different implementations of STIBP.
      When STIBP is conditionally enabled, some implementations would benefit
      from having STIBP always on instead of toggling the STIBP bit through MSR
      writes. This preference is advertised through a CPUID feature bit.
      
      When conditional STIBP support is requested at boot and the CPU advertises
      STIBP always-on mode as preferred, switch to STIBP "on" support. To show
      that this transition has occurred, create a new spectre_v2_user_mitigation
      value and a new spectre_v2_user_strings message. The new mitigation value
      is used in spectre_v2_user_select_mitigation() to print the new mitigation
      message as well as to return a new string from stibp_state().
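      
      Schematically, the selection logic gains a step like this (a sketch;
      enum and feature-bit names follow the series):
      
        if (mode != SPECTRE_V2_USER_STRICT &&
            boot_cpu_has(X86_FEATURE_AMD_STIBP_ALWAYS_ON))
                mode = SPECTRE_V2_USER_STRICT_PREFERRED;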
      
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Link: https://lkml.kernel.org/r/20181213230352.6937.74943.stgit@tlendack-t1.amdoffice.net