Skip to content
  1. Aug 16, 2019
  2. Aug 05, 2019
    • Paolo Bonzini's avatar
      KVM: remove kvm_arch_has_vcpu_debugfs() · 741cbbae
      Paolo Bonzini authored
      
      
      There is no need for this function as all arches have to implement
      kvm_arch_create_vcpu_debugfs() no matter what.  A #define symbol
      let us actually simplify the code.
      
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      741cbbae
    • Wanpeng Li's avatar
      KVM: Fix leak vCPU's VMCS value into other pCPU · 17e433b5
      Wanpeng Li authored
      
      
      After commit d73eb57b (KVM: Boost vCPUs that are delivering interrupts), a
      five years old bug is exposed. Running ebizzy benchmark in three 80 vCPUs VMs
      on one 80 pCPUs Skylake server, a lot of rcu_sched stall warning splatting
      in the VMs after stress testing:
      
       INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
       Call Trace:
         flush_tlb_mm_range+0x68/0x140
         tlb_flush_mmu.part.75+0x37/0xe0
         tlb_finish_mmu+0x55/0x60
         zap_page_range+0x142/0x190
         SyS_madvise+0x3cd/0x9c0
         system_call_fastpath+0x1c/0x21
      
      swait_active() sustains to be true before finish_swait() is called in
      kvm_vcpu_block(), voluntarily preempted vCPUs are taken into account
      by kvm_vcpu_on_spin() loop greatly increases the probability condition
      kvm_arch_vcpu_runnable(vcpu) is checked and can be true, when APICv
      is enabled the yield-candidate vCPU's VMCS RVI field leaks(by
      vmx_sync_pir_to_irr()) into spinning-on-a-taken-lock vCPU's current
      VMCS.
      
      This patch fixes it by checking conservatively a subset of events.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Marc Zyngier <Marc.Zyngier@arm.com>
      Cc: stable@vger.kernel.org
      Fixes: 98f4a146 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop)
      Signed-off-by: default avatarWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      17e433b5
  3. Jul 30, 2019
  4. Jul 28, 2019
  5. Jul 22, 2019
    • Wanpeng Li's avatar
      KVM: X86: Dynamically allocate user_fpu · d9a710e5
      Wanpeng Li authored
      
      
      After reverting commit 240c35a3 (kvm: x86: Use task structs fpu field
      for user), struct kvm_vcpu is 19456 bytes on my server, PAGE_ALLOC_COSTLY_ORDER(3)
      is the order at which allocations are deemed costly to service. In serveless
      scenario, one host can service hundreds/thoudands firecracker/kata-container
      instances, howerver, new instance will fail to launch after memory is too
      fragmented to allocate kvm_vcpu struct on host, this was observed in some
      cloud provider product environments.
      
      This patch dynamically allocates user_fpu, kvm_vcpu is 15168 bytes now on my
      Skylake server.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: default avatarWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      d9a710e5
    • Paolo Bonzini's avatar
      Revert "kvm: x86: Use task structs fpu field for user" · ec269475
      Paolo Bonzini authored
      
      
      This reverts commit 240c35a3
      ("kvm: x86: Use task structs fpu field for user", 2018-11-06).
      The commit is broken and causes QEMU's FPU state to be destroyed
      when KVM_RUN is preempted.
      
      Fixes: 240c35a3 ("kvm: x86: Use task structs fpu field for user")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      ec269475
  6. Jul 18, 2019
    • Josh Poimboeuf's avatar
      x86/kvm: Don't call kvm_spurious_fault() from .fixup · 3901336e
      Josh Poimboeuf authored
      
      
      After making a change to improve objtool's sibling call detection, it
      started showing the following warning:
      
        arch/x86/kvm/vmx/nested.o: warning: objtool: .fixup+0x15: sibling call from callable instruction with modified stack frame
      
      The problem is the ____kvm_handle_fault_on_reboot() macro.  It does a
      fake call by pushing a fake RIP and doing a jump.  That tricks the
      unwinder into printing the function which triggered the exception,
      rather than the .fixup code.
      
      Instead of the hack to make it look like the original function made the
      call, just change the macro so that the original function actually does
      make the call.  This allows removal of the hack, and also makes objtool
      happy.
      
      I triggered a vmx instruction exception and verified that the stack
      trace is still sane:
      
        kernel BUG at arch/x86/kvm/x86.c:358!
        invalid opcode: 0000 [#1] SMP PTI
        CPU: 28 PID: 4096 Comm: qemu-kvm Not tainted 5.2.0+ #16
        Hardware name: Lenovo THINKSYSTEM SD530 -[7X2106Z000]-/-[7X2106Z000]-, BIOS -[TEE113Z-1.00]- 07/17/2017
        RIP: 0010:kvm_spurious_fault+0x5/0x10
        Code: 00 00 00 00 00 8b 44 24 10 89 d2 45 89 c9 48 89 44 24 10 8b 44 24 08 48 89 44 24 08 e9 d4 40 22 00 0f 1f 40 00 0f 1f 44 00 00 <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 55 49 89 fd 41
        RSP: 0018:ffffbf91c683bd00 EFLAGS: 00010246
        RAX: 000061f040000000 RBX: ffff9e159c77bba0 RCX: ffff9e15a5c87000
        RDX: 0000000665c87000 RSI: ffff9e15a5c87000 RDI: ffff9e159c77bba0
        RBP: 0000000000000000 R08: 0000000000000000 R09: ffff9e15a5c87000
        R10: 0000000000000000 R11: fffff8f2d99721c0 R12: ffff9e159c77bba0
        R13: ffffbf91c671d960 R14: ffff9e159c778000 R15: 0000000000000000
        FS:  00007fa341cbe700(0000) GS:ffff9e15b7400000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fdd38356804 CR3: 00000006759de003 CR4: 00000000007606e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        PKRU: 55555554
        Call Trace:
         loaded_vmcs_init+0x4f/0xe0
         alloc_loaded_vmcs+0x38/0xd0
         vmx_create_vcpu+0xf7/0x600
         kvm_vm_ioctl+0x5e9/0x980
         ? __switch_to_asm+0x40/0x70
         ? __switch_to_asm+0x34/0x70
         ? __switch_to_asm+0x40/0x70
         ? __switch_to_asm+0x34/0x70
         ? free_one_page+0x13f/0x4e0
         do_vfs_ioctl+0xa4/0x630
         ksys_ioctl+0x60/0x90
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x55/0x1c0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7fa349b1ee5b
      
      Signed-off-by: default avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Acked-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Acked-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/64a9b64d127e87b6920a97afde8e96ea76f6524e.1563413318.git.jpoimboe@redhat.com
      3901336e
    • Josh Poimboeuf's avatar
      x86/paravirt: Fix callee-saved function ELF sizes · 083db676
      Josh Poimboeuf authored
      
      
      The __raw_callee_save_*() functions have an ELF symbol size of zero,
      which confuses objtool and other tools.
      
      Fixes a bunch of warnings like the following:
      
        arch/x86/xen/mmu_pv.o: warning: objtool: __raw_callee_save_xen_pte_val() is missing an ELF size annotation
        arch/x86/xen/mmu_pv.o: warning: objtool: __raw_callee_save_xen_pgd_val() is missing an ELF size annotation
        arch/x86/xen/mmu_pv.o: warning: objtool: __raw_callee_save_xen_make_pte() is missing an ELF size annotation
        arch/x86/xen/mmu_pv.o: warning: objtool: __raw_callee_save_xen_make_pgd() is missing an ELF size annotation
      
      Signed-off-by: default avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarJuergen Gross <jgross@suse.com>
      Acked-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/afa6d49bb07497ca62e4fc3b27a2d0cece545b4e.1563413318.git.jpoimboe@redhat.com
      083db676
  7. Jul 17, 2019
  8. Jul 16, 2019
    • Qian Cai's avatar
      x86/apic: Silence -Wtype-limits compiler warnings · ec633558
      Qian Cai authored
      
      
      There are many compiler warnings like this,
      
      In file included from ./arch/x86/include/asm/smp.h:13,
                       from ./arch/x86/include/asm/mmzone_64.h:11,
                       from ./arch/x86/include/asm/mmzone.h:5,
                       from ./include/linux/mmzone.h:969,
                       from ./include/linux/gfp.h:6,
                       from ./include/linux/mm.h:10,
                       from arch/x86/kernel/apic/io_apic.c:34:
      arch/x86/kernel/apic/io_apic.c: In function 'check_timer':
      ./arch/x86/include/asm/apic.h:37:11: warning: comparison of unsigned
      expression >= 0 is always true [-Wtype-limits]
         if ((v) <= apic_verbosity) \
                 ^~
      arch/x86/kernel/apic/io_apic.c:2160:2: note: in expansion of macro
      'apic_printk'
        apic_printk(APIC_QUIET, KERN_INFO "..TIMER: vector=0x%02X "
        ^~~~~~~~~~~
      ./arch/x86/include/asm/apic.h:37:11: warning: comparison of unsigned
      expression >= 0 is always true [-Wtype-limits]
         if ((v) <= apic_verbosity) \
                 ^~
      arch/x86/kernel/apic/io_apic.c:2207:4: note: in expansion of macro
      'apic_printk'
          apic_printk(APIC_QUIET, KERN_ERR "..MP-BIOS bug: "
          ^~~~~~~~~~~
      
      APIC_QUIET is 0, so silence them by making apic_verbosity type int.
      
      Signed-off-by: default avatarQian Cai <cai@lca.pw>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/1562621805-24789-1-git-send-email-cai@lca.pw
      ec633558
  9. Jul 12, 2019
    • Mike Rapoport's avatar
      asm-generic, x86: introduce generic pte_{alloc,free}_one[_kernel] · 5fba4af4
      Mike Rapoport authored
      Most architectures have identical or very similar implementation of
      pte_alloc_one_kernel(), pte_alloc_one(), pte_free_kernel() and
      pte_free().
      
      Add a generic implementation that can be reused across architectures and
      enable its use on x86.
      
      The generic implementation uses
      
      	GFP_KERNEL | __GFP_ZERO
      
      for the kernel page tables and
      
      	GFP_KERNEL | __GFP_ZERO | __GFP_ACCOUNT
      
      for the user page tables.
      
      The "base" functions for PTE allocation, namely __pte_alloc_one_kernel()
      and __pte_alloc_one() are intended for the architectures that require
      additional actions after actual memory allocation or must use non-default
      GFP flags.
      
      x86 is switched to use generic pte_alloc_one_kernel(), pte_free_kernel() and
      pte_free().
      
      x86 still implements pte_alloc_one() to allow run-time control of GFP
      flags required for "userpte" command line option.
      
      Link: http://lkml.kernel.org/r/1557296232-15361-2-git-send-email-rppt@linux.ibm.com
      
      
      Signed-off-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Guo Ren <ren_guo@c-sky.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Palmer Dabbelt <palmer@sifive.com>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sam Creasey <sammy@sammy.net>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5fba4af4
    • Christoph Hellwig's avatar
      mm: lift the x86_32 PAE version of gup_get_pte to common code · 39656e83
      Christoph Hellwig authored
      The split low/high access is the only non-READ_ONCE version of gup_get_pte
      that did show up in the various arch implemenations.  Lift it to common
      code and drop the ifdef based arch override.
      
      Link: http://lkml.kernel.org/r/20190625143715.1689-4-hch@lst.de
      
      
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      39656e83
    • Christoph Hellwig's avatar
      mm: simplify gup_fast_permitted · 26f4c328
      Christoph Hellwig authored
      Pass in the already calculated end value instead of recomputing it, and
      leave the end > start check in the callers instead of duplicating them in
      the arch code.
      
      Link: http://lkml.kernel.org/r/20190625143715.1689-3-hch@lst.de
      
      
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      26f4c328
    • Marco Elver's avatar
      asm-generic, x86: add bitops instrumentation for KASAN · 751ad98d
      Marco Elver authored
      This adds a new header to asm-generic to allow optionally instrumenting
      architecture-specific asm implementations of bitops.
      
      This change includes the required change for x86 as reference and
      changes the kernel API doc to point to bitops-instrumented.h instead.
      Rationale: the functions in x86's bitops.h are no longer the kernel API
      functions, but instead the arch_ prefixed functions, which are then
      instrumented via bitops-instrumented.h.
      
      Other architectures can similarly add support for asm implementations of
      bitops.
      
      The documentation text was derived from x86 and existing bitops
      asm-generic versions: 1) references to x86 have been removed; 2) as a
      result, some of the text had to be reworded for clarity and consistency.
      
      Tested using lib/test_kasan with bitops tests (pre-requisite patch).
      Bugzilla ref: https://bugzilla.kernel.org/show_bug.cgi?id=198439
      
      Link: http://lkml.kernel.org/r/20190613125950.197667-4-elver@google.com
      
      
      Signed-off-by: default avatarMarco Elver <elver@google.com>
      Acked-by: default avatarMark Rutland <mark.rutland@arm.com>
      Reviewed-by: default avatarAndrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      751ad98d
  10. Jul 11, 2019
    • Eric Hankland's avatar
      KVM: x86: PMU Event Filter · 66bb8a06
      Eric Hankland authored
      
      
      Some events can provide a guest with information about other guests or the
      host (e.g. L3 cache stats); providing the capability to restrict access
      to a "safe" set of events would limit the potential for the PMU to be used
      in any side channel attacks. This change introduces a new VM ioctl that
      sets an event filter. If the guest attempts to program a counter for
      any blacklisted or non-whitelisted event, the kernel counter won't be
      created, so any RDPMC/RDMSR will show 0 instances of that event.
      
      Signed-off-by: default avatarEric Hankland <ehankland@google.com>
      [Lots of changes. All remaining bugs are probably mine. - Paolo]
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      66bb8a06
  11. Jul 10, 2019
  12. Jul 09, 2019
    • Josh Poimboeuf's avatar
      x86/speculation: Prepare entry code for Spectre v1 swapgs mitigations · 18ec54fd
      Josh Poimboeuf authored
      
      
      
      Spectre v1 isn't only about array bounds checks.  It can affect any
      conditional checks.  The kernel entry code interrupt, exception, and NMI
      handlers all have conditional swapgs checks.  Those may be problematic in
      the context of Spectre v1, as kernel code can speculatively run with a user
      GS.
      
      For example:
      
      	if (coming from user space)
      		swapgs
      	mov %gs:<percpu_offset>, %reg
      	mov (%reg), %reg1
      
      When coming from user space, the CPU can speculatively skip the swapgs, and
      then do a speculative percpu load using the user GS value.  So the user can
      speculatively force a read of any kernel value.  If a gadget exists which
      uses the percpu value as an address in another load/store, then the
      contents of the kernel value may become visible via an L1 side channel
      attack.
      
      A similar attack exists when coming from kernel space.  The CPU can
      speculatively do the swapgs, causing the user GS to get used for the rest
      of the speculative window.
      
      The mitigation is similar to a traditional Spectre v1 mitigation, except:
      
        a) index masking isn't possible; because the index (percpu offset)
           isn't user-controlled; and
      
        b) an lfence is needed in both the "from user" swapgs path and the
           "from kernel" non-swapgs path (because of the two attacks described
           above).
      
      The user entry swapgs paths already have SWITCH_TO_KERNEL_CR3, which has a
      CR3 write when PTI is enabled.  Since CR3 writes are serializing, the
      lfences can be skipped in those cases.
      
      On the other hand, the kernel entry swapgs paths don't depend on PTI.
      
      To avoid unnecessary lfences for the user entry case, create two separate
      features for alternative patching:
      
        X86_FEATURE_FENCE_SWAPGS_USER
        X86_FEATURE_FENCE_SWAPGS_KERNEL
      
      Use these features in entry code to patch in lfences where needed.
      
      The features aren't enabled yet, so there's no functional change.
      
      Signed-off-by: default avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarDave Hansen <dave.hansen@intel.com>
      18ec54fd
    • Sebastian Andrzej Siewior's avatar
      x86/ldt: Initialize the context lock for init_mm · 39ca5fb4
      Sebastian Andrzej Siewior authored
      
      
      The mutex mm->context->lock for init_mm is not initialized for init_mm.
      This wasn't a problem because it remained unused. This changed however
      since commit
      	4fc19708 ("x86/alternatives: Initialize temporary mm for patching")
      
      Initialize the mutex for init_mm.
      
      Fixes: 4fc19708 ("x86/alternatives: Initialize temporary mm for patching")
      Signed-off-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarNadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Link: https://lkml.kernel.org/r/20190701173354.2pe62hhliok2afea@linutronix.de
      
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      39ca5fb4
  13. Jul 08, 2019
  14. Jul 07, 2019
  15. Jul 03, 2019
    • Thomas Gleixner's avatar
      x86/fsgsbase: Revert FSGSBASE support · 049331f2
      Thomas Gleixner authored
      
      
      The FSGSBASE series turned out to have serious bugs and there is still an
      open issue which is not fully understood yet.
      
      The confidence in those changes has become close to zero especially as the
      test cases which have been shipped with that series were obviously never
      run before sending the final series out to LKML.
      
        ./fsgsbase_64 >/dev/null
        Segmentation fault
      
      As the merge window is close, the only sane decision is to revert FSGSBASE
      support. The revert is necessary as this branch has been merged into
      perf/core already and rebasing all of that a few days before the merge
      window is not the most brilliant idea.
      
      I could definitely slap myself for not noticing the test case fail when
      merging that series, but TBH my expectations weren't that low back
      then. Won't happen again.
      
      Revert the following commits:
      539bca53 ("x86/entry/64: Fix and clean up paranoid_exit")
      2c7b5ac5 ("Documentation/x86/64: Add documentation for GS/FS addressing mode")
      f987c955 ("x86/elf: Enumerate kernel FSGSBASE capability in AT_HWCAP2")
      2032f1f9 ("x86/cpu: Enable FSGSBASE on 64bit by default and add a chicken bit")
      5bf0cab6 ("x86/entry/64: Document GSBASE handling in the paranoid path")
      708078f6 ("x86/entry/64: Handle FSGSBASE enabled paranoid entry/exit")
      79e1932f ("x86/entry/64: Introduce the FIND_PERCPU_BASE macro")
      1d07316b ("x86/entry/64: Switch CR3 before SWAPGS in paranoid entry")
      f60a83df ("x86/process/64: Use FSGSBASE instructions on thread copy and ptrace")
      1ab5f3f7 ("x86/process/64: Use FSBSBASE in switch_to() if available")
      a86b4625 ("x86/fsgsbase/64: Enable FSGSBASE instructions in helper functions")
      8b71340d ("x86/fsgsbase/64: Add intrinsics for FSGSBASE instructions")
      b64ed19b ("x86/cpu: Add 'unsafe_fsgsbase' to enable CR4.FSGSBASE")
      
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Acked-by: default avatarIngo Molnar <mingo@kernel.org>
      Cc: Chang S. Bae <chang.seok.bae@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Ravi Shankar <ravi.v.shankar@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      049331f2
    • Michael Kelley's avatar
      clocksource/drivers: Continue making Hyper-V clocksource ISA agnostic · dd2cb348
      Michael Kelley authored
      
      
      Continue consolidating Hyper-V clock and timer code into an ISA
      independent Hyper-V clocksource driver.
      
      Move the existing clocksource code under drivers/hv and arch/x86 to the new
      clocksource driver while separating out the ISA dependencies. Update
      Hyper-V initialization to call initialization and cleanup routines since
      the Hyper-V synthetic clock is not independently enumerated in ACPI.
      
      Update Hyper-V clocksource users in KVM and VDSO to get definitions from
      the new include file.
      
      No behavior is changed and no new functionality is added.
      
      Suggested-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarMichael Kelley <mikelley@microsoft.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Cc: "bp@alien8.de" <bp@alien8.de>
      Cc: "will.deacon@arm.com" <will.deacon@arm.com>
      Cc: "catalin.marinas@arm.com" <catalin.marinas@arm.com>
      Cc: "mark.rutland@arm.com" <mark.rutland@arm.com>
      Cc: "linux-arm-kernel@lists.infradead.org" <linux-arm-kernel@lists.infradead.org>
      Cc: "gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>
      Cc: "linux-hyperv@vger.kernel.org" <linux-hyperv@vger.kernel.org>
      Cc: "olaf@aepfle.de" <olaf@aepfle.de>
      Cc: "apw@canonical.com" <apw@canonical.com>
      Cc: "jasowang@redhat.com" <jasowang@redhat.com>
      Cc: "marcelo.cerri@canonical.com" <marcelo.cerri@canonical.com>
      Cc: Sunil Muthuswamy <sunilmut@microsoft.com>
      Cc: KY Srinivasan <kys@microsoft.com>
      Cc: "sashal@kernel.org" <sashal@kernel.org>
      Cc: "vincenzo.frascino@arm.com" <vincenzo.frascino@arm.com>
      Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>
      Cc: "linux-mips@vger.kernel.org" <linux-mips@vger.kernel.org>
      Cc: "linux-kselftest@vger.kernel.org" <linux-kselftest@vger.kernel.org>
      Cc: "arnd@arndb.de" <arnd@arndb.de>
      Cc: "linux@armlinux.org.uk" <linux@armlinux.org.uk>
      Cc: "ralf@linux-mips.org" <ralf@linux-mips.org>
      Cc: "paul.burton@mips.com" <paul.burton@mips.com>
      Cc: "daniel.lezcano@linaro.org" <daniel.lezcano@linaro.org>
      Cc: "salyzyn@android.com" <salyzyn@android.com>
      Cc: "pcc@google.com" <pcc@google.com>
      Cc: "shuah@kernel.org" <shuah@kernel.org>
      Cc: "0x7f454c46@gmail.com" <0x7f454c46@gmail.com>
      Cc: "linux@rasmusvillemoes.dk" <linux@rasmusvillemoes.dk>
      Cc: "huw@codeweavers.com" <huw@codeweavers.com>
      Cc: "sfr@canb.auug.org.au" <sfr@canb.auug.org.au>
      Cc: "pbonzini@redhat.com" <pbonzini@redhat.com>
      Cc: "rkrcmar@redhat.com" <rkrcmar@redhat.com>
      Cc: "kvm@vger.kernel.org" <kvm@vger.kernel.org>
      Link: https://lkml.kernel.org/r/1561955054-1838-3-git-send-email-mikelley@microsoft.com
      dd2cb348
    • Michael Kelley's avatar
      clocksource/drivers: Make Hyper-V clocksource ISA agnostic · fd1fea68
      Michael Kelley authored
      
      
      Hyper-V clock/timer code and data structures are currently mixed
      in with other code in the ISA independent drivers/hv directory as
      well as the ISA dependent Hyper-V code under arch/x86.
      
      Consolidate this code and data structures into a Hyper-V clocksource driver
      to better follow the Linux model. In doing so, separate out the ISA
      dependent portions so the new clocksource driver works for x86 and for the
      in-process Hyper-V on ARM64 code.
      
      To start, move the existing clockevents code to create the new clocksource
      driver. Update the VMbus driver to call initialization and cleanup routines
      since the Hyper-V synthetic timers are not independently enumerated in
      ACPI.
      
      No behavior is changed and no new functionality is added.
      
      Suggested-by: default avatarMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: default avatarMichael Kelley <mikelley@microsoft.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Cc: "bp@alien8.de" <bp@alien8.de>
      Cc: "will.deacon@arm.com" <will.deacon@arm.com>
      Cc: "catalin.marinas@arm.com" <catalin.marinas@arm.com>
      Cc: "mark.rutland@arm.com" <mark.rutland@arm.com>
      Cc: "linux-arm-kernel@lists.infradead.org" <linux-arm-kernel@lists.infradead.org>
      Cc: "gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>
      Cc: "linux-hyperv@vger.kernel.org" <linux-hyperv@vger.kernel.org>
      Cc: "olaf@aepfle.de" <olaf@aepfle.de>
      Cc: "apw@canonical.com" <apw@canonical.com>
      Cc: "jasowang@redhat.com" <jasowang@redhat.com>
      Cc: "marcelo.cerri@canonical.com" <marcelo.cerri@canonical.com>
      Cc: Sunil Muthuswamy <sunilmut@microsoft.com>
      Cc: KY Srinivasan <kys@microsoft.com>
      Cc: "sashal@kernel.org" <sashal@kernel.org>
      Cc: "vincenzo.frascino@arm.com" <vincenzo.frascino@arm.com>
      Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>
      Cc: "linux-mips@vger.kernel.org" <linux-mips@vger.kernel.org>
      Cc: "linux-kselftest@vger.kernel.org" <linux-kselftest@vger.kernel.org>
      Cc: "arnd@arndb.de" <arnd@arndb.de>
      Cc: "linux@armlinux.org.uk" <linux@armlinux.org.uk>
      Cc: "ralf@linux-mips.org" <ralf@linux-mips.org>
      Cc: "paul.burton@mips.com" <paul.burton@mips.com>
      Cc: "daniel.lezcano@linaro.org" <daniel.lezcano@linaro.org>
      Cc: "salyzyn@android.com" <salyzyn@android.com>
      Cc: "pcc@google.com" <pcc@google.com>
      Cc: "shuah@kernel.org" <shuah@kernel.org>
      Cc: "0x7f454c46@gmail.com" <0x7f454c46@gmail.com>
      Cc: "linux@rasmusvillemoes.dk" <linux@rasmusvillemoes.dk>
      Cc: "huw@codeweavers.com" <huw@codeweavers.com>
      Cc: "sfr@canb.auug.org.au" <sfr@canb.auug.org.au>
      Cc: "pbonzini@redhat.com" <pbonzini@redhat.com>
      Cc: "rkrcmar@redhat.com" <rkrcmar@redhat.com>
      Cc: "kvm@vger.kernel.org" <kvm@vger.kernel.org>
      Link: https://lkml.kernel.org/r/1561955054-1838-2-git-send-email-mikelley@microsoft.com
      fd1fea68
    • Thomas Gleixner's avatar
      x86/irq: Seperate unused system vectors from spurious entry again · f8a8fe61
      Thomas Gleixner authored
      
      
      Quite some time ago the interrupt entry stubs for unused vectors in the
      system vector range got removed and directly mapped to the spurious
      interrupt vector entry point.
      
      Sounds reasonable, but it's subtly broken. The spurious interrupt vector
      entry point pushes vector number 0xFF on the stack which makes the whole
      logic in __smp_spurious_interrupt() pointless.
      
      As a consequence any spurious interrupt which comes from a vector != 0xFF
      is treated as a real spurious interrupt (vector 0xFF) and not
      acknowledged. That subsequently stalls all interrupt vectors of equal and
      lower priority, which brings the system to a grinding halt.
      
      This can happen because even on 64-bit the system vector space is not
      guaranteed to be fully populated. A full compile time handling of the
      unused vectors is not possible because quite some of them are conditonally
      populated at runtime.
      
      Bring the entry stubs back, which wastes 160 bytes if all stubs are unused,
      but gains the proper handling back. There is no point to selectively spare
      some of the stubs which are known at compile time as the required code in
      the IDT management would be way larger and convoluted.
      
      Do not route the spurious entries through common_interrupt and do_IRQ() as
      the original code did. Route it to smp_spurious_interrupt() which evaluates
      the vector number and acts accordingly now that the real vector numbers are
      handed in.
      
      Fixup the pr_warn so the actual spurious vector (0xff) is clearly
      distiguished from the other vectors and also note for the vectored case
      whether it was pending in the ISR or not.
      
       "Spurious APIC interrupt (vector 0xFF) on CPU#0, should never happen."
       "Spurious interrupt vector 0xed on CPU#1. Acked."
       "Spurious interrupt vector 0xee on CPU#1. Not pending!."
      
      Fixes: 2414e021 ("x86: Avoid building unused IRQ entry stubs")
      Reported-by: default avatarJan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Jan Beulich <jbeulich@suse.com>
      Link: https://lkml.kernel.org/r/20190628111440.550568228@linutronix.de
      f8a8fe61
    • Thomas Gleixner's avatar
      x86/irq: Handle spurious interrupt after shutdown gracefully · b7107a67
      Thomas Gleixner authored
      
      
      Since the rework of the vector management, warnings about spurious
      interrupts have been reported. Robert provided some more information and
      did an initial analysis. The following situation leads to these warnings:
      
         CPU 0                  CPU 1               IO_APIC
      
                                                    interrupt is raised
                                                    sent to CPU1
      			  Unable to handle
      			  immediately
      			  (interrupts off,
      			   deep idle delay)
         mask()
         ...
         free()
           shutdown()
           synchronize_irq()
           clear_vector()
                                do_IRQ()
                                  -> vector is clear
      
      Before the rework the vector entries of legacy interrupts were statically
      assigned and occupied precious vector space while most of them were
      unused. Due to that the above situation was handled silently because the
      vector was handled and the core handler of the assigned interrupt
      descriptor noticed that it is shut down and returned.
      
      While this has been usually observed with legacy interrupts, this situation
      is not limited to them. Any other interrupt source, e.g. MSI, can cause the
      same issue.
      
      After adding proper synchronization for level triggered interrupts, this
      can only happen for edge triggered interrupts where the IO-APIC obviously
      cannot provide information about interrupts in flight.
      
      While the spurious warning is actually harmless in this case it worries
      users and driver developers.
      
      Handle it gracefully by marking the vector entry as VECTOR_SHUTDOWN instead
      of VECTOR_UNUSED when the vector is freed up.
      
      If that above late handling happens the spurious detector will not complain
      and switch the entry to VECTOR_UNUSED. Any subsequent spurious interrupt on
      that line will trigger the spurious warning as before.
      
      Fixes: 464d1230 ("x86/vector: Switch IOAPIC to global reservation mode")
      Reported-by: default avatarRobert Hodaszi <Robert.Hodaszi@digi.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de&gt;->
      Tested-by: default avatarRobert Hodaszi <Robert.Hodaszi@digi.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Link: https://lkml.kernel.org/r/20190628111440.459647741@linutronix.de
      b7107a67
  16. Jul 01, 2019
  17. Jun 29, 2019
    • Thomas Gleixner's avatar
      x86/timer: Skip PIT initialization on modern chipsets · c8c40767
      Thomas Gleixner authored
      
      
      Recent Intel chipsets including Skylake and ApolloLake have a special
      ITSSPRC register which allows the 8254 PIT to be gated.  When gated, the
      8254 registers can still be programmed as normal, but there are no IRQ0
      timer interrupts.
      
      Some products such as the Connex L1430 and exone go Rugged E11 use this
      register to ship with the PIT gated by default. This causes Linux to fail
      to boot:
      
        Kernel panic - not syncing: IO-APIC + timer doesn't work! Boot with
        apic=debug and send a report.
      
      The panic happens before the framebuffer is initialized, so to the user, it
      appears as an early boot hang on a black screen.
      
      Affected products typically have a BIOS option that can be used to enable
      the 8254 and make Linux work (Chipset -> South Cluster Configuration ->
      Miscellaneous Configuration -> 8254 Clock Gating), however it would be best
      to make Linux support the no-8254 case.
      
      Modern sytems allow to discover the TSC and local APIC timer frequencies,
      so the calibration against the PIT is not required. These systems have
      always running timers and the local APIC timer works also in deep power
      states.
      
      So the setup of the PIT including the IO-APIC timer interrupt delivery
      checks are a pointless exercise.
      
      Skip the PIT setup and the IO-APIC timer interrupt checks on these systems,
      which avoids the panic caused by non ticking PITs and also speeds up the
      boot process.
      
      Thanks to Daniel for providing the changelog, initial analysis of the
      problem and testing against a variety of machines.
      
      Reported-by: default avatarDaniel Drake <drake@endlessm.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Tested-by: default avatarDaniel Drake <drake@endlessm.com>
      Cc: bp@alien8.de
      Cc: hpa@zytor.com
      Cc: linux@endlessm.com
      Cc: rafael.j.wysocki@intel.com
      Cc: hdegoede@redhat.com
      Link: https://lkml.kernel.org/r/20190628072307.24678-1-drake@endlessm.com
      c8c40767
  18. Jun 27, 2019
Loading