  Apr 02, 2021
    • objtool/x86: Rewrite retpoline thunk calls · 9bc0bb50
      Peter Zijlstra authored
      
      
      When the compiler emits: "CALL __x86_indirect_thunk_\reg" for an
      indirect call, have objtool rewrite it to:
      
      	ALTERNATIVE "call __x86_indirect_thunk_\reg",
      		    "call *%reg", ALT_NOT(X86_FEATURE_RETPOLINE)
      
      Additionally, in order not to emit endless identical
      .altinstr_replacement chunks, use a global symbol for them; see
      __x86_indirect_alt_*.
      
      This also spares objtool from having to do code generation.
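
      As a hedged illustration only (not code from the patch), the same
      selection could be written by hand from C with the ALTERNATIVE asm
      macro; the per-register plumbing objtool emits is more involved:

        /*
         * Sketch, assuming <asm/alternative.h>: call fn() through the
         * rax thunk unless the CPU needs no retpoline, in which case a
         * plain indirect call is patched in. Clobbers of caller-saved
         * registers are elided for brevity.
         */
        static void demo_indirect_call(void (*fn)(void))
        {
        	asm volatile(ALTERNATIVE("call __x86_indirect_thunk_rax",
        				 "call *%%rax",
        				 ALT_NOT(X86_FEATURE_RETPOLINE))
        		     : : "a" (fn) : "memory");
        }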
      
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Miroslav Benes <mbenes@suse.cz>
      Link: https://lkml.kernel.org/r/20210326151300.320177914@infradead.org
    • x86/retpoline: Simplify retpolines · 11925185
      Peter Zijlstra authored
      
      
      Due to:
      
        c9c324dc ("objtool: Support stack layout changes in alternatives")
      
      it is now possible to simplify the retpolines.
      
      Currently our retpolines consist of 2 symbols:
      
       - __x86_indirect_thunk_\reg: the compiler target
       - __x86_retpoline_\reg: the actual retpoline.
      
      Both are consecutive in code and aligned such that for any one register
      they both live in the same cacheline:
      
        0000000000000000 <__x86_indirect_thunk_rax>:
         0:   ff e0                   jmpq   *%rax
         2:   90                      nop
         3:   90                      nop
         4:   90                      nop
      
        0000000000000005 <__x86_retpoline_rax>:
         5:   e8 07 00 00 00          callq  11 <__x86_retpoline_rax+0xc>
         a:   f3 90                   pause
         c:   0f ae e8                lfence
         f:   eb f9                   jmp    a <__x86_retpoline_rax+0x5>
        11:   48 89 04 24             mov    %rax,(%rsp)
        15:   c3                      retq
        16:   66 2e 0f 1f 84 00 00 00 00 00   nopw   %cs:0x0(%rax,%rax,1)
      
      The thunk is an alternative_2, where one option is a JMP to the
      retpoline. This was done so that objtool didn't need to deal with
      alternatives with stack ops. But that problem has been solved, so now
      it is possible to fold the entire retpoline into the alternative to
      simplify and consolidate unused bytes:
      
        0000000000000000 <__x86_indirect_thunk_rax>:
         0:   ff e0                   jmpq   *%rax
         2:   90                      nop
         3:   90                      nop
         4:   90                      nop
         5:   90                      nop
         6:   90                      nop
         7:   90                      nop
         8:   90                      nop
         9:   90                      nop
         a:   90                      nop
         b:   90                      nop
         c:   90                      nop
         d:   90                      nop
         e:   90                      nop
         f:   90                      nop
        10:   90                      nop
        11:   66 66 2e 0f 1f 84 00 00 00 00 00        data16 nopw %cs:0x0(%rax,%rax,1)
        1c:   0f 1f 40 00             nopl   0x0(%rax)
      
      Notice that since the longest alternative sequence is now:
      
         0:   e8 07 00 00 00          callq  c <.altinstr_replacement+0xc>
         5:   f3 90                   pause
         7:   0f ae e8                lfence
         a:   eb f9                   jmp    5 <.altinstr_replacement+0x5>
         c:   48 89 04 24             mov    %rax,(%rsp)
        10:   c3                      retq
      
      17 bytes, we have 15 bytes of NOPs at the end of our 32-byte slot. (IOW,
      if we could shrink the retpoline by 1 byte, we could pack it more densely.)
      
       [ bp: Massage commit message. ]
      
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lkml.kernel.org/r/20210326151259.506071949@infradead.org
    • x86/alternatives: Optimize optimize_nops() · 23c1ad53
      Peter Zijlstra authored
      
      
      Currently, optimize_nops() scans to see if the alternative starts with
      NOPs. However, the emit pattern is:
      
        141:	\oldinstr
        142:	.skip (len-(142b-141b)), 0x90
      
      That is, when 'oldinstr' is short, the tail is padded with NOPs. This case
      never gets optimized.
      
      Rewrite optimize_nops() to replace any trailing string of NOPs inside
      the alternative with larger NOPs. Also run it irrespective of patching,
      replacing NOPs in both the original and replaced code.
      
      A direct consequence is that 'padlen' becomes superfluous, so remove it.
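
      A minimal sketch of the new behavior, assuming the existing add_nops()
      helper (illustrative, not the in-tree code):

        static void optimize_trailing_nops(u8 *instr, unsigned int len)
        {
        	unsigned int nops = 0;

        	/* count the trailing run of 1-byte NOPs (0x90) */
        	while (nops < len && instr[len - 1 - nops] == 0x90)
        		nops++;

        	/* rewrite that run as the fewest, longest NOPs */
        	if (nops)
        		add_nops(instr + len - nops, nops);
        }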
      
       [ bp:
         - Adjust commit message
         - remove a stale comment about needing to pad
         - add a comment in optimize_nops()
         - exit early if the NOP verif. loop catches a mismatch - the function
           should not add NOPs in that case
         - fix the "optimized NOPs" offsets output ]
      
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lkml.kernel.org/r/20210326151259.442992235@infradead.org
  Mar 19, 2021
    • x86/apic/of: Fix CPU devicetree-node lookups · dd926880
      Johan Hovold authored
      
      
      Architectures that describe the CPU topology in devicetree and do not have
      an identity mapping between physical and logical CPU ids must override the
      default implementation of arch_match_cpu_phys_id().
      
      Failing to do so breaks CPU devicetree-node lookups using of_get_cpu_node()
      and of_cpu_device_node_get() which several drivers rely on. It also causes
      the CPU struct devices exported through sysfs to point to the wrong
      devicetree nodes.
      
      On x86, CPUs are described in devicetree using their APIC ids and those
      do not generally coincide with the logical ids, even if CPU0 typically
      uses APIC id 0.
      
      Add the missing implementation of arch_match_cpu_phys_id() so that
      CPU-node lookups also work with SMP.
      
      Apart from fixing the broken sysfs devicetree-node links, this likely does
      not affect current users of mainline kernels on x86.
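
      The override itself is tiny; a hedged sketch of its shape (the exact
      accessor used by the patch may differ):

        /* devicetree describes x86 CPUs by their APIC ids */
        bool arch_match_cpu_phys_id(int cpu, u64 phys_id)
        {
        	return phys_id == (u64)per_cpu(x86_cpu_to_apicid, cpu);
        }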
      
      Fixes: 4e07db9c ("x86/devicetree: Use CPU description from Device Tree")
      Signed-off-by: Johan Hovold <johan@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20210312092033.26317-1-johan@kernel.org
    • x86/ioapic: Ignore IRQ2 again · a501b048
      Thomas Gleixner authored
      
      
      Vitaly ran into an issue with hotplugging CPU0 on an Amazon instance where
      the matrix allocator claimed to be out of vectors. He analyzed it down to
      the point that IRQ2, the PIC cascade interrupt, which is never supposed to
      be routed to the IO/APIC, ended up having an interrupt vector assigned
      which got moved during the unplug of CPU0.
      
      The underlying issue is that IRQ2, for various reasons (see commit
      af174783 ("x86: I/O APIC: Never configure IRQ2") for details), is treated
      as a reserved system vector by the vector core code and is not accounted as
      a regular vector. The Amazon BIOS has a routing entry of pin 2 to IRQ2,
      which causes the IO/APIC setup to claim that interrupt, and the claim is
      granted by the vector domain because there is no sanity check. As a
      consequence, the allocation counter of CPU0 underflows, which causes a
      subsequent unplug to fail with:
      
        [ ... ] CPU 0 has 4294967295 vectors, 589 available. Cannot disable CPU
      
      There is another sanity check missing in the matrix allocator, but the
      underlying root cause is that the IO/APIC code lost the IRQ2 ignore logic
      during the conversion to irqdomains.
      
      For almost 6 years nobody complained about this wreckage, which might
      indicate that this requirement could be lifted. But for any system which
      actually has a PIC, IRQ2 is unusable by design, so any routing entry has no
      effect and the interrupt cannot be connected to a device anyway.
      
      Due to that, and for historically biased paranoia reasons, restore the IRQ2
      ignore logic and treat it as nonexistent despite a routing entry claiming
      otherwise.
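
      Conceptually, the restored logic amounts to a check of this shape (an
      illustrative sketch, not the exact patch):

        /*
         * IRQ2 is the PIC cascade and must never be routed through the
         * IO/APIC: treat any such routing entry as nonexistent.
         */
        if (irq == 2 && nr_legacy_irqs() > 0)
        	return -EINVAL;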
      
      Fixes: d32932d0 ("x86/irq: Convert IOAPIC to use hierarchical irqdomain interfaces")
      Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20210318192819.636943062@linutronix.de
      
  Mar 18, 2021
    • x86/kvm: Fix broken irq restoration in kvm_wait · f4e61f0c
      Wanpeng Li authored
      
      
      After commit 997acaf6 ("lockdep: report broken irq restoration"), the guest
      splats as below during boot:
      
       raw_local_irq_restore() called with IRQs enabled
       WARNING: CPU: 1 PID: 169 at kernel/locking/irqflag-debug.c:10 warn_bogus_irq_restore+0x26/0x30
       Modules linked in: hid_generic usbhid hid
       CPU: 1 PID: 169 Comm: systemd-udevd Not tainted 5.11.0+ #25
       RIP: 0010:warn_bogus_irq_restore+0x26/0x30
       Call Trace:
        kvm_wait+0x76/0x90
        __pv_queued_spin_lock_slowpath+0x285/0x2e0
        do_raw_spin_lock+0xc9/0xd0
        _raw_spin_lock+0x59/0x70
        lockref_get_not_dead+0xf/0x50
        __legitimize_path+0x31/0x60
        legitimize_root+0x37/0x50
        try_to_unlazy_next+0x7f/0x1d0
        lookup_fast+0xb0/0x170
        path_openat+0x165/0x9b0
        do_filp_open+0x99/0x110
        do_sys_openat2+0x1f1/0x2e0
        do_sys_open+0x5c/0x80
        __x64_sys_open+0x21/0x30
        do_syscall_64+0x32/0x50
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      The new consistency checking expects local_irq_save() and
      local_irq_restore() to be paired and sanely nested, and therefore expects
      local_irq_restore() to be called with IRQs disabled.
      The irqflags handling in kvm_wait(), which ends up doing:
      
      	local_irq_save(flags);
      	safe_halt();
      	local_irq_restore(flags);
      
      therefore triggers the warning. Fix it by using
      local_irq_disable()/local_irq_enable() directly.
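
      A hedged sketch of the fixed flow (simplified; the in-tree kvm_wait()
      also bails out when called from NMI context):

        static void kvm_wait(u8 *ptr, u8 val)
        {
        	if (irqs_disabled()) {
        		if (READ_ONCE(*ptr) == val)
        			halt();
        	} else {
        		local_irq_disable();

        		/* safe_halt() re-enables IRQs while halting */
        		if (READ_ONCE(*ptr) == val)
        			safe_halt();

        		local_irq_enable();
        	}
        }

      With local_irq_disable()/local_irq_enable() there is no flags
      save/restore pair for lockdep to flag, since IRQs are known to be
      enabled on entry to the else branch.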
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1615791328-2735-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Fix missing local pCPU when executing wbinvd on all dirty pCPUs · c2162e13
      Wanpeng Li authored
      
      
      In order to deal with noncoherent DMA, we should execute wbinvd on
      all dirty pCPUs when a guest wbinvd exits, to maintain data consistency.
      smp_call_function_many() does not execute the provided function on the
      local core, so replace it with on_each_cpu_mask().
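
      A hedged sketch of the change (simplified from the shape of
      kvm_emulate_wbinvd(); the surrounding cpumask handling is omitted):

        static void wbinvd_ipi(void *garbage)
        {
        	wbinvd();	/* write back and invalidate caches */
        }

        /* Runs wbinvd_ipi() on every CPU in the mask, including the
         * local one, unlike smp_call_function_many(): */
        on_each_cpu_mask(vcpu->arch.wbinvd_dirty_mask, wbinvd_ipi,
        		 NULL, 1);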
      
      Reported-by: Nadav Amit <namit@vmware.com>
      Cc: Nadav Amit <namit@vmware.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1615517151-7465-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Protect userspace MSR filter with SRCU, and set atomically-ish · b318e8de
      Sean Christopherson authored
      Fix a plethora of issues with MSR filtering by installing the resulting
      filter as an atomic bundle instead of updating the live filter one range
      at a time.  The KVM_X86_SET_MSR_FILTER ioctl() isn't truly atomic, as
      the hardware MSR bitmaps won't be updated until the next VM-Enter, but
      the relevant software struct is atomically updated, which is what KVM
      really needs.
      
      Similar to the approach used for modifying memslots, make arch.msr_filter
      an SRCU-protected pointer, do all the work of configuring the new filter
      outside of kvm->lock, and then acquire kvm->lock only when the new filter
      has been vetted and created.  That way vCPU readers either see the old
      filter or the new filter in their entirety, not some half-baked state.
      
      Yuan Yao pointed out [*] a use-after-free in kvm_msr_allowed() due to a
      TOCTOU bug, but that's just the tip of the iceberg...
      
        - Nothing is __rcu annotated, making it nigh impossible to audit the
          code for correctness.
        - kvm_add_msr_filter() has an unpaired smp_wmb().  Violation of kernel
          coding style aside, the lack of an smp_rmb() anywhere casts all code
          into doubt.
        - kvm_clear_msr_filter() has a double-free TOCTOU bug, as it grabs
          count before taking the lock.
        - kvm_clear_msr_filter() also has a memory leak due to the same TOCTOU bug.
      
      The entire approach of updating the live filter is also flawed.  While
      installing a new filter is inherently racy if vCPUs are running, fixing
      the above issues also makes it trivial to ensure certain behavior is
      deterministic, e.g. KVM can provide deterministic behavior for MSRs with
      identical settings in the old and new filters.  An atomic update of the
      filter also prevents KVM from getting into a half-baked state, e.g. if
      installing a filter fails, the existing approach would leave the filter
      in a half-baked state, having already committed whatever bits of the
      filter were already processed.
      
      [*] https://lkml.kernel.org/r/20210312083157.25403-1-yaoyuan0329os@gmail.com
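
      A hedged sketch of the publish step described above (names are
      illustrative; the actual code synchronizes against kvm->srcu readers):

        /* build and vet 'new' entirely outside kvm->lock ... */

        mutex_lock(&kvm->lock);
        old = rcu_replace_pointer(kvm->arch.msr_filter, new,
        			  mutex_is_locked(&kvm->lock));
        mutex_unlock(&kvm->lock);

        /* wait out vCPU readers of 'old' before freeing it */
        synchronize_srcu(&kvm->srcu);
        kfree(old);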
      Fixes: 1a155254 ("KVM: x86: Introduce MSR filtering")
      Cc: stable@vger.kernel.org
      Cc: Alexander Graf <graf@amazon.com>
      Reported-by: Yuan Yao <yaoyuan0329os@gmail.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210316184436.2544875-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: hyper-v: Don't touch TSC page values when guest opted for re-enlightenment · 0469f2f7
      Vitaly Kuznetsov authored
      
      
      When the guest opts for re-enlightenment notifications upon migration, it
      is within its rights to assume that TSC page values never change (as
      they're only supposed to change upon migration, and the host has to keep
      things as they are until it receives confirmation from the guest). This is
      mostly true until the guest is actually migrated somewhere: KVM userspace
      (e.g. QEMU) will trigger a masterclock update by writing to
      HV_X64_MSR_REFERENCE_TSC, by calling KVM_SET_CLOCK, ... and as the TSC
      value and the kvmclock reading drift apart (even slightly), the update
      causes TSC page values to change.
      
      The issue at hand is that when Hyper-V is migrated, it uses stale (cached)
      TSC page values to compute the difference between its own clocksource
      (provided by KVM) and its guests' TSC pages to program synthetic timers,
      and in some cases, when the TSC page is updated, this puts all stimer
      expirations in the past. This, in turn, causes an interrupt storm and
      leaves L2 guests making little forward progress.
      
      Note, KVM doesn't fully implement re-enlightenment notification. Basically,
      the support for re-enlightenment MSRs is just a stub, and userspace is only
      expected to expose the feature when TSC scaling is available on the expected
      destination hosts. With TSC scaling, no real re-enlightenment is needed as
      the TSC frequency doesn't change. With TSC scaling becoming ubiquitous, it
      likely makes little sense to fully implement re-enlightenment in KVM.
      
      Prevent the TSC page from being updated after migration. In case it's not
      the guest who's initiating the change and the TSC page is already enabled,
      just keep it as it is: the TSC value is supposed to be preserved across
      migration and the TSC frequency can't change with re-enlightenment enabled.
      The guest is doomed anyway if any of this is not true.
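
      In code, the rule boils down to a guard of roughly this shape
      (illustrative names; HV_TSC_PAGE_SET comes from the status tracking
      introduced alongside this change):

        /*
         * Host-initiated update while the TSC page is already set up:
         * keep the contents as they are, the guest depends on them.
         */
        if (hv->hv_tsc_page_status == HV_TSC_PAGE_SET && !guest_initiated)
        	goto out_unlock;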
      
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210316143736.964151-5-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: hyper-v: Track Hyper-V TSC page status · cc9cfddb
      Vitaly Kuznetsov authored
      
      
      Create an infrastructure for tracking the Hyper-V TSC page status, i.e.
      whether it was updated from the guest or host side, or whether we've failed
      to set it up (because e.g. the guest wrote garbage to
      HV_X64_MSR_REFERENCE_TSC) and there's no need to retry.
      
      Also, in a hypothetical situation when we are in 'always catchup' mode for
      TSC we can now avoid contending 'hv->hv_lock' on every guest enter by
      setting the state to HV_TSC_PAGE_BROKEN after compute_tsc_page_parameters()
      returns false.
      
      Check for HV_TSC_PAGE_SET state instead of '!hv->tsc_ref.tsc_sequence' in
      get_time_ref_counter() to properly handle the situation when we failed to
      write the updated TSC page values to the guest.
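
      A hedged sketch of the tracked states (HV_TSC_PAGE_SET and
      HV_TSC_PAGE_BROKEN are named above; the remaining members are
      assumptions about the patch):

        enum hv_tsc_page_status {
        	HV_TSC_PAGE_UNSET,		/* never set up */
        	HV_TSC_PAGE_GUEST_CHANGED,	/* guest wrote HV_X64_MSR_REFERENCE_TSC */
        	HV_TSC_PAGE_HOST_CHANGED,	/* host-initiated (e.g. migration) update */
        	HV_TSC_PAGE_SET,		/* values are valid */
        	HV_TSC_PAGE_BROKEN,		/* setup failed, don't retry */
        };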
      
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210316143736.964151-4-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  Mar 15, 2021
    • x86: Remove dynamic NOP selection · a89dfde3
      Peter Zijlstra authored
      
      
      This ensures that a NOP is a NOP and not a random other instruction that
      is also a NOP. It allows simplification of dynamic code patching that
      wants to verify existing code before writing new instructions (ftrace,
      jump_label, static_call, etc.).
      
      Differentiating on NOPs is not a feature.
      
      This pessimises 32bit (DONTCARE) and 32bit on 64bit CPUs (CARELESS).
      32bit is not a performance target.
      
      Everything x86_64 since AMD K10 (2007) and Intel IvyBridge (2012) is
      fine with using NOPL (as opposed to prefix NOP). And per FEATURE_NOPL
      being required for x86_64, all x86_64 CPUs can use NOPL. So stop
      caring about NOPs, simplify things and get on with life.
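
      For reference, the single NOP family x86_64 settles on is the
      SDM-recommended long-NOP set; a hedged illustration (the in-tree
      tables live in <asm/nops.h>, the array name here is ours):

        static const unsigned char nopl[8][8] = {
        	{ 0x90 },					/* nop */
        	{ 0x66, 0x90 },					/* osp nop */
        	{ 0x0f, 0x1f, 0x00 },				/* nopl (%rax) */
        	{ 0x0f, 0x1f, 0x40, 0x00 },			/* nopl 0x0(%rax) */
        	{ 0x0f, 0x1f, 0x44, 0x00, 0x00 },		/* nopl 0x0(%rax,%rax,1) */
        	{ 0x66, 0x0f, 0x1f, 0x44, 0x00, 0x00 },		/* nopw 0x0(%rax,%rax,1) */
        	{ 0x0f, 0x1f, 0x80, 0x00, 0x00, 0x00, 0x00 },	/* nopl 0x0(%rax), disp32 */
        	{ 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 }, /* nopl 0x0(%rax,%rax,1), disp32 */
        };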
      
      [ The problem seems to be that some uarchs can only decode NOPL on a
      single front-end port while others have severe decode penalties for
      excessive prefixes. All modern uarchs can handle both, except Atom,
      which has prefix penalties. ]
      
      [ Also, much doubt you can actually measure any of this on normal
      workloads. ]
      
      After this, FEATURE_NOPL is unused except for required-features for
      x86_64. FEATURE_K8 is only used for PTI.
      
       [ bp: Kernel build measurements showed ~0.3s slowdown on Sandybridge
         which is hardly a slowdown. Get rid of X86_FEATURE_K7, while at it. ]
      
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Acked-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> # bpf
      Acked-by: Linus Torvalds <torvalds@linuxfoundation.org>
      Link: https://lkml.kernel.org/r/20210312115749.065275711@infradead.org