  1. Mar 29, 2019
  2. Mar 15, 2019
    • KVM: doc: Document the life cycle of a VM and its resources · eca6be56
      Sean Christopherson authored
      The series to add memcg accounting to KVM allocations[1] states:
      
        There are many KVM kernel memory allocations which are tied to the
        life of the VM process and should be charged to the VM process's
        cgroup.
      
      While it is correct to account KVM kernel allocations to the cgroup of
      the process that created the VM, it's technically incorrect to state
      that the KVM kernel memory allocations are tied to the life of the VM
      process.  This is because the VM itself, i.e. struct kvm, is not tied to
      the life of the process which created it, rather it is tied to the life
      of its associated file descriptor.  In other words, kvm_destroy_vm() is
      not invoked until fput() decrements its associated file's refcount to
      zero.  A simple example is to fork() in Qemu and have the child sleep
      indefinitely; kvm_destroy_vm() isn't called until Qemu closes its file
      descriptor *and* the rogue child is killed.
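
      As an illustrative sketch of that scenario (error handling omitted; this
      snippet is not part of the patch itself):

        #include <fcntl.h>
        #include <sys/ioctl.h>
        #include <unistd.h>
        #include <linux/kvm.h>

        int main(void)
        {
            int kvm_fd = open("/dev/kvm", O_RDWR);
            int vm_fd  = ioctl(kvm_fd, KVM_CREATE_VM, 0); /* struct kvm created */

            if (fork() == 0)
                pause();     /* the child keeps the VM's file alive forever */

            close(vm_fd);    /* the parent's reference is gone ...          */
            /*
             * ... but kvm_destroy_vm() only runs once the child's copy of the
             * file descriptor goes away as well, i.e. when the child exits or
             * is killed.
             */
            return 0;
        }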
      
      The allocations are guaranteed to be *accounted* to the process which
      created the VM, but only because KVM's per-{VM,vCPU} ioctls reject the
      ioctl() with -EIO if kvm->mm != current->mm.  I.e. the child can keep
      the VM "alive" but can't do anything useful with its reference.
      
      Note that because 'struct kvm' also holds a reference to the mm_struct
      of its owner, the above behavior also applies to userspace allocations.
      
      Given that mucking with a VM's file descriptor can lead to subtle and
      undesirable behavior, e.g. memcg charges persisting after a VM is shut
      down, explicitly document a VM's lifecycle and its impact on the VM's
      resources.
      
      Alternatively, KVM could aggressively free resources when the creating
      process exits, e.g. via mmu_notifier->release().  However, mmu_notifier
      isn't guaranteed to be available, and freeing resources when the creator
      exits is likely to be error prone and fragile as KVM would need to
      ensure that it only freed resources that are truly out of reach. In
      practice, the existing behavior shouldn't be problematic as a properly
      configured system will prevent a child process from being moved out of
      the appropriate cgroup hierarchy, i.e. prevent hiding the process from
      the OOM killer, and will prevent an unprivileged user from being able to
      hold a reference to struct kvm via another method, e.g. debugfs.
      
      [1]https://patchwork.kernel.org/patch/10806707/
      
      
      
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. Dec 14, 2018
    • x86/kvm/hyper-v: Introduce KVM_GET_SUPPORTED_HV_CPUID · 2bc39970
      Vitaly Kuznetsov authored
      
      
      With every new Hyper-V Enlightenment we implement, we're forced to add a
      KVM_CAP_HYPERV_* capability. While this approach works, it is fairly
      inconvenient: most of the enlightenments have corresponding CPUID feature
      bit(s), and userspace has to know about those anyway to be able to expose
      the feature to the guest.
      
      Add KVM_GET_SUPPORTED_HV_CPUID ioctl (backed by KVM_CAP_HYPERV_CPUID, "one
      cap to rule them all!") returning all Hyper-V CPUID feature leaves.
      
      Using the existing KVM_GET_SUPPORTED_CPUID doesn't seem to be possible:
      Hyper-V CPUID feature leaves intersect with KVM's (e.g. 0x40000000,
      0x40000001) and we would probably confuse userspace in case we decide to
      return these twice.
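
      A minimal sketch of calling the new ioctl from userspace (vcpu_fd and the
      entry count are assumptions, not part of this patch):

        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        /* Dump the Hyper-V CPUID leaves of an existing vCPU fd. */
        static void dump_hv_cpuid(int vcpu_fd)
        {
            int nent = 64;               /* assumed to be large enough */
            struct kvm_cpuid2 *cpuid;

            cpuid = calloc(1, sizeof(*cpuid) +
                              nent * sizeof(struct kvm_cpuid_entry2));
            cpuid->nent = nent;

            if (ioctl(vcpu_fd, KVM_GET_SUPPORTED_HV_CPUID, cpuid) == 0)
                for (unsigned int i = 0; i < cpuid->nent; i++)
                    printf("0x%x: eax=%#x ebx=%#x ecx=%#x edx=%#x\n",
                           cpuid->entries[i].function,
                           cpuid->entries[i].eax, cpuid->entries[i].ebx,
                           cpuid->entries[i].ecx, cpuid->entries[i].edx);
            free(cpuid);
        }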
      
      KVM_CAP_HYPERV_CPUID's number is interim: we intend to drop
      KVM_CAP_HYPERV_STIMER_DIRECT and use its number instead.
      
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: introduce manual dirty log reprotect · 2a31b9db
      Paolo Bonzini authored
      
      
      There are two problems with KVM_GET_DIRTY_LOG.  First, and less important,
      it can take kvm->mmu_lock for an extended period of time.  Second, its user
      can actually see many false positives in some cases.  The latter is due
      to a benign race like this:
      
        1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
           them.
        2. The guest modifies the pages, causing them to be marked dirty.
        3. Userspace actually copies the pages.
        4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
           they were not written to since (3).
      
      This is especially a problem for large guests, where the time between
      (1) and (3) can be substantial.  This patch introduces a new
      capability which, when enabled, makes KVM_GET_DIRTY_LOG not
      write-protect the pages it returns.  Instead, userspace has to
      explicitly clear the dirty log bits just before using the content
      of the page.  The new KVM_CLEAR_DIRTY_LOG ioctl can also operate at a
      64-page granularity rather than requiring a full memslot to be synced;
      this way, the mmu_lock is taken for small amounts of time, and
      only a small amount of time will pass between write protection
      of pages and the sending of their content.
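
      A sketch of the intended flow (vm_fd, slot and bitmap are assumed to
      exist; the bitmap marks which of the returned dirty bits to clear):

        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        /* Enable manual reprotection, then clear (and re-write-protect) the
         * first 64 pages of a slot once their contents have been copied.   */
        static void clear_first_64_pages(int vm_fd, __u32 slot, void *bitmap)
        {
            struct kvm_enable_cap cap = {
                .cap = KVM_CAP_MANUAL_DIRTY_LOG_PROTECT,
                .args[0] = 1,
            };
            struct kvm_clear_dirty_log clear = {
                .slot = slot,
                .first_page = 0,
                .num_pages = 64,             /* 64-page granularity */
                .dirty_bitmap = bitmap,
            };

            ioctl(vm_fd, KVM_ENABLE_CAP, &cap);        /* VM-wide capability */
            ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clear);
        }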
      
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: make KVM_CAP_ENABLE_CAP_VM architecture agnostic · e5d83c74
      Paolo Bonzini authored
      
      
      The first such capability to be handled in virt/kvm/ will be manual
      dirty page reprotection.
      
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  4. Oct 17, 2018
    • kvm: x86: Introduce KVM_CAP_EXCEPTION_PAYLOAD · c4f55198
      Jim Mattson authored
      
      
      This is a per-VM capability which can be enabled by userspace so that
      the faulting linear address will be included with the information
      about a pending #PF in L2, and the "new DR6 bits" will be included
      with the information about a pending #DB in L2. With this capability
      enabled, the L1 hypervisor can now intercept #PF before CR2 is
      modified. Under VMX, the L1 hypervisor can now intercept #DB before
      DR6 and DR7 are modified.
      
      When userspace has enabled KVM_CAP_EXCEPTION_PAYLOAD, it should
      generally provide an appropriate payload when injecting a #PF or #DB
      exception via KVM_SET_VCPU_EVENTS. However, to support restoring old
      checkpoints, this payload is not required.
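
      For illustration, injecting a #PF together with its payload could look
      roughly like this (vcpu_fd is assumed and the capability already enabled):

        #include <string.h>
        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        /* Queue a pending #PF whose payload becomes CR2 only on delivery. */
        static void inject_pf(int vcpu_fd, __u64 fault_addr, __u32 error_code)
        {
            struct kvm_vcpu_events events;

            memset(&events, 0, sizeof(events));
            ioctl(vcpu_fd, KVM_GET_VCPU_EVENTS, &events);

            events.exception.pending = 1;         /* pending, not injected */
            events.exception.nr = 14;             /* #PF                   */
            events.exception.has_error_code = 1;
            events.exception.error_code = error_code;
            events.exception_has_payload = 1;
            events.exception_payload = fault_addr;
            events.flags |= KVM_VCPUEVENT_VALID_PAYLOAD;

            ioctl(vcpu_fd, KVM_SET_VCPU_EVENTS, &events);
        }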
      
      Note that bit 16 of the "new DR6 bits" is set to indicate that a debug
      exception (#DB) or a breakpoint exception (#BP) occurred inside an RTM
      region while advanced debugging of RTM transactional regions was
      enabled. This is the reverse of DR6.RTM, which is cleared in this
      scenario.
      
      This capability also enables exception.pending in struct
      kvm_vcpu_events, which allows userspace to distinguish between pending
      and injected exceptions.
      
      Reported-by: Jim Mattson <jmattson@google.com>
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: Add exception payload fields to kvm_vcpu_events · 59073aaf
      Jim Mattson authored
      
      
      The per-VM capability KVM_CAP_EXCEPTION_PAYLOAD (to be introduced in a
      later commit) adds the following fields to struct kvm_vcpu_events:
      exception_has_payload, exception_payload, and exception.pending.
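
      The relevant part of the x86 structure then reads roughly as follows
      (excerpt; unrelated fields elided):

        struct kvm_vcpu_events {
            struct {
                __u8  injected;
                __u8  nr;
                __u8  has_error_code;
                __u8  pending;
                __u32 error_code;
            } exception;
            /* ... interrupt, nmi, sipi_vector, flags, smi ... */
            __u8  reserved[27];
            __u8  exception_has_payload;
            __u64 exception_payload;
        };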
      
      With this capability set, all of the details of vcpu->arch.exception,
      including the payload for a pending exception, are reported to
      userspace in response to KVM_GET_VCPU_EVENTS.
      
      With this capability clear, the original ABI is preserved, and the
      exception.injected field is set for either pending or injected
      exceptions.
      
      When userspace calls KVM_SET_VCPU_EVENTS with
      KVM_CAP_EXCEPTION_PAYLOAD clear, exception.injected is no longer
      translated to exception.pending. KVM_SET_VCPU_EVENTS can now only
      establish a pending exception when KVM_CAP_EXCEPTION_PAYLOAD is set.
      
      Reported-by: Jim Mattson <jmattson@google.com>
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  5. Oct 16, 2018
    • KVM: Documentation: Fix omission in struct kvm_vcpu_events · bba9ce58
      Jim Mattson authored
      
      
      The header file indicates that there are 36 reserved bytes at the end
      of this structure. Adjust the documentation to agree with the header
      file.
      
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm/x86 : add coalesced pio support · 0804c849
      Peng Hao authored
      
      
      Coalesced pio is based on coalesced mmio and can be used for ports such
      as the RTC port, the PCI host config port, and so on.
      
      The RTC is a particularly good fit: some versions of Windows guests
      access the RTC frequently because it serves as the system tick. The guest
      accesses the RTC by writing the register index to port 0x70 and then
      writing or reading the data through port 0x71. The write to port 0x70
      only selects the index and has no other side effect, so coalesced pio can
      handle it and thereby reduce VM-exit overhead.
      
      A guest also accesses the PCI host config port frequently while starting
      up and shutting down, so registering these ports as coalesced pio can
      reduce startup and shutdown time.
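
      A sketch of registering the RTC index port as a coalesced pio zone
      (vm_fd is assumed; availability should be checked via
      KVM_CAP_COALESCED_PIO first):

        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        /* Register I/O port 0x70 so writes to it are only logged in the
         * coalesced ring instead of exiting to userspace immediately.   */
        static int coalesce_rtc_index_port(int vm_fd)
        {
            struct kvm_coalesced_mmio_zone zone = {
                .addr = 0x70,    /* an I/O port here, not a guest address */
                .size = 1,
                .pio  = 1,       /* mark the zone as port I/O             */
            };

            return ioctl(vm_fd, KVM_REGISTER_COALESCED_MMIO, &zone);
        }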
      
      Without this patch, the VM-exit time of accessing RTC port 0x70 and PIIX
      port 0xcf8, measured with perf (guest OS: Windows 7 64-bit):
      IO Port Access  Samples Samples%  Time%  Min Time  Max Time  Avg time
      0x70:POUT        86     30.99%    74.59%   9us      29us    10.75us (+- 3.41%)
      0xcf8:POUT     1119     2.60%     2.12%   2.79us    56.83us 3.41us (+- 2.23%)
      
      With this patch:
      IO Port Access  Samples Samples%  Time%   Min Time  Max Time   Avg time
      0x70:POUT       106    32.02%    29.47%    0us      10us     1.57us (+- 7.38%)
      0xcf8:POUT      1065    1.67%     0.28%   0.41us    65.44us   0.66us (+- 10.55%)
      
      Signed-off-by: Peng Hao <peng.hao2@zte.com.cn>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm/x86 : add document for coalesced mmio · 9943450b
      Peng Hao authored
      
      
      Signed-off-by: Peng Hao <peng.hao2@zte.com.cn>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: hyperv: implement PV IPI send hypercalls · 214ff83d
      Vitaly Kuznetsov authored
      
      
      Using a hypercall to send IPIs is faster because it allows specifying
      any number of vCPUs (even > 64, with a sparse CPU set) while the whole
      procedure takes only one VMEXIT.
      
      The current Hyper-V TLFS (v5.0b) claims that the
      HvCallSendSyntheticClusterIpi hypercall can't be 'fast' (passing
      parameters through registers), but apparently this is not true: Windows
      always uses it as 'fast', so we need to support that.
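
      On the userspace side, the enlightenment is only advertised to the guest
      when the host supports it; a sketch (kvm_fd is an open /dev/kvm
      descriptor, the CPUID plumbing is elided):

        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        /* Check whether KVM handles the PV IPI hypercalls before setting the
         * corresponding recommendation bit in the Hyper-V CPUID leaves.     */
        static int host_supports_hv_send_ipi(int kvm_fd)
        {
            return ioctl(kvm_fd, KVM_CHECK_EXTENSION,
                         KVM_CAP_HYPERV_SEND_IPI) > 0;
        }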
      
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  6. Oct 09, 2018
  7. Oct 03, 2018
    • kvm: arm64: Allow tuning the physical address size for VM · 233a7cb2
      Suzuki K Poulose authored
      
      
      Allow specifying the physical address size limit for a new
      VM via the kvm_type argument for the KVM_CREATE_VM ioctl. This
      allows us to finalise the stage2 page table as early as possible
      and hence perform the right checks on the memory slots
      without complication. The size is encoded as Log2(PA_Size) in
      bits[7:0] of the type field. For backward compatibility the
      value 0 is reserved and implies 40 bits. Also, lift the IPA limit to the
      host limit and allow lower IPA sizes (e.g. 32).
      
      Userspace can check the extension KVM_CAP_ARM_VM_IPA_SIZE for the
      availability of this feature. The cap check returns the maximum physical
      address shift supported by the host.
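
      A sketch of the intended flow (kvm_fd is an open /dev/kvm descriptor):

        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        /* Create a VM with the largest IPA size the host supports. */
        static int create_vm_max_ipa(int kvm_fd)
        {
            int ipa = ioctl(kvm_fd, KVM_CHECK_EXTENSION,
                            KVM_CAP_ARM_VM_IPA_SIZE);

            if (ipa <= 0)    /* extension absent: only the 40-bit default */
                return ioctl(kvm_fd, KVM_CREATE_VM, 0);

            return ioctl(kvm_fd, KVM_CREATE_VM, KVM_VM_TYPE_ARM_IPA_SIZE(ipa));
        }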
      
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Christoffer Dall <cdall@kernel.org>
      Cc: Peter Maydell <peter.maydell@linaro.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
  8. Sep 19, 2018
    • KVM: x86: Control guest reads of MSR_PLATFORM_INFO · 6fbbde9a
      Drew Schmitt authored
      
      
      Add KVM_CAP_MSR_PLATFORM_INFO so that userspace can disable guest access
      to reads of MSR_PLATFORM_INFO.
      
      Disabling read access to this MSR gives userspace control over whether to
      "expose" this platform-dependent information to guests. As it exists
      today, guests that read this MSR would get unpopulated information if userspace
      hadn't already set it (and prior to this patch series, only the CPUID faulting
      information could have been populated). This existing interface could be
      confusing if guests don't handle the potential for incorrect/incomplete
      information gracefully (e.g. zero reported for base frequency).
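
      A sketch of turning guest reads off (vm_fd is assumed):

        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        /* After this, a guest RDMSR of MSR_PLATFORM_INFO raises #GP
         * instead of returning possibly unpopulated values.          */
        static int disallow_platform_info_reads(int vm_fd)
        {
            struct kvm_enable_cap cap = {
                .cap = KVM_CAP_MSR_PLATFORM_INFO,
                .args[0] = 0,            /* 0: reads are no longer allowed */
            };

            return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
        }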
      
      Signed-off-by: Drew Schmitt <dasch@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  9. Sep 12, 2018
  10. Aug 22, 2018
  11. Aug 06, 2018
    • kvm: nVMX: Introduce KVM_CAP_NESTED_STATE · 8fcc4b59
      Jim Mattson authored
      
      
      For nested virtualization, L0 KVM manages a bit of state for L2 guests;
      this state cannot be captured through the currently available IOCTLs. In
      fact, the state captured through all of these IOCTLs is usually a mix of
      L1 and L2 state. It also depends on whether the L2 guest was running at
      the moment the process was interrupted to save its state.
      
      With this capability, there are two new vcpu ioctls: KVM_GET_NESTED_STATE
      and KVM_SET_NESTED_STATE. These can be used for saving and restoring a VM
      that is in VMX operation.
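
      A sketch of the save/restore flow (vcpu_fd and the buffer size are
      assumptions; real code should size the buffer from
      KVM_CHECK_EXTENSION(KVM_CAP_NESTED_STATE)):

        #include <stdlib.h>
        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        /* Save the nested state on the source and restore it on the
         * destination; the two calls normally run in different processes. */
        static void save_restore_nested_state(int vcpu_fd)
        {
            size_t buf_size = 16384;    /* assumed to be large enough */
            struct kvm_nested_state *state = calloc(1, buf_size);

            state->size = buf_size;
            ioctl(vcpu_fd, KVM_GET_NESTED_STATE, state);   /* source      */
            /* ... state travels with the migration stream ... */
            ioctl(vcpu_fd, KVM_SET_NESTED_STATE, state);   /* destination */

            free(state);
        }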
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Jim Mattson <jmattson@google.com>
      [karahmed@ - rename structs and functions and make them ready for AMD and
                   address previous comments.
                 - handle nested.smm state.
                 - rebase & a bit of refactoring.
                 - Merge 7/8 and 8/8 into one patch. ]
      Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  12. Jul 30, 2018
    • KVM: s390: Add huge page enablement control · a4499382
      Janosch Frank authored
      
      
      General KVM huge page support on s390 has to be enabled via the
      kvm.hpage module parameter. Either nested or hpage can be enabled, as we
      currently do not support vSIE for hugepage-backed guests. Once vSIE
      support is added we will either drop the parameter or enable it by
      default.
      
      For a guest the feature has to be enabled through the new
      KVM_CAP_S390_HPAGE_1M capability and the hpage module
      parameter. Enabling it means that cmm can't be enabled for the VM and
      disables pfmf and storage key interpretation.
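
      A sketch of the userspace side (vm_fd is assumed; the call fails unless
      the kvm.hpage module parameter is set):

        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        /* Opt the guest in to 1M huge page backing; after this, cmm can no
         * longer be enabled for the VM.                                    */
        static int enable_hpage_1m(int vm_fd)
        {
            struct kvm_enable_cap cap = { .cap = KVM_CAP_S390_HPAGE_1M };

            return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
        }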
      
      This is due to the fact that in some cases, in upcoming patches, we have
      to split huge pages in the guest mapping to be able to set more granular
      memory protection on 4k pages. These split pages have fake page tables
      that are not visible to the Linux memory management, which therefore will
      not manage their PGSTEs, while the SIE will. Disabling these features
      lets us manage PGSTE data in a consistent manner and solves that problem.
      
      Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
  13. Jul 21, 2018
  14. Jun 22, 2018
  15. May 26, 2018
  16. Apr 20, 2018
    • arm/arm64: KVM: Add PSCI version selection API · 85bd0ba1
      Marc Zyngier authored
      
      
      Although we've implemented PSCI 0.1, 0.2 and 1.0, we expose either 0.1
      or 1.0 to a guest, defaulting to the latest version of the PSCI
      implementation that is compatible with the requested version. This is
      no different from doing a firmware upgrade on KVM.
      
      But in order to give a chance to hypothetical badly implemented guests
      that would have a fit by discovering something other than PSCI 0.2,
      let's provide a new API that allows userspace to pick one particular
      version of the API.
      
      This is implemented as a new class of "firmware" registers, where
      we expose the PSCI version. This allows the PSCI version to be
      saved/restored as part of a guest migration, and also set to
      any supported version if the guest requires it.
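
      A sketch of pinning a guest to PSCI 0.2 through the new register
      (vcpu_fd is assumed; the value uses the usual major << 16 | minor
      encoding):

        #include <stdint.h>
        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        /* Select PSCI 0.2 for a vCPU via the firmware register. */
        static int set_psci_0_2(int vcpu_fd)
        {
            uint64_t version = (0 << 16) | 2;    /* PSCI 0.2 */
            struct kvm_one_reg reg = {
                .id   = KVM_REG_ARM_PSCI_VERSION,
                .addr = (uintptr_t)&version,
            };

            return ioctl(vcpu_fd, KVM_SET_ONE_REG, &reg);
        }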
      
      Cc: stable@vger.kernel.org #4.16
      Reviewed-by: Christoffer Dall <cdall@kernel.org>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
  17. Mar 28, 2018
  18. Mar 16, 2018
  19. Mar 14, 2018
  20. Mar 06, 2018
  21. Mar 01, 2018
  22. Jan 19, 2018
    • KVM: PPC: Book3S: Provide information about hardware/firmware CVE workarounds · 3214d01f
      Paul Mackerras authored
      
      
      This adds a new ioctl, KVM_PPC_GET_CPU_CHAR, that gives userspace
      information about the underlying machine's level of vulnerability
      to the recently announced vulnerabilities CVE-2017-5715,
      CVE-2017-5753 and CVE-2017-5754, and whether the machine provides
      instructions to assist software to work around the vulnerabilities.
      
      The ioctl returns two u64 words describing characteristics of the
      CPU and required software behaviour respectively, plus two mask
      words which indicate which bits have been filled in by the kernel,
      for extensibility.  The bit definitions are the same as for the
      new H_GET_CPU_CHARACTERISTICS hypercall.
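
      A sketch of reading the characteristics (vm_fd is assumed; this is a VM
      ioctl):

        #include <stdio.h>
        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        /* Print the CPU characteristics, masking out bits the kernel
         * did not fill in.                                            */
        static void print_cpu_char(int vm_fd)
        {
            struct kvm_ppc_cpu_char cc;

            if (ioctl(vm_fd, KVM_PPC_GET_CPU_CHAR, &cc) == 0)
                printf("character %#llx  behaviour %#llx\n",
                       (unsigned long long)(cc.character & cc.character_mask),
                       (unsigned long long)(cc.behaviour & cc.behaviour_mask));
        }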
      
      There is also a new capability, KVM_CAP_PPC_GET_CPU_CHAR, which
      indicates whether the new ioctl is available.
      
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  23. Jan 16, 2018
    • KVM: PPC: Book3S HV: Enable migration of decrementer register · 5855564c
      Paul Mackerras authored
      
      
      This adds a register identifier for use with the one_reg interface
      to allow the decrementer expiry time to be read and written by
      userspace.  The decrementer expiry time is in guest timebase units
      and is equal to the sum of the decrementer and the guest timebase.
      (The expiry time is used rather than the decrementer value itself
      because the expiry time is not constantly changing, though the
      decrementer value is, while the guest vcpu is not running.)
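
      A sketch of using the new register with the one_reg interface (the fd
      names are assumptions; in practice the get and the set run in different
      QEMU instances):

        #include <stdint.h>
        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        /* Carry the decrementer expiry time (in guest timebase units)
         * from the source vCPU to the destination vCPU.               */
        static void migrate_dec_expiry(int src_vcpu_fd, int dst_vcpu_fd)
        {
            uint64_t expiry;
            struct kvm_one_reg reg = {
                .id   = KVM_REG_PPC_DEC_EXPIRY,
                .addr = (uintptr_t)&expiry,
            };

            ioctl(src_vcpu_fd, KVM_GET_ONE_REG, &reg);   /* DEC + timebase */
            ioctl(dst_vcpu_fd, KVM_SET_ONE_REG, &reg);
        }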
      
      Without this, a guest vcpu migrated to a new host will see its
      decrementer set to some random value.  On POWER8 and earlier, the
      decrementer is 32 bits wide and counts down at 512MHz, so the
      guest vcpu will potentially see no decrementer interrupts for up
      to about 4 seconds, which will lead to a stall.  With POWER9, the
      decrementer is now 56 bits wide, so the stall can be much longer
      (up to 2.23 years) and more noticeable.
      
      To help work around the problem in cases where userspace has not been
      updated to migrate the decrementer expiry time, we now set the
      default decrementer expiry at vcpu creation time to the current time
      rather than the maximum possible value.  This should mean an
      immediate decrementer interrupt when a migrated vcpu starts
      running.  In cases where the decrementer is 32 bits wide and more
      than 4 seconds elapse between the creation of the vcpu and when it
      first runs, the decrementer would have wrapped around to positive
      values and there may still be a stall - but this is no worse than
      the current situation.  In the large-decrementer case, we are sure
      to get an immediate decrementer interrupt (assuming the time from
      vcpu creation to first run is less than 2.23 years) and we thus
      avoid a very long stall.
      
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  24. Dec 06, 2017
  25. Dec 04, 2017
    • KVM: Introduce KVM_MEMORY_ENCRYPT_{UN,}REG_REGION ioctl · 69eaedee
      Brijesh Singh authored
      
      
      If hardware supports memory encryption then the KVM_MEMORY_ENCRYPT_REG_REGION
      and KVM_MEMORY_ENCRYPT_UNREG_REGION ioctls can be used by userspace to
      register/unregister the guest memory regions which may contain the
      encrypted data (e.g. guest RAM, PCI BAR, SMRAM, etc.).
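
      The regions are described by their userspace address and size; a sketch
      (vm_fd, hva and len are assumed):

        #include <stdint.h>
        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        /* Register one guest memory region that may hold encrypted data. */
        static int register_enc_region(int vm_fd, void *hva, uint64_t len)
        {
            struct kvm_enc_region region = {
                .addr = (uintptr_t)hva,    /* userspace virtual address */
                .size = len,
            };

            return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_REG_REGION, &region);
        }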
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Improvements-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Reviewed-by: Borislav Petkov <bp@suse.de>
    • KVM: Introduce KVM_MEMORY_ENCRYPT_OP ioctl · 5acc5c06
      Brijesh Singh authored
      
      
      If the hardware supports memory encryption then the
      KVM_MEMORY_ENCRYPT_OP ioctl can be used by qemu to issue platform-specific
      memory encryption commands.
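
      A sketch of issuing one such command; the SEV command set itself comes
      from the later SEV patches, so KVM_SEV_INIT here is only an example
      (vm_fd and sev_fd are assumed):

        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        /* Hand a platform-specific command (here SEV's INIT) to KVM. */
        static int sev_init(int vm_fd, int sev_fd)
        {
            struct kvm_sev_cmd cmd = {
                .id     = KVM_SEV_INIT,
                .sev_fd = sev_fd,        /* /dev/sev file descriptor */
            };

            return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
        }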
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: Borislav Petkov <bp@suse.de>