  1. Mar 28, 2019
    • Documentation: kvm: clarify KVM_SET_USER_MEMORY_REGION · e2788c4a
      Paolo Bonzini authored
      
      
      The documentation does not mention how to delete a slot; add that
      information.
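
      A minimal userspace sketch of the documented behavior: a slot is
      deleted by calling KVM_SET_USER_MEMORY_REGION with memory_size set to
      zero (vm_fd and the slot number are illustrative; error handling
      omitted):

        #include <string.h>
        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        /* Delete memslot 'slot' by specifying a zero memory_size. */
        static int delete_memslot(int vm_fd, unsigned int slot)
        {
            struct kvm_userspace_memory_region region;

            memset(&region, 0, sizeof(region));
            region.slot = slot;        /* slot to delete */
            region.memory_size = 0;    /* zero size means "delete" */
            /* guest_phys_addr and userspace_addr are ignored here */
            return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
        }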
      
      Reported-by: Nathaniel McCallum <npmccallum@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: doc: Document the life cycle of a VM and its resources · 919f6cd8
      Sean Christopherson authored
      The series to add memcg accounting to KVM allocations[1] states:
      
        There are many KVM kernel memory allocations which are tied to the
        life of the VM process and should be charged to the VM process's
        cgroup.
      
      While it is correct to account KVM kernel allocations to the cgroup of
      the process that created the VM, it's technically incorrect to state
      that the KVM kernel memory allocations are tied to the life of the VM
      process.  This is because the VM itself, i.e. struct kvm, is not tied to
      the life of the process which created it, rather it is tied to the life
      of its associated file descriptor.  In other words, kvm_destroy_vm() is
      not invoked until fput() decrements its associated file's refcount to
      zero.  A simple example is to fork() in Qemu and have the child sleep
      indefinitely; kvm_destroy_vm() isn't called until Qemu closes its file
      descriptor *and* the rogue child is killed.
      
      The allocations are guaranteed to be *accounted* to the process which
      created the VM, but only because KVM's per-{VM,vCPU} ioctls reject the
      ioctl() with -EIO if kvm->mm != current->mm.  I.e. the child can keep
      the VM "alive" but can't do anything useful with its reference.
      
      Note that because 'struct kvm' also holds a reference to the mm_struct
      of its owner, the above behavior also applies to userspace allocations.
      
      Given that mucking with a VM's file descriptor can lead to subtle and
      undesirable behavior, e.g. memcg charges persisting after a VM is shut
      down, explicitly document a VM's lifecycle and its impact on the VM's
      resources.
      
      Alternatively, KVM could aggressively free resources when the creating
      process exits, e.g. via mmu_notifier->release().  However, mmu_notifier
      isn't guaranteed to be available, and freeing resources when the creator
      exits is likely to be error prone and fragile as KVM would need to
      ensure that it only freed resources that are truly out of reach. In
      practice, the existing behavior shouldn't be problematic as a properly
      configured system will prevent a child process from being moved out of
      the appropriate cgroup hierarchy, i.e. prevent hiding the process from
      the OOM killer, and will prevent an unprivileged user from being able
      to hold a reference to struct kvm via another method, e.g. debugfs.
      
      [1] https://patchwork.kernel.org/patch/10806707/
      
      
      
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Reject device ioctls from processes other than the VM's creator · ddba9180
      Sean Christopherson authored
      
      
      KVM's API requires that ioctls be issued from the same process
      that created the VM.  In other words, userspace can play games with a
      VM's file descriptors, e.g. fork(), SCM_RIGHTS, etc..., but only the
      creator can do anything useful.  Explicitly reject device ioctls that
      are issued by a process other than the VM's creator, and update KVM's
      API documentation to extend its requirements to device ioctls.
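
      A sketch of the kernel-side logic: the device ioctl handler now bails
      out up front, mirroring the per-VM/vCPU ioctl check (abbreviated):

        static long kvm_device_ioctl(struct file *filp, unsigned int ioctl,
                                     unsigned long arg)
        {
            struct kvm_device *dev = filp->private_data;

            /* Only the creator (same mm) may issue device ioctls. */
            if (dev->kvm->mm != current->mm)
                return -EIO;
            /* ... existing dispatch on the ioctl number ... */
        }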
      
      Fixes: 852b6d57 ("kvm: add device control API")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: doc: Fix incorrect word ordering regarding supported use of APIs · 5e124900
      Sean Christopherson authored
      Per Paolo[1], instantiating multiple VMs in a single process is legal,
      but this conflicts with KVM's API documentation, which states:
      
        The only supported use is one virtual machine per process, and one
        vcpu per thread.
      
      However, an earlier section in the documentation states:
      
         Only run VM ioctls from the same process (address space) that was used
         to create the VM.
      
      and:
      
         Only run vcpu ioctls from the same thread that was used to create the
         vcpu.
      
      This suggests that the conflicting documentation is simply an incorrect
      ordering of words, i.e. what's really meant is that a virtual machine
      can't be shared across multiple processes and a vCPU can't be shared
      across multiple threads.
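
      A minimal sketch of the supported usage, one vCPU per thread (pthread
      style; the vcpu_args struct carrying vm_fd, id and mmap_size is
      illustrative, with mmap_size obtained via KVM_GET_VCPU_MMAP_SIZE;
      error handling omitted):

        /* Create the vCPU and issue all of its ioctls from this thread. */
        static void *vcpu_thread(void *arg)
        {
            struct vcpu_args *a = arg;    /* vm_fd, id, mmap_size */
            int vcpu_fd = ioctl(a->vm_fd, KVM_CREATE_VCPU, a->id);
            struct kvm_run *run = mmap(NULL, a->mmap_size,
                                       PROT_READ | PROT_WRITE,
                                       MAP_SHARED, vcpu_fd, 0);

            while (ioctl(vcpu_fd, KVM_RUN, 0) == 0) {
                /* handle run->exit_reason ... */
            }
            return NULL;
        }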
      
      Tweak the blurb on issuing ioctls to use a more assertive tone, and
      rewrite the "supported use" sentence to reference said blurb instead of
      poorly restating it in different terms.
      
      Opportunistically add missing punctuation.
      
      [1] https://lkml.kernel.org/r/f23265d4-528e-3bd4-011f-4d7b8f3281db@redhat.com
      
      
      
      Fixes: 9c1b96e3 ("KVM: Document basic API")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      [Improve notes on asynchronous ioctl]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: fix handling of role.cr4_pae and rename it to 'gpte_size' · 47c42e6b
      Sean Christopherson authored
      
      
      The cr4_pae flag is a bit of a misnomer, its purpose is really to track
      whether the guest PTE that is being shadowed is a 4-byte entry or an
      8-byte entry.  Prior to supporting nested EPT, the size of the gpte was
      reflected purely by CR4.PAE.  KVM fudged things a bit for direct sptes,
      but it was mostly harmless since the size of the gpte never mattered.
      Now that a spte may be tracking an indirect EPT entry, relying on
      CR4.PAE is wrong, and the name is misleading.
      
      For direct shadow pages, force the gpte_size to '1' as they are always
      8-byte entries; EPT entries can only be 8-bytes and KVM always uses
      8-byte entries for NPT and its identity map (when running with EPT but
      not unrestricted guest).
      
      Likewise, nested EPT entries are always 8-bytes.  Nested EPT presents a
      unique scenario as the size of the entries is not dictated by CR4.PAE,
      but neither is the shadow page a direct map.  To handle this scenario,
      set cr0_wp=1 and smap_andnot_wp=1, an otherwise impossible combination,
      to denote a nested EPT shadow page.  Use the information to avoid
      incorrectly zapping an unsync'd indirect page in __kvm_sync_page().
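
      Restated as code, the encoding described above is roughly the
      following (field names follow the commit message; 'direct' and
      'shadowing_nested_ept' are illustrative, not the literal diff):

        /* Direct shadow pages always shadow 8-byte entries. */
        if (direct)
            role.gpte_size = 1;

        /* Nested EPT: 8-byte entries, flagged via the otherwise
         * impossible cr0_wp=1 + smap_andnot_wp=1 combination. */
        if (shadowing_nested_ept) {
            role.gpte_size = 1;
            role.cr0_wp = 1;
            role.smap_andnot_wp = 1;
        }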
      
      Providing a consistent and accurate gpte_size fixes a bug reported by
      Vitaly where fast_cr3_switch() always fails when switching from L2 to
      L1 as kvm_mmu_get_page() would force role.cr4_pae=0 for direct pages,
      whereas kvm_calc_mmu_role_common() would set it according to CR4.PAE.
      
      Fixes: 7dcd5755 ("x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is needed")
      Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. Mar 15, 2019
    • KVM: doc: Document the life cycle of a VM and its resources · eca6be56
      Sean Christopherson authored
      The series to add memcg accounting to KVM allocations[1] states:
      
        There are many KVM kernel memory allocations which are tied to the
        life of the VM process and should be charged to the VM process's
        cgroup.
      
      While it is correct to account KVM kernel allocations to the cgroup of
      the process that created the VM, it's technically incorrect to state
      that the KVM kernel memory allocations are tied to the life of the VM
      process.  This is because the VM itself, i.e. struct kvm, is not tied to
      the life of the process which created it, rather it is tied to the life
      of its associated file descriptor.  In other words, kvm_destroy_vm() is
      not invoked until fput() decrements its associated file's refcount to
      zero.  A simple example is to fork() in Qemu and have the child sleep
      indefinitely; kvm_destroy_vm() isn't called until Qemu closes its file
      descriptor *and* the rogue child is killed.
      
      The allocations are guaranteed to be *accounted* to the process which
      created the VM, but only because KVM's per-{VM,vCPU} ioctls reject the
      ioctl() with -EIO if kvm->mm != current->mm.  I.e. the child can keep
      the VM "alive" but can't do anything useful with its reference.
      
      Note that because 'struct kvm' also holds a reference to the mm_struct
      of its owner, the above behavior also applies to userspace allocations.
      
      Given that mucking with a VM's file descriptor can lead to subtle and
      undesirable behavior, e.g. memcg charges persisting after a VM is shut
      down, explicitly document a VM's lifecycle and its impact on the VM's
      resources.
      
      Alternatively, KVM could aggressively free resources when the creating
      process exits, e.g. via mmu_notifier->release().  However, mmu_notifier
      isn't guaranteed to be available, and freeing resources when the creator
      exits is likely to be error prone and fragile as KVM would need to
      ensure that it only freed resources that are truly out of reach. In
      practice, the existing behavior shouldn't be problematic as a properly
      configured system will prevent a child process from being moved out of
      the appropriate cgroup hierarchy, i.e. prevent hiding the process from
      the OOM killer, and will prevent an unprivileged user from being able
      to hold a reference to struct kvm via another method, e.g. debugfs.
      
      [1] https://patchwork.kernel.org/patch/10806707/
      
      
      
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. Feb 20, 2019
    • KVM: Expose the initial start value in grow_halt_poll_ns() as a module parameter · 49113d36
      Nir Weiner authored
      
      
      The hard-coded value 10000 in grow_halt_poll_ns() is the initial
      start value used when raising vcpu->halt_poll_ns; in effect, it sets
      the timeout of the first polling session.
      This value has a significant effect on how tolerant we are to outliers.
      In the common case, a higher value is better: we spend more time in
      the polling busyloop, handle events/interrupts faster and get better
      performance.
      But on outliers it puts us in a busy loop that does nothing; even if
      the shrink factor is zero, we still waste time on the first iteration.
      The optimal value varies between workloads, depending on the outlier
      rate and the length of polling sessions.
      As this value has such a significant effect on the dynamic halt-polling
      algorithm, it should be configurable and exposed.
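
      A sketch of the growth step with the start value as a module
      parameter (an illustration of the change, not the literal diff):

        /* Was hard-coded to 10000 ns; now a module parameter. */
        static unsigned int halt_poll_ns_grow_start = 10000;
        module_param(halt_poll_ns_grow_start, uint, 0644);

        static void grow_halt_poll_ns(struct kvm_vcpu *vcpu)
        {
            unsigned int val = vcpu->halt_poll_ns * halt_poll_ns_grow;

            if (val < halt_poll_ns_grow_start)
                val = halt_poll_ns_grow_start;
            vcpu->halt_poll_ns = val;
        }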
      
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Nir Weiner <nir.weiner@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "KVM: MMU: document fast invalidate all pages" · a592a3b8
      Sean Christopherson authored
      Remove x86 KVM's fast invalidate mechanism, i.e. revert all patches
      from the original series[1].
      
      Though not explicitly stated, for all intents and purposes the fast
      invalidate mechanism was added to speed up the scenario where removing
      a memslot, e.g. as part of reading a PCI ROM, caused KVM to
      flush all shadow entries[1].  Now that the memslot case flushes only
      shadow entries belonging to the memslot, i.e. doesn't use the fast
      invalidate mechanism, the only remaining usages of the mechanism are
      when the VM is being destroyed and when the MMIO generation rolls
      over.
      
      When a VM is being destroyed, either there are no active vcpus, i.e.
      there's no lock contention, or the VM has ungracefully terminated, in
      which case we want to reclaim its pages as quickly as possible, i.e.
      not release the MMU lock if there are still CPUs executing in the VM.
      
      The MMIO generation scenario is almost literally a one-in-a-million
      occurrence, i.e. is not a performance sensitive scenario.
      
      Given that lock-breaking is not desirable (VM teardown) or irrelevant
      (MMIO generation overflow), remove the fast invalidate mechanism to
      simplify the code (a small amount) and to discourage future code from
      zapping all pages as using such a big hammer should be a last resort.
      
      This reverts commit f6f8adee.
      
      [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
      
      
      
      Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Move the memslot update in-progress flag to bit 63 · 164bf7e5
      Sean Christopherson authored
      
      
      ...now that KVM won't explode by moving it out of bit 0.  Using bit 63
      eliminates the need to jump over bit 0, e.g. when calculating a new
      memslots generation or when propagating the memslots generation to an
      MMIO spte.
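
      In code form this is essentially just (macro name per the series;
      sketch, not the full diff):

        /* Bit 63 flags an in-progress memslot update; the generation
         * itself now occupies bits [62:0] contiguously, so nothing has
         * to skip over bit 0 anymore. */
        #define KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS  BIT_ULL(63)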
      
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  4. Dec 14, 2018
    • x86/kvm/hyper-v: Introduce KVM_GET_SUPPORTED_HV_CPUID · 2bc39970
      Vitaly Kuznetsov authored
      
      
      With every new Hyper-V enlightenment we implement, we're forced to add
      a KVM_CAP_HYPERV_* capability. While this approach works, it is fairly
      inconvenient: the majority of the enlightenments have corresponding
      CPUID feature bit(s), and userspace has to know these anyway to be
      able to expose the feature to the guest.
      
      Add KVM_GET_SUPPORTED_HV_CPUID ioctl (backed by KVM_CAP_HYPERV_CPUID, "one
      cap to rule them all!") returning all Hyper-V CPUID feature leaves.
      
      Using the existing KVM_GET_SUPPORTED_CPUID doesn't seem to be possible:
      Hyper-V CPUID feature leaves intersect with KVM's (e.g. 0x40000000,
      0x40000001) and we would probably confuse userspace in case we decide to
      return these twice.
      
      KVM_CAP_HYPERV_CPUID's number is interim: we intend to drop
      KVM_CAP_HYPERV_STIMER_DIRECT and use its number instead.
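
      A hedged userspace sketch of the new vcpu ioctl (the entry count is
      illustrative; error handling omitted):

        int nent = 16;
        struct kvm_cpuid2 *cpuid;

        cpuid = calloc(1, sizeof(*cpuid) +
                          nent * sizeof(struct kvm_cpuid_entry2));
        cpuid->nent = nent;
        if (ioctl(vcpu_fd, KVM_GET_SUPPORTED_HV_CPUID, cpuid) == 0) {
            for (int i = 0; i < cpuid->nent; i++)
                printf("leaf 0x%x: %x %x %x %x\n",
                       cpuid->entries[i].function,
                       cpuid->entries[i].eax, cpuid->entries[i].ebx,
                       cpuid->entries[i].ecx, cpuid->entries[i].edx);
        }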
      
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: introduce manual dirty log reprotect · 2a31b9db
      Paolo Bonzini authored
      
      
      There are two problems with KVM_GET_DIRTY_LOG.  First, and less important,
      it can take kvm->mmu_lock for an extended period of time.  Second, its user
      can actually see many false positives in some cases.  The latter is due
      to a benign race like this:
      
        1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
           them.
        2. The guest modifies the pages, causing them to be marked dirty.
        3. Userspace actually copies the pages.
        4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
           they were not written to since (3).
      
      This is especially a problem for large guests, where the time between
      (1) and (3) can be substantial.  This patch introduces a new
      capability which, when enabled, makes KVM_GET_DIRTY_LOG not
      write-protect the pages it returns.  Instead, userspace has to
      explicitly clear the dirty log bits just before using the content
      of the page.  The new KVM_CLEAR_DIRTY_LOG ioctl can also operate at a
      64-page granularity rather than requiring a full-memslot sync;
      this way, the mmu_lock is taken for small amounts of time, and
      only a small amount of time will pass between write protection
      of pages and the sending of their content.
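
      A sketch of the resulting userspace flow (slot, gfn and bitmap
      variables are illustrative; error handling omitted):

        /* 1. Fetch the dirty log; with the capability enabled this no
         *    longer write-protects the pages. */
        struct kvm_dirty_log log = { .slot = slot, .dirty_bitmap = bitmap };
        ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log);

        /* 2. Just before copying a chunk, explicitly reprotect it;
         *    first_page/num_pages work at 64-page granularity. */
        struct kvm_clear_dirty_log clear = {
            .slot = slot,
            .first_page = first_gfn,
            .num_pages = 64,
            .dirty_bitmap = chunk_bitmap,
        };
        ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clear);
        /* 3. Now copy the pages' contents. */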
      
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: make KVM_CAP_ENABLE_CAP_VM architecture agnostic · e5d83c74
      Paolo Bonzini authored
      
      
      The first such capability to be handled in virt/kvm/ will be manual
      dirty page reprotection.
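
      With this change, such capabilities are enabled on the VM fd in the
      usual way, e.g. for the manual dirty log reprotect capability
      introduced above (sketch; error handling omitted):

        struct kvm_enable_cap cap = {
            .cap = KVM_CAP_MANUAL_DIRTY_LOG_PROTECT,
        };
        ioctl(vm_fd, KVM_ENABLE_CAP, &cap);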
      
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  5. Oct 17, 2018
    • kvm: x86: Introduce KVM_CAP_EXCEPTION_PAYLOAD · c4f55198
      Jim Mattson authored
      
      
      This is a per-VM capability which can be enabled by userspace so that
      the faulting linear address will be included with the information
      about a pending #PF in L2, and the "new DR6 bits" will be included
      with the information about a pending #DB in L2. With this capability
      enabled, the L1 hypervisor can now intercept #PF before CR2 is
      modified. Under VMX, the L1 hypervisor can now intercept #DB before
      DR6 and DR7 are modified.
      
      When userspace has enabled KVM_CAP_EXCEPTION_PAYLOAD, it should
      generally provide an appropriate payload when injecting a #PF or #DB
      exception via KVM_SET_VCPU_EVENTS. However, to support restoring old
      checkpoints, this payload is not required.
      
      Note that bit 16 of the "new DR6 bits" is set to indicate that a debug
      exception (#DB) or a breakpoint exception (#BP) occurred inside an RTM
      region while advanced debugging of RTM transactional regions was
      enabled. This is the reverse of DR6.RTM, which is cleared in this
      scenario.
      
      This capability also enables exception.pending in struct
      kvm_vcpu_events, which allows userspace to distinguish between pending
      and injected exceptions.
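
      A hedged sketch of injecting a pending #PF with its payload once the
      capability is enabled (fault address and error code illustrative):

        struct kvm_vcpu_events events;

        ioctl(vcpu_fd, KVM_GET_VCPU_EVENTS, &events);
        events.exception.pending = 1;           /* pending, not injected */
        events.exception.injected = 0;
        events.exception.nr = 14;               /* #PF */
        events.exception.has_error_code = 1;
        events.exception.error_code = 0;
        events.exception_has_payload = 1;
        events.exception_payload = 0xdeadb000;  /* becomes CR2 on delivery */
        ioctl(vcpu_fd, KVM_SET_VCPU_EVENTS, &events);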
      
      Reported-by: Jim Mattson <jmattson@google.com>
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: Add exception payload fields to kvm_vcpu_events · 59073aaf
      Jim Mattson authored
      
      
      The per-VM capability KVM_CAP_EXCEPTION_PAYLOAD (to be introduced in a
      later commit) adds the following fields to struct kvm_vcpu_events:
      exception_has_payload, exception_payload, and exception.pending.
      
      With this capability set, all of the details of vcpu->arch.exception,
      including the payload for a pending exception, are reported to
      userspace in response to KVM_GET_VCPU_EVENTS.
      
      With this capability clear, the original ABI is preserved, and the
      exception.injected field is set for either pending or injected
      exceptions.
      
      When userspace calls KVM_SET_VCPU_EVENTS with
      KVM_CAP_EXCEPTION_PAYLOAD clear, exception.injected is no longer
      translated to exception.pending. KVM_SET_VCPU_EVENTS can now only
      establish a pending exception when KVM_CAP_EXCEPTION_PAYLOAD is set.
      
      Reported-by: Jim Mattson <jmattson@google.com>
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  6. Oct 16, 2018
    • KVM: Documentation: Fix omission in struct kvm_vcpu_events · bba9ce58
      Jim Mattson authored
      
      
      The header file indicates that there are 36 reserved bytes at the end
      of this structure. Adjust the documentation to agree with the header
      file.
      
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm/x86 : add coalesced pio support · 0804c849
      Peng Hao authored
      
      
      Coalesced PIO is based on coalesced MMIO and can be used for ports
      such as the RTC port, the PCI host config ports and so on.

      The RTC is a notable case: some versions of Windows guests access the
      RTC frequently because it serves as the system tick. The guest accesses
      the RTC by writing the register index to port 0x70 and then writing or
      reading data through port 0x71; the write to port 0x70 only selects the
      index and has no other side effect, so coalesced PIO can handle this
      pattern and reduce VM-exit time.

      Guests also access the PCI host config ports frequently while starting
      up and shutting down, so marking those ports as coalesced PIO reduces
      startup and shutdown time.
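
      A sketch of registering the RTC index port as a coalesced PIO zone,
      assuming the zone struct's pio flag added by this series (error
      handling omitted):

        struct kvm_coalesced_mmio_zone zone = {
            .addr = 0x70,   /* RTC index port */
            .size = 1,
            .pio  = 1,      /* port I/O rather than MMIO */
        };
        ioctl(vm_fd, KVM_REGISTER_COALESCED_MMIO, &zone);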
      
      Without this patch, the VM-exit times of accesses to RTC port 0x70 and
      PIIX port 0xcf8, measured with perf (guest OS: Windows 7 64-bit):

      IO Port Access  Samples  Samples%  Time%   Min Time  Max Time  Avg time
      0x70:POUT            86   30.99%   74.59%   9us       29us     10.75us (+- 3.41%)
      0xcf8:POUT         1119    2.60%    2.12%   2.79us    56.83us   3.41us (+- 2.23%)

      With this patch:

      IO Port Access  Samples  Samples%  Time%   Min Time  Max Time  Avg time
      0x70:POUT           106   32.02%   29.47%   0us       10us      1.57us (+- 7.38%)
      0xcf8:POUT         1065    1.67%    0.28%   0.41us    65.44us   0.66us (+- 10.55%)
      
      Signed-off-by: Peng Hao <peng.hao2@zte.com.cn>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm/x86 : add document for coalesced mmio · 9943450b
      Peng Hao authored
      
      
      Signed-off-by: Peng Hao <peng.hao2@zte.com.cn>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: hyperv: implement PV IPI send hypercalls · 214ff83d
      Vitaly Kuznetsov authored
      
      
      Using a hypercall for sending IPIs is faster because it allows
      specifying any number of vCPUs (even > 64, with a sparse CPU set)
      while the whole procedure takes only one VMEXIT.

      The current Hyper-V TLFS (v5.0b) claims that the
      HvCallSendSyntheticClusterIpi hypercall can't be 'fast' (passing
      parameters through registers), but apparently this is not true:
      Windows always uses it as 'fast', so we need to support that.
      
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  7. Oct 03, 2018
    • kvm: arm64: Allow tuning the physical address size for VM · 233a7cb2
      Suzuki K Poulose authored
      
      
      Allow specifying the physical address size limit for a new
      VM via the kvm_type argument of the KVM_CREATE_VM ioctl. This
      allows us to finalise the stage2 page table as early as possible
      and hence perform the right checks on the memory slots
      without complication. The size is encoded as Log2(PA_Size) in
      bits[7:0] of the type field. For backward compatibility the
      value 0 is reserved and implies 40 bits. Also, lift the IPA limit
      to the host limit and allow lower IPA sizes (e.g., 32).

      Userspace can check the KVM_CAP_ARM_VM_IPA_SIZE extension for the
      availability of this feature; the capability check returns the
      maximum physical address shift supported by the host.
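
      A sketch of the userspace side (macro name per the uapi added by this
      series; error handling omitted):

        /* Returns the maximum PA shift supported by the host, or 0. */
        int max_shift = ioctl(kvm_fd, KVM_CHECK_EXTENSION,
                              KVM_CAP_ARM_VM_IPA_SIZE);

        /* Request a 40-bit IPA; passing a type of 0 keeps the
         * backward-compatible 40-bit default. */
        int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM,
                          KVM_VM_TYPE_ARM_IPA_SIZE(40));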
      
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Christoffer Dall <cdall@kernel.org>
      Cc: Peter Maydell <peter.maydell@linaro.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
  8. Sep 19, 2018
    • KVM: x86: Control guest reads of MSR_PLATFORM_INFO · 6fbbde9a
      Drew Schmitt authored
      
      
      Add KVM_CAP_MSR_PLATFORM_INFO so that userspace can disable guest access
      to reads of MSR_PLATFORM_INFO.
      
      Disabling read access to this MSR gives userspace control over how
      this platform-dependent information is exposed to guests. As it exists
      today, guests that read this MSR would get unpopulated information if
      userspace hadn't already set it (and prior to this patch series, only
      the CPUID faulting information could have been populated). This
      existing interface could be confusing if guests don't handle the
      potential for incorrect/incomplete information gracefully (e.g. zero
      reported for base frequency).
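
      A hedged sketch of turning off guest reads via the new capability
      (per its description, args[0] = 0 disables read access):

        struct kvm_enable_cap cap = {
            .cap = KVM_CAP_MSR_PLATFORM_INFO,
            .args = { 0 },   /* 0: guest reads of MSR_PLATFORM_INFO #GP */
        };
        ioctl(vm_fd, KVM_ENABLE_CAP, &cap);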
      
      Signed-off-by: Drew Schmitt <dasch@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  9. Sep 09, 2018
    • Drop all 00-INDEX files from Documentation/ · a7ddcea5
      Henrik Austad authored
      
      
      This is a respin with a wider audience (all that get_maintainer returned)
      and I know this spams a *lot* of people. Not sure what would be the correct
      way, so my apologies for ruining your inbox.
      
      The 00-INDEX files are supposed to give a summary of all files present
      in a directory, but these files are horribly out of date and their
      usefulness is brought into question. Often a simple "ls" would reveal
      the same information as the filenames are generally quite descriptive as
      a short introduction to what the file covers (it should not surprise
      anyone what Documentation/sched/sched-design-CFS.txt covers)
      
      A few years back it was mentioned that these files were no longer really
      needed, and they have since then grown further out of date, so perhaps
      it is time to just throw them out.
      
      A short status yields the following _outdated_ 00-INDEX files; the
      first counter is files listed in 00-INDEX but missing from the
      directory, the second is files present but not listed in 00-INDEX.
      
      List of outdated 00-INDEX:
      Documentation: (4/10)
      Documentation/sysctl: (0/1)
      Documentation/timers: (1/0)
      Documentation/blockdev: (3/1)
      Documentation/w1/slaves: (0/1)
      Documentation/locking: (0/1)
      Documentation/devicetree: (0/5)
      Documentation/power: (1/1)
      Documentation/powerpc: (0/5)
      Documentation/arm: (1/0)
      Documentation/x86: (0/9)
      Documentation/x86/x86_64: (1/1)
      Documentation/scsi: (4/4)
      Documentation/filesystems: (2/9)
      Documentation/filesystems/nfs: (0/2)
      Documentation/cgroup-v1: (0/2)
      Documentation/kbuild: (0/4)
      Documentation/spi: (1/0)
      Documentation/virtual/kvm: (1/0)
      Documentation/scheduler: (0/2)
      Documentation/fb: (0/1)
      Documentation/block: (0/1)
      Documentation/networking: (6/37)
      Documentation/vm: (1/3)
      
      Then there are 364 subdirectories in Documentation/ with several files that
      are missing 00-INDEX altogether (and another 120 with a single file and no
      00-INDEX).
      
      I don't really have an opinion on whether or not we /should/ have
      00-INDEX, but the above 00-INDEX files should either be removed or
      kept up to date. If we keep the files, I can try to keep them updated,
      but I'd rather not if we just want to delete them anyway.
      
      As a starting point, remove all index-files and references to 00-INDEX and
      see where the discussion is going.
      
      Signed-off-by: Henrik Austad <henrik@austad.us>
      Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Just-do-it-by: Steven Rostedt <rostedt@goodmis.org>
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Acked-by: Paul Moore <paul@paul-moore.com>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Acked-by: Mark Brown <broonie@kernel.org>
      Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: [Almost everybody else]
      Signed-off-by: Jonathan Corbet <corbet@lwn.net>
  10. Aug 06, 2018
    • KVM: X86: Implement "send IPI" hypercall · 4180bf1b
      Wanpeng Li authored
      Use a hypercall to send IPIs with a single VM exit, instead of one exit
      per IPI in xAPIC/x2APIC physical mode and one exit per cluster in
      x2APIC cluster mode. Intel guests can enter x2APIC cluster mode when
      interrupt remapping is enabled in QEMU; the latest AMD EPYC, however,
      still supports only xAPIC mode, which benefits greatly from exit-less
      IPIs. This patchset lets a guest send multicast IPIs, with at most 128
      destinations per hypercall in 64-bit mode and 64 vCPUs per hypercall
      in 32-bit mode.
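
      A hedged sketch of the guest-side call: a 128-bit destination bitmap
      split across two arguments, the lowest APIC ID covered by the bitmap,
      and the ICR contents (variable names illustrative):

        /* Send the IPI described by 'icr' to all vCPUs marked in the
         * 128-bit bitmap, which starts at APIC ID 'min'. */
        ret = kvm_hypercall4(KVM_HC_SEND_IPI, ipi_bitmap_low,
                             ipi_bitmap_high, min, icr);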
      
      Hardware: Xeon Skylake 2.5GHz, 2 sockets, 40 cores, 80 threads; the VM
      has 80 vCPUs. IPI microbenchmark (https://lkml.org/lkml/2017/12/19/141):
      
      x2apic cluster mode, vanilla
      
       Dry-run:                         0,            2392199 ns
       Self-IPI:                  6907514,           15027589 ns
       Normal IPI:              223910476,          251301666 ns
       Broadcast IPI:                   0,         9282161150 ns
       Broadcast lock:                  0,         8812934104 ns
      
      x2apic cluster mode, pv-ipi
      
       Dry-run:                         0,            2449341 ns
       Self-IPI:                  6720360,           15028732 ns
       Normal IPI:              228643307,          255708477 ns
       Broadcast IPI:                   0,         7572293590 ns  => 22% performance boost
       Broadcast lock:                  0,         8316124651 ns
      
      x2apic physical mode, vanilla
      
       Dry-run:                         0,            3135933 ns
       Self-IPI:                  8572670,           17901757 ns
       Normal IPI:              226444334,          255421709 ns
       Broadcast IPI:                   0,        19845070887 ns
       Broadcast lock:                  0,        19827383656 ns
      
      x2apic physical mode, pv-ipi
      
       Dry-run:                         0,            2446381 ns
       Self-IPI:                  6788217,           15021056 ns
       Normal IPI:              219454441,          249583458 ns
       Broadcast IPI:                   0,         7806540019 ns  => 154% performance boost
       Broadcast lock:                  0,         9143618799 ns
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: nVMX: Introduce KVM_CAP_NESTED_STATE · 8fcc4b59
      Jim Mattson authored
      
      
      For nested virtualization, L0 KVM manages a bit of state for L2 guests;
      this state cannot be captured through the currently available IOCTLs.
      In fact, the state captured through all of these IOCTLs is usually a
      mix of L1 and L2 state. It is also dependent on whether the L2 guest
      was running at the moment the process was interrupted to save its state.
      
      With this capability, there are two new vcpu ioctls: KVM_GET_NESTED_STATE
      and KVM_SET_NESTED_STATE. These can be used for saving and restoring a VM
      that is in VMX operation.
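
      A hedged sketch of a save/restore cycle (buffer size illustrative;
      userspace sets size to the buffer's capacity before the GET):

        struct kvm_nested_state *state;

        state = calloc(1, sizeof(*state) + 8192);
        state->size = sizeof(*state) + 8192;
        ioctl(vcpu_fd, KVM_GET_NESTED_STATE, state);   /* checkpoint */
        /* ... migrate or save the VM ... */
        ioctl(vcpu_fd, KVM_SET_NESTED_STATE, state);   /* restore */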
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Jim Mattson <jmattson@google.com>
      [karahmed@ - rename structs and functions and make them ready for AMD and
                   address previous comments.
                 - handle nested.smm state.
                 - rebase & a bit of refactoring.
                 - Merge 7/8 and 8/8 into one patch. ]
      Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  11. Jul 30, 2018
    • KVM: s390: Add huge page enablement control · a4499382
      Janosch Frank authored
      
      
      General KVM huge page support on s390 has to be enabled via the
      kvm.hpage module parameter. Either nested or hpage can be enabled, as
      we currently do not support vSIE for huge backed guests. Once the vSIE
      support is added we will either drop the parameter or enable it as
      default.
      
      For a guest the feature has to be enabled through the new
      KVM_CAP_S390_HPAGE_1M capability and the hpage module
      parameter. Enabling it means that cmm can't be enabled for the VM and
      that pfmf and storage key interpretation are disabled.
      
      This is due to the fact that in some cases, in upcoming patches, we
      have to split huge pages in the guest mapping to be able to set more
      granular memory protection on 4k pages. These split pages have fake
      page tables that are not visible to the Linux memory management which
      subsequently will not manage its PGSTEs, while the SIE will. Disabling
      these features lets us manage PGSTE data in a consistent manner and
      solve that problem.
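
      A sketch of the userspace side once kvm.hpage is set (cap constant
      per this commit; error handling omitted):

        struct kvm_enable_cap cap = { .cap = KVM_CAP_S390_HPAGE_1M };
        ioctl(vm_fd, KVM_ENABLE_CAP, &cap);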
      
      Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>