api.rst

utilises it.

If KVM_CHECK_EXTENSION on a kvm VM handle indicates that this capability is
available, it means that the VM is using full hardware assisted virtualization
capabilities of the hardware. This is useful to check after creating a VM with
KVM_VM_MIPS_DEFAULT.

The value returned by KVM_CHECK_EXTENSION should be compared against known
values (see below). All other values are reserved. This is to allow for the
possibility of other hardware assisted virtualization implementations which
may be incompatible with the MIPS VZ ASE.

==  ==========================================================================
 0  The trap & emulate implementation is in use to run guest code in user
    mode. Guest virtual memory segments are rearranged to fit the guest in the
    user mode address space.

 1  The MIPS VZ ASE is in use, providing full hardware assisted
    virtualization, including standard guest virtual memory segments.
==  ==========================================================================

8.6 KVM_CAP_MIPS_TE
-------------------

:Architectures: mips

This capability, if KVM_CHECK_EXTENSION on the main kvm handle indicates that
it is available, means that the trap & emulate implementation is available to
run guest code in user mode, even if KVM_CAP_MIPS_VZ indicates that hardware
assisted virtualisation is also available. KVM_VM_MIPS_TE (0) must be passed
to KVM_CREATE_VM to create a VM which utilises it.

If KVM_CHECK_EXTENSION on a kvm VM handle indicates that this capability is
available, it means that the VM is using trap & emulate.

8.7 KVM_CAP_MIPS_64BIT
----------------------

:Architectures: mips

This capability indicates the supported architecture type of the guest, i.e. the
supported register and address width.

The values returned when this capability is checked by KVM_CHECK_EXTENSION on a
kvm VM handle correspond roughly to the CP0_Config.AT register field, and should
be checked specifically against known values (see below). All other values are
reserved.

==  ========================================================================
 0  MIPS32 or microMIPS32.
    Both registers and addresses are 32-bits wide.
    It will only be possible to run 32-bit guest code.

 1  MIPS64 or microMIPS64 with access only to 32-bit compatibility segments.
    Registers are 64-bits wide, but addresses are 32-bits wide.
    64-bit guest code may run but cannot access MIPS64 memory segments.
    It will also be possible to run 32-bit guest code.

 2  MIPS64 or microMIPS64 with access to all address segments.
    Both registers and addresses are 64-bits wide.
    It will be possible to run 64-bit or 32-bit guest code.
==  ========================================================================

8.9 KVM_CAP_ARM_USER_IRQ
------------------------

:Architectures: arm, arm64

This capability, if KVM_CHECK_EXTENSION indicates that it is available, means
that if userspace creates a VM without an in-kernel interrupt controller, it
will be notified of changes to the output level of in-kernel emulated devices,
which can generate virtual interrupts, presented to the VM.
For such VMs, on every return to userspace, the kernel
updates the vcpu's run->s.regs.device_irq_level field to represent the actual
output level of the device.

Whenever kvm detects a change in the device output level, kvm guarantees at
least one return to userspace before running the VM.  This exit could either
be a KVM_EXIT_INTR or any other exit event, like KVM_EXIT_MMIO. This way,
userspace can always sample the device output level and re-compute the state of
the userspace interrupt controller.  Userspace should always check the state
of run->s.regs.device_irq_level on every kvm exit.
The value in run->s.regs.device_irq_level can represent both level and edge
triggered interrupt signals, depending on the device.  Edge triggered interrupt
signals will exit to userspace with the bit in run->s.regs.device_irq_level
set exactly once per edge signal.

The field run->s.regs.device_irq_level is available independent of
run->kvm_valid_regs or run->kvm_dirty_regs bits.

If KVM_CAP_ARM_USER_IRQ is supported, the KVM_CHECK_EXTENSION ioctl returns a
number larger than 0 indicating the version of this capability is implemented
and thereby which bits in run->s.regs.device_irq_level can signal values.

Currently the following bits are defined for the device_irq_level bitmap::

  KVM_CAP_ARM_USER_IRQ >= 1:

    KVM_ARM_DEV_EL1_VTIMER -  EL1 virtual timer
    KVM_ARM_DEV_EL1_PTIMER -  EL1 physical timer
    KVM_ARM_DEV_PMU        -  ARM PMU overflow interrupt signal

Future versions of kvm may implement additional events. These will get
indicated by returning a higher number from KVM_CHECK_EXTENSION and will be
listed above.

8.10 KVM_CAP_PPC_SMT_POSSIBLE
-----------------------------

:Architectures: ppc

Querying this capability returns a bitmap indicating the possible
virtual SMT modes that can be set using KVM_CAP_PPC_SMT.  If bit N
(counting from the right) is set, then a virtual SMT mode of 2^N is
available.

8.11 KVM_CAP_HYPERV_SYNIC2
--------------------------

:Architectures: x86

This capability enables a newer version of Hyper-V Synthetic interrupt
controller (SynIC).  The only difference with KVM_CAP_HYPERV_SYNIC is that KVM
doesn't clear SynIC message and event flags pages when they are enabled by
writing to the respective MSRs.

8.12 KVM_CAP_HYPERV_VP_INDEX
----------------------------

:Architectures: x86

This capability indicates that userspace can load HV_X64_MSR_VP_INDEX msr.  Its
value is used to denote the target vcpu for a SynIC interrupt.  For
compatibilty, KVM initializes this msr to KVM's internal vcpu index.  When this
capability is absent, userspace can still query this msr's value.

8.13 KVM_CAP_S390_AIS_MIGRATION
-------------------------------

:Architectures: s390
:Parameters: none

This capability indicates if the flic device will be able to get/set the
AIS states for migration via the KVM_DEV_FLIC_AISM_ALL attribute and allows
to discover this without having to create a flic device.

8.14 KVM_CAP_S390_PSW
---------------------

:Architectures: s390

This capability indicates that the PSW is exposed via the kvm_run structure.

8.15 KVM_CAP_S390_GMAP
----------------------

:Architectures: s390

This capability indicates that the user space memory used as guest mapping can
be anywhere in the user memory address space, as long as the memory slots are
aligned and sized to a segment (1MB) boundary.

8.16 KVM_CAP_S390_COW
---------------------

:Architectures: s390

This capability indicates that the user space memory used as guest mapping can
use copy-on-write semantics as well as dirty pages tracking via read-only page
tables.

8.17 KVM_CAP_S390_BPB
---------------------

:Architectures: s390

This capability indicates that kvm will implement the interfaces to handle
reset, migration and nested KVM for branch prediction blocking. The stfle
facility 82 should not be provided to the guest without this capability.

8.18 KVM_CAP_HYPERV_TLBFLUSH
----------------------------

:Architectures: x86

This capability indicates that KVM supports paravirtualized Hyper-V TLB Flush
hypercalls:
HvFlushVirtualAddressSpace, HvFlushVirtualAddressSpaceEx,
HvFlushVirtualAddressList, HvFlushVirtualAddressListEx.

8.19 KVM_CAP_ARM_INJECT_SERROR_ESR
----------------------------------

:Architectures: arm, arm64

This capability indicates that userspace can specify (via the
KVM_SET_VCPU_EVENTS ioctl) the syndrome value reported to the guest when it
takes a virtual SError interrupt exception.
If KVM advertises this capability, userspace can only specify the ISS field for
the ESR syndrome. Other parts of the ESR, such as the EC are generated by the
CPU when the exception is taken. If this virtual SError is taken to EL1 using
AArch64, this value will be reported in the ISS field of ESR_ELx.

See KVM_CAP_VCPU_EVENTS for more details.

8.20 KVM_CAP_HYPERV_SEND_IPI
----------------------------

:Architectures: x86

This capability indicates that KVM supports paravirtualized Hyper-V IPI send
hypercalls:
HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.

8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH
-----------------------------------

:Architectures: x86

This capability indicates that KVM running on top of Hyper-V hypervisor
enables Direct TLB flush for its guests meaning that TLB flush
hypercalls are handled by Level 0 hypervisor (Hyper-V) bypassing KVM.
Due to the different ABI for hypercall parameters between Hyper-V and
KVM, enabling this capability effectively disables all hypercall
handling by KVM (as some KVM hypercall may be mistakenly treated as TLB
flush hypercalls by Hyper-V) so userspace should disable KVM identification
in CPUID and only exposes Hyper-V identification. In this case, guest
thinks it's running on Hyper-V and only use Hyper-V hypercalls.

8.22 KVM_CAP_S390_VCPU_RESETS
-----------------------------

:Architectures: s390

This capability indicates that the KVM_S390_NORMAL_RESET and
KVM_S390_CLEAR_RESET ioctls are available.

8.23 KVM_CAP_S390_PROTECTED
---------------------------

:Architectures: s390

This capability indicates that the Ultravisor has been initialized and
KVM can therefore start protected VMs.
This capability governs the KVM_S390_PV_COMMAND ioctl and the
KVM_MP_STATE_LOAD MP_STATE. KVM_SET_MP_STATE can fail for protected
guests when the state change is invalid.

8.24 KVM_CAP_STEAL_TIME
-----------------------

:Architectures: arm64, x86

This capability indicates that KVM supports steal time accounting.
When steal time accounting is supported it may be enabled with
architecture-specific interfaces.  This capability and the architecture-
specific interfaces must be consistent, i.e. if one says the feature
is supported, than the other should as well and vice versa.  For arm64
see Documentation/virt/kvm/devices/vcpu.rst "KVM_ARM_VCPU_PVTIME_CTRL".
For x86 see Documentation/virt/kvm/msr.rst "MSR_KVM_STEAL_TIME".

8.25 KVM_CAP_S390_DIAG318
-------------------------

:Architectures: s390

This capability enables a guest to set information about its control program
(i.e. guest kernel type and version). The information is helpful during
system/firmware service events, providing additional data about the guest
environments running on the machine.

The information is associated with the DIAGNOSE 0x318 instruction, which sets
an 8-byte value consisting of a one-byte Control Program Name Code (CPNC) and
a 7-byte Control Program Version Code (CPVC). The CPNC determines what
environment the control program is running in (e.g. Linux, z/VM...), and the
CPVC is used for information specific to OS (e.g. Linux version, Linux
distribution...)

If this capability is available, then the CPNC and CPVC can be synchronized
between KVM and userspace via the sync regs mechanism (KVM_SYNC_DIAG318).

8.26 KVM_CAP_X86_USER_SPACE_MSR
-------------------------------

:Architectures: x86

This capability indicates that KVM supports deflection of MSR reads and
writes to user space. It can be enabled on a VM level. If enabled, MSR
accesses that would usually trigger a #GP by KVM into the guest will
instead get bounced to user space through the KVM_EXIT_X86_RDMSR and
KVM_EXIT_X86_WRMSR exit notifications.

8.27 KVM_CAP_X86_MSR_FILTER
---------------------------

:Architectures: x86

This capability indicates that KVM supports that accesses to user defined MSRs
may be rejected. With this capability exposed, KVM exports new VM ioctl
KVM_X86_SET_MSR_FILTER which user space can call to specify bitmaps of MSR
ranges that KVM should reject access to.

In combination with KVM_CAP_X86_USER_SPACE_MSR, this allows user space to
trap and emulate MSRs that are outside of the scope of KVM as well as
limit the attack surface on KVM's MSR emulation code.

8.28 KVM_CAP_ENFORCE_PV_FEATURE_CPUID
-----------------------------

Architectures: x86

When enabled, KVM will disable paravirtual features provided to the
guest according to the bits in the KVM_CPUID_FEATURES CPUID leaf
(0x40000001). Otherwise, a guest may use the paravirtual features
regardless of what has actually been exposed through the CPUID leaf.

8.29 KVM_CAP_DIRTY_LOG_RING
---------------------------

:Architectures: x86
:Parameters: args[0] - size of the dirty log ring

KVM is capable of tracking dirty memory using ring buffers that are
mmaped into userspace; there is one dirty ring per vcpu.

The dirty ring is available to userspace as an array of
``struct kvm_dirty_gfn``.  Each dirty entry it's defined as::

  struct kvm_dirty_gfn {
          __u32 flags;
          __u32 slot; /* as_id | slot_id */
          __u64 offset;
  };

The following values are defined for the flags field to define the
current state of the entry::

  #define KVM_DIRTY_GFN_F_DIRTY           BIT(0)
  #define KVM_DIRTY_GFN_F_RESET           BIT(1)
  #define KVM_DIRTY_GFN_F_MASK            0x3

Userspace should call KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM
ioctl to enable this capability for the new guest and set the size of
the rings.  Enabling the capability is only allowed before creating any
vCPU, and the size of the ring must be a power of two.  The larger the
ring buffer, the less likely the ring is full and the VM is forced to
exit to userspace. The optimal size depends on the workload, but it is
recommended that it be at least 64 KiB (4096 entries).

Just like for dirty page bitmaps, the buffer tracks writes to
all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was
set in KVM_SET_USER_MEMORY_REGION.  Once a memory region is registered
with the flag set, userspace can start harvesting dirty pages from the
ring buffer.

An entry in the ring buffer can be unused (flag bits ``00``),
dirty (flag bits ``01``) or harvested (flag bits ``1X``).  The
state machine for the entry is as follows::

          dirtied         harvested        reset
     00 -----------> 01 -------------> 1X -------+
      ^                                          |
      |                                          |
      +------------------------------------------+

To harvest the dirty pages, userspace accesses the mmaped ring buffer
to read the dirty GFNs.  If the flags has the DIRTY bit set (at this stage
the RESET bit must be cleared), then it means this GFN is a dirty GFN.
The userspace should harvest this GFN and mark the flags from state
``01b`` to ``1Xb`` (bit 0 will be ignored by KVM, but bit 1 must be set
to show that this GFN is harvested and waiting for a reset), and move
on to the next GFN.  The userspace should continue to do this until the
flags of a GFN have the DIRTY bit cleared, meaning that it has harvested
all the dirty GFNs that were available.

It's not necessary for userspace to harvest the all dirty GFNs at once.
However it must collect the dirty GFNs in sequence, i.e., the userspace
program cannot skip one dirty GFN to collect the one next to it.

After processing one or more entries in the ring buffer, userspace
calls the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about
it, so that the kernel will reprotect those collected GFNs.
Therefore, the ioctl must be called *before* reading the content of
the dirty pages.

The dirty ring can get full.  When it happens, the KVM_RUN of the
vcpu will return with exit reason KVM_EXIT_DIRTY_LOG_FULL.

The dirty ring interface has a major difference comparing to the
KVM_GET_DIRTY_LOG interface in that, when reading the dirty ring from
userspace, it's still possible that the kernel has not yet flushed the
processor's dirty page buffers into the kernel buffer (with dirty bitmaps, the
flushing is done by the KVM_GET_DIRTY_LOG ioctl).  To achieve that, one
needs to kick the vcpu out of KVM_RUN using a signal.  The resulting
vmexit ensures that all dirty GFNs are flushed to the dirty rings.

NOTE: the capability KVM_CAP_DIRTY_LOG_RING and the corresponding
ioctl KVM_RESET_DIRTY_RINGS are mutual exclusive to the existing ioctls
KVM_GET_DIRTY_LOG and KVM_CLEAR_DIRTY_LOG.  After enabling
KVM_CAP_DIRTY_LOG_RING with an acceptable dirty ring size, the virtual
machine will switch to ring-buffer dirty page tracking and further
KVM_GET_DIRTY_LOG or KVM_CLEAR_DIRTY_LOG ioctls will fail.

8.30 KVM_CAP_XEN_HVM
--------------------

:Architectures: x86

This capability indicates the features that Xen supports for hosting Xen
PVHVM guests. Valid flags are::

  #define KVM_XEN_HVM_CONFIG_HYPERCALL_MSR	(1 << 0)
  #define KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL	(1 << 1)
  #define KVM_XEN_HVM_CONFIG_SHARED_INFO	(1 << 2)
  #define KVM_XEN_HVM_CONFIG_RUNSTATE		(1 << 2)

The KVM_XEN_HVM_CONFIG_HYPERCALL_MSR flag indicates that the KVM_XEN_HVM_CONFIG
ioctl is available, for the guest to set its hypercall page.

If KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL is also set, the same flag may also be
provided in the flags to KVM_XEN_HVM_CONFIG, without providing hypercall page
contents, to request that KVM generate hypercall page content automatically
and also enable interception of guest hypercalls with KVM_EXIT_XEN.

The KVM_XEN_HVM_CONFIG_SHARED_INFO flag indicates the availability of the
KVM_XEN_HVM_SET_ATTR, KVM_XEN_HVM_GET_ATTR, KVM_XEN_VCPU_SET_ATTR and
KVM_XEN_VCPU_GET_ATTR ioctls, as well as the delivery of exception vectors
for event channel upcalls when the evtchn_upcall_pending field of a vcpu's
vcpu_info is set.

The KVM_XEN_HVM_CONFIG_RUNSTATE flag indicates that the runstate-related
features KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADDR/_CURRENT/_DATA/_ADJUST are
supported by the KVM_XEN_VCPU_SET_ATTR/KVM_XEN_VCPU_GET_ATTR ioctls.

8.31 KVM_CAP_PPC_MULTITCE
-------------------------

:Capability: KVM_CAP_PPC_MULTITCE
:Architectures: ppc
:Type: vm

This capability means the kernel is capable of handling hypercalls
H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
space. This significantly accelerates DMA operations for PPC KVM guests.
User space should expect that its handlers for these hypercalls
are not going to be called if user space previously registered LIOBN
in KVM (via KVM_CREATE_SPAPR_TCE or similar calls).

In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
user space might have to advertise it for the guest. For example,
IBM pSeries (sPAPR) guest starts using them if "hcall-multi-tce" is
present in the "ibm,hypertas-functions" device-tree property.

The hypercalls mentioned above may or may not be processed successfully
in the kernel based fast path. If they can not be handled by the kernel,
they will get passed on to user space. So user space still has to have
an implementation for these despite the in kernel acceleration.

This capability is always enabled.

8.32 KVM_CAP_PTP_KVM
--------------------

:Architectures: arm64

This capability indicates that the KVM virtual PTP service is
supported in the host. A VMM can check whether the service is
available to the guest on migration.

8.33 KVM_CAP_HYPERV_ENFORCE_CPUID
---------------------------------

Architectures: x86

When enabled, KVM will disable emulated Hyper-V features provided to the
guest according to the bits Hyper-V CPUID feature leaves. Otherwise, all
currently implmented Hyper-V features are provided unconditionally when
Hyper-V identification is set in the HYPERV_CPUID_INTERFACE (0x40000001)
leaf.

8.34 KVM_CAP_EXIT_HYPERCALL
---------------------------

:Capability: KVM_CAP_EXIT_HYPERCALL
:Architectures: x86
:Type: vm

This capability, if enabled, will cause KVM to exit to userspace
with KVM_EXIT_HYPERCALL exit reason to process some hypercalls.

Calling KVM_CHECK_EXTENSION for this capability will return a bitmask
of hypercalls that can be configured to exit to userspace.
Right now, the only such hypercall is KVM_HC_MAP_GPA_RANGE.

The argument to KVM_ENABLE_CAP is also a bitmask, and must be a subset
of the result of KVM_CHECK_EXTENSION.  KVM will forward to userspace
the hypercalls whose corresponding bit is in the argument, and return
ENOSYS for the others.