api.rst

 - KVM_SYSTEM_EVENT_SHUTDOWN -- the guest has requested a shutdown of the
   VM. Userspace is not obliged to honour this, and if it does honour
   this does not need to destroy the VM synchronously (ie it may call
   KVM_RUN again before shutdown finally occurs).
 - KVM_SYSTEM_EVENT_RESET -- the guest has requested a reset of the VM.
   As with SHUTDOWN, userspace can choose to ignore the request, or
   to schedule the reset to occur in the future and may call KVM_RUN again.
 - KVM_SYSTEM_EVENT_CRASH -- the guest crash occurred and the guest
   has requested a crash condition maintenance. Userspace can choose
   to ignore the request, or to gather VM memory core dump and/or
   reset/shutdown of the VM.

Valid flags are:

 - KVM_SYSTEM_EVENT_RESET_FLAG_PSCI_RESET2 (arm64 only) -- the guest issued
   a SYSTEM_RESET2 call according to v1.1 of the PSCI specification.

::

		/* KVM_EXIT_IOAPIC_EOI */
		struct {
			__u8 vector;
		} eoi;

Indicates that the VCPU's in-kernel local APIC received an EOI for a
level-triggered IOAPIC interrupt.  This exit only triggers when the
IOAPIC is implemented in userspace (i.e. KVM_CAP_SPLIT_IRQCHIP is enabled);
the userspace IOAPIC should process the EOI and retrigger the interrupt if
it is still asserted.  Vector is the LAPIC interrupt vector for which the
EOI was received.

::

		struct kvm_hyperv_exit {
  #define KVM_EXIT_HYPERV_SYNIC          1
  #define KVM_EXIT_HYPERV_HCALL          2
  #define KVM_EXIT_HYPERV_SYNDBG         3
			__u32 type;
			__u32 pad1;
			union {
				struct {
					__u32 msr;
					__u32 pad2;
					__u64 control;
					__u64 evt_page;
					__u64 msg_page;
				} synic;
				struct {
					__u64 input;
					__u64 result;
					__u64 params[2];
				} hcall;
				struct {
					__u32 msr;
					__u32 pad2;
					__u64 control;
					__u64 status;
					__u64 send_page;
					__u64 recv_page;
					__u64 pending_page;
				} syndbg;
			} u;
		};
		/* KVM_EXIT_HYPERV */
                struct kvm_hyperv_exit hyperv;

Indicates that the VCPU exits into userspace to process some tasks
related to Hyper-V emulation.

Valid values for 'type' are:

	- KVM_EXIT_HYPERV_SYNIC -- synchronously notify user-space about

Hyper-V SynIC state change. Notification is used to remap SynIC
event/message pages and to enable/disable SynIC messages/events processing
in userspace.

	- KVM_EXIT_HYPERV_SYNDBG -- synchronously notify user-space about

Hyper-V Synthetic debugger state change. Notification is used to either update
the pending_page location or to send a control command (send the buffer located
in send_page or recv a buffer to recv_page).

::

		/* KVM_EXIT_ARM_NISV */
		struct {
			__u64 esr_iss;
			__u64 fault_ipa;
		} arm_nisv;

Used on arm64 systems. If a guest accesses memory not in a memslot,
KVM will typically return to userspace and ask it to do MMIO emulation on its
behalf. However, for certain classes of instructions, no instruction decode
(direction, length of memory access) is provided, and fetching and decoding
the instruction from the VM is overly complicated to live in the kernel.

Historically, when this situation occurred, KVM would print a warning and kill
the VM. KVM assumed that if the guest accessed non-memslot memory, it was
trying to do I/O, which just couldn't be emulated, and the warning message was
phrased accordingly. However, what happened more often was that a guest bug
caused access outside the guest memory areas which should lead to a more
meaningful warning message and an external abort in the guest, if the access
did not fall within an I/O window.

Userspace implementations can query for KVM_CAP_ARM_NISV_TO_USER, and enable
this capability at VM creation. Once this is done, these types of errors will
instead return to userspace with KVM_EXIT_ARM_NISV, with the valid bits from
the ESR_EL2 in the esr_iss field, and the faulting IPA in the fault_ipa field.
Userspace can either fix up the access if it's actually an I/O access by
decoding the instruction from guest memory (if it's very brave) and continue
executing the guest, or it can decide to suspend, dump, or restart the guest.

Note that KVM does not skip the faulting instruction as it does for
KVM_EXIT_MMIO, but userspace has to emulate any change to the processing state
if it decides to decode and emulate the instruction.

::

		/* KVM_EXIT_X86_RDMSR / KVM_EXIT_X86_WRMSR */
		struct {
			__u8 error; /* user -> kernel */
			__u8 pad[7];
			__u32 reason; /* kernel -> user */
			__u32 index; /* kernel -> user */
			__u64 data; /* kernel <-> user */
		} msr;

Used on x86 systems. When the VM capability KVM_CAP_X86_USER_SPACE_MSR is
enabled, MSR accesses to registers that would invoke a #GP by KVM kernel code
will instead trigger a KVM_EXIT_X86_RDMSR exit for reads and KVM_EXIT_X86_WRMSR
exit for writes.

The "reason" field specifies why the MSR trap occurred. User space will only
receive MSR exit traps when a particular reason was requested during through
ENABLE_CAP. Currently valid exit reasons are:

	KVM_MSR_EXIT_REASON_UNKNOWN - access to MSR that is unknown to KVM
	KVM_MSR_EXIT_REASON_INVAL - access to invalid MSRs or reserved bits
	KVM_MSR_EXIT_REASON_FILTER - access blocked by KVM_X86_SET_MSR_FILTER

For KVM_EXIT_X86_RDMSR, the "index" field tells user space which MSR the guest
wants to read. To respond to this request with a successful read, user space
writes the respective data into the "data" field and must continue guest
execution to ensure the read data is transferred into guest register state.

If the RDMSR request was unsuccessful, user space indicates that with a "1" in
the "error" field. This will inject a #GP into the guest when the VCPU is
executed again.

For KVM_EXIT_X86_WRMSR, the "index" field tells user space which MSR the guest
wants to write. Once finished processing the event, user space must continue
vCPU execution. If the MSR write was unsuccessful, user space also sets the
"error" field to "1".

::


		struct kvm_xen_exit {
  #define KVM_EXIT_XEN_HCALL          1
			__u32 type;
			union {
				struct {
					__u32 longmode;
					__u32 cpl;
					__u64 input;
					__u64 result;
					__u64 params[6];
				} hcall;
			} u;
		};
		/* KVM_EXIT_XEN */
                struct kvm_hyperv_exit xen;

Indicates that the VCPU exits into userspace to process some tasks
related to Xen emulation.

Valid values for 'type' are:

  - KVM_EXIT_XEN_HCALL -- synchronously notify user-space about Xen hypercall.
    Userspace is expected to place the hypercall result into the appropriate
    field before invoking KVM_RUN again.

::

		/* KVM_EXIT_RISCV_SBI */
		struct {
			unsigned long extension_id;
			unsigned long function_id;
			unsigned long args[6];
			unsigned long ret[2];
		} riscv_sbi;
If exit reason is KVM_EXIT_RISCV_SBI then it indicates that the VCPU has
done a SBI call which is not handled by KVM RISC-V kernel module. The details
of the SBI call are available in 'riscv_sbi' member of kvm_run structure. The
'extension_id' field of 'riscv_sbi' represents SBI extension ID whereas the
'function_id' field represents function ID of given SBI extension. The 'args'
array field of 'riscv_sbi' represents parameters for the SBI call and 'ret'
array field represents return values. The userspace should update the return
values of SBI call before resuming the VCPU. For more details on RISC-V SBI
spec refer, https://github.com/riscv/riscv-sbi-doc.

::

		/* Fix the size of the union. */
		char padding[256];
	};

	/*
	 * shared registers between kvm and userspace.
	 * kvm_valid_regs specifies the register classes set by the host
	 * kvm_dirty_regs specified the register classes dirtied by userspace
	 * struct kvm_sync_regs is architecture specific, as well as the
	 * bits for kvm_valid_regs and kvm_dirty_regs
	 */
	__u64 kvm_valid_regs;
	__u64 kvm_dirty_regs;
	union {
		struct kvm_sync_regs regs;
		char padding[SYNC_REGS_SIZE_BYTES];
	} s;

If KVM_CAP_SYNC_REGS is defined, these fields allow userspace to access
certain guest registers without having to call SET/GET_*REGS. Thus we can
avoid some system call overhead if userspace has to handle the exit.
Userspace can query the validity of the structure by checking
kvm_valid_regs for specific bits. These bits are architecture specific
and usually define the validity of a groups of registers. (e.g. one bit
for general purpose registers)

Please note that the kernel is allowed to use the kvm_run structure as the
primary storage for certain register types. Therefore, the kernel may use the
values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.

::

  };


6. Capabilities that can be enabled on vCPUs
============================================

There are certain capabilities that change the behavior of the virtual CPU or
the virtual machine when enabled. To enable them, please see section 4.37.
Below you can find a list of capabilities and what their effect on the vCPU or
the virtual machine is when enabling them.

The following information is provided along with the description:

  Architectures:
      which instruction set architectures provide this ioctl.
      x86 includes both i386 and x86_64.

  Target:
      whether this is a per-vcpu or per-vm capability.

  Parameters:
      what parameters are accepted by the capability.

  Returns:
      the return value.  General error numbers (EBADF, ENOMEM, EINVAL)
      are not detailed, but errors with specific meanings are.


6.1 KVM_CAP_PPC_OSI
-------------------

:Architectures: ppc
:Target: vcpu
:Parameters: none
:Returns: 0 on success; -1 on error

This capability enables interception of OSI hypercalls that otherwise would
be treated as normal system calls to be injected into the guest. OSI hypercalls
were invented by Mac-on-Linux to have a standardized communication mechanism
between the guest and the host.

When this capability is enabled, KVM_EXIT_OSI can occur.


6.2 KVM_CAP_PPC_PAPR
--------------------

:Architectures: ppc
:Target: vcpu
:Parameters: none
:Returns: 0 on success; -1 on error

This capability enables interception of PAPR hypercalls. PAPR hypercalls are
done using the hypercall instruction "sc 1".

It also sets the guest privilege level to "supervisor" mode. Usually the guest
runs in "hypervisor" privilege mode with a few missing features.

In addition to the above, it changes the semantics of SDR1. In this mode, the
HTAB address part of SDR1 contains an HVA instead of a GPA, as PAPR keeps the
HTAB invisible to the guest.

When this capability is enabled, KVM_EXIT_PAPR_HCALL can occur.


6.3 KVM_CAP_SW_TLB
------------------

:Architectures: ppc
:Target: vcpu
:Parameters: args[0] is the address of a struct kvm_config_tlb
:Returns: 0 on success; -1 on error

::

  struct kvm_config_tlb {
	__u64 params;
	__u64 array;
	__u32 mmu_type;
	__u32 array_len;
  };

Configures the virtual CPU's TLB array, establishing a shared memory area
between userspace and KVM.  The "params" and "array" fields are userspace
addresses of mmu-type-specific data structures.  The "array_len" field is an
safety mechanism, and should be set to the size in bytes of the memory that
userspace has reserved for the array.  It must be at least the size dictated
by "mmu_type" and "params".

While KVM_RUN is active, the shared region is under control of KVM.  Its
contents are undefined, and any modification by userspace results in
boundedly undefined behavior.

On return from KVM_RUN, the shared region will reflect the current state of
the guest's TLB.  If userspace makes any changes, it must call KVM_DIRTY_TLB
to tell KVM which entries have been changed, prior to calling KVM_RUN again
on this vcpu.

For mmu types KVM_MMU_FSL_BOOKE_NOHV and KVM_MMU_FSL_BOOKE_HV:

 - The "params" field is of type "struct kvm_book3e_206_tlb_params".
 - The "array" field points to an array of type "struct
   kvm_book3e_206_tlb_entry".
 - The array consists of all entries in the first TLB, followed by all
   entries in the second TLB.
 - Within a TLB, entries are ordered first by increasing set number.  Within a
   set, entries are ordered by way (increasing ESEL).
 - The hash for determining set number in TLB0 is: (MAS2 >> 12) & (num_sets - 1)
   where "num_sets" is the tlb_sizes[] value divided by the tlb_ways[] value.
 - The tsize field of mas1 shall be set to 4K on TLB0, even though the
   hardware ignores this value for TLB0.

6.4 KVM_CAP_S390_CSS_SUPPORT
----------------------------

:Architectures: s390
:Target: vcpu
:Parameters: none
:Returns: 0 on success; -1 on error

This capability enables support for handling of channel I/O instructions.

TEST PENDING INTERRUPTION and the interrupt portion of TEST SUBCHANNEL are
handled in-kernel, while the other I/O instructions are passed to userspace.

When this capability is enabled, KVM_EXIT_S390_TSCH will occur on TEST
SUBCHANNEL intercepts.

Note that even though this capability is enabled per-vcpu, the complete
virtual machine is affected.

6.5 KVM_CAP_PPC_EPR
-------------------

:Architectures: ppc
:Target: vcpu
:Parameters: args[0] defines whether the proxy facility is active
:Returns: 0 on success; -1 on error

This capability enables or disables the delivery of interrupts through the
external proxy facility.

When enabled (args[0] != 0), every time the guest gets an external interrupt
delivered, it automatically exits into user space with a KVM_EXIT_EPR exit
to receive the topmost interrupt vector.

When disabled (args[0] == 0), behavior is as if this facility is unsupported.

When this capability is enabled, KVM_EXIT_EPR can occur.

6.6 KVM_CAP_IRQ_MPIC
--------------------

:Architectures: ppc
:Parameters: args[0] is the MPIC device fd;
             args[1] is the MPIC CPU number for this vcpu

This capability connects the vcpu to an in-kernel MPIC device.

6.7 KVM_CAP_IRQ_XICS
--------------------

:Architectures: ppc
:Target: vcpu
:Parameters: args[0] is the XICS device fd;
             args[1] is the XICS CPU number (server ID) for this vcpu

This capability connects the vcpu to an in-kernel XICS device.

6.8 KVM_CAP_S390_IRQCHIP
------------------------

:Architectures: s390
:Target: vm
:Parameters: none

This capability enables the in-kernel irqchip for s390. Please refer to
"4.24 KVM_CREATE_IRQCHIP" for details.

6.9 KVM_CAP_MIPS_FPU
--------------------

:Architectures: mips
:Target: vcpu
:Parameters: args[0] is reserved for future use (should be 0).

This capability allows the use of the host Floating Point Unit by the guest. It
allows the Config1.FP bit to be set to enable the FPU in the guest. Once this is
done the ``KVM_REG_MIPS_FPR_*`` and ``KVM_REG_MIPS_FCR_*`` registers can be
accessed (depending on the current guest FPU register mode), and the Status.FR,
Config5.FRE bits are accessible via the KVM API and also from the guest,
depending on them being supported by the FPU.

6.10 KVM_CAP_MIPS_MSA
---------------------

:Architectures: mips
:Target: vcpu
:Parameters: args[0] is reserved for future use (should be 0).

This capability allows the use of the MIPS SIMD Architecture (MSA) by the guest.
It allows the Config3.MSAP bit to be set to enable the use of MSA by the guest.
Once this is done the ``KVM_REG_MIPS_VEC_*`` and ``KVM_REG_MIPS_MSA_*``
registers can be accessed, and the Config5.MSAEn bit is accessible via the
KVM API and also from the guest.

6.74 KVM_CAP_SYNC_REGS
----------------------

:Architectures: s390, x86
:Target: s390: always enabled, x86: vcpu
:Parameters: none
:Returns: x86: KVM_CHECK_EXTENSION returns a bit-array indicating which register
          sets are supported
          (bitfields defined in arch/x86/include/uapi/asm/kvm.h).

As described above in the kvm_sync_regs struct info in section 5 (kvm_run):
KVM_CAP_SYNC_REGS "allow[s] userspace to access certain guest registers
without having to call SET/GET_*REGS". This reduces overhead by eliminating
repeated ioctl calls for setting and/or getting register values. This is
particularly important when userspace is making synchronous guest state
modifications, e.g. when emulating and/or intercepting instructions in
userspace.

For s390 specifics, please refer to the source code.

For x86:

- the register sets to be copied out to kvm_run are selectable
  by userspace (rather that all sets being copied out for every exit).
- vcpu_events are available in addition to regs and sregs.

For x86, the 'kvm_valid_regs' field of struct kvm_run is overloaded to
function as an input bit-array field set by userspace to indicate the
specific register sets to be copied out on the next exit.

To indicate when userspace has modified values that should be copied into
the vCPU, the all architecture bitarray field, 'kvm_dirty_regs' must be set.
This is done using the same bitflags as for the 'kvm_valid_regs' field.
If the dirty bit is not set, then the register set values will not be copied
into the vCPU even if they've been modified.

Unused bitfields in the bitarrays must be set to zero.

::

  struct kvm_sync_regs {
        struct kvm_regs regs;
        struct kvm_sregs sregs;
        struct kvm_vcpu_events events;
  };

6.75 KVM_CAP_PPC_IRQ_XIVE
-------------------------

:Architectures: ppc
:Target: vcpu
:Parameters: args[0] is the XIVE device fd;
             args[1] is the XIVE CPU number (server ID) for this vcpu

This capability connects the vcpu to an in-kernel XIVE device.

7. Capabilities that can be enabled on VMs
==========================================

There are certain capabilities that change the behavior of the virtual
machine when enabled. To enable them, please see section 4.37. Below
you can find a list of capabilities and what their effect on the VM
is when enabling them.

The following information is provided along with the description:

  Architectures:
      which instruction set architectures provide this ioctl.
      x86 includes both i386 and x86_64.

  Parameters:
      what parameters are accepted by the capability.

  Returns:
      the return value.  General error numbers (EBADF, ENOMEM, EINVAL)
      are not detailed, but errors with specific meanings are.


7.1 KVM_CAP_PPC_ENABLE_HCALL
----------------------------

:Architectures: ppc
:Parameters: args[0] is the sPAPR hcall number;
	     args[1] is 0 to disable, 1 to enable in-kernel handling

This capability controls whether individual sPAPR hypercalls (hcalls)
get handled by the kernel or not.  Enabling or disabling in-kernel
handling of an hcall is effective across the VM.  On creation, an
initial set of hcalls are enabled for in-kernel handling, which
consists of those hcalls for which in-kernel handlers were implemented
before this capability was implemented.  If disabled, the kernel will
not to attempt to handle the hcall, but will always exit to userspace
to handle it.  Note that it may not make sense to enable some and
disable others of a group of related hcalls, but KVM does not prevent
userspace from doing that.

If the hcall number specified is not one that has an in-kernel
implementation, the KVM_ENABLE_CAP ioctl will fail with an EINVAL
error.

7.2 KVM_CAP_S390_USER_SIGP
--------------------------

:Architectures: s390
:Parameters: none

This capability controls which SIGP orders will be handled completely in user
space. With this capability enabled, all fast orders will be handled completely
in the kernel:

- SENSE
- SENSE RUNNING
- EXTERNAL CALL
- EMERGENCY SIGNAL
- CONDITIONAL EMERGENCY SIGNAL

All other orders will be handled completely in user space.

Only privileged operation exceptions will be checked for in the kernel (or even
in the hardware prior to interception). If this capability is not enabled, the
old way of handling SIGP orders is used (partially in kernel and user space).

7.3 KVM_CAP_S390_VECTOR_REGISTERS
---------------------------------

:Architectures: s390
:Parameters: none
:Returns: 0 on success, negative value on error

Allows use of the vector registers introduced with z13 processor, and
provides for the synchronization between host and user space.  Will
return -EINVAL if the machine does not support vectors.

7.4 KVM_CAP_S390_USER_STSI
--------------------------

:Architectures: s390
:Parameters: none

This capability allows post-handlers for the STSI instruction. After
initial handling in the kernel, KVM exits to user space with
KVM_EXIT_S390_STSI to allow user space to insert further data.

Before exiting to userspace, kvm handlers should fill in s390_stsi field of
vcpu->run::

  struct {
	__u64 addr;
	__u8 ar;
	__u8 reserved;
	__u8 fc;
	__u8 sel1;
	__u16 sel2;
  } s390_stsi;

  @addr - guest address of STSI SYSIB
  @fc   - function code
  @sel1 - selector 1
  @sel2 - selector 2
  @ar   - access register number

KVM handlers should exit to userspace with rc = -EREMOTE.

7.5 KVM_CAP_SPLIT_IRQCHIP
-------------------------

:Architectures: x86
:Parameters: args[0] - number of routes reserved for userspace IOAPICs
:Returns: 0 on success, -1 on error

Create a local apic for each processor in the kernel. This can be used
instead of KVM_CREATE_IRQCHIP if the userspace VMM wishes to emulate the
IOAPIC and PIC (and also the PIT, even though this has to be enabled
separately).

This capability also enables in kernel routing of interrupt requests;
when KVM_CAP_SPLIT_IRQCHIP only routes of KVM_IRQ_ROUTING_MSI type are
used in the IRQ routing table.  The first args[0] MSI routes are reserved
for the IOAPIC pins.  Whenever the LAPIC receives an EOI for these routes,
a KVM_EXIT_IOAPIC_EOI vmexit will be reported to userspace.

Fails if VCPU has already been created, or if the irqchip is already in the
kernel (i.e. KVM_CREATE_IRQCHIP has already been called).

7.6 KVM_CAP_S390_RI
-------------------

:Architectures: s390
:Parameters: none

Allows use of runtime-instrumentation introduced with zEC12 processor.
Will return -EINVAL if the machine does not support runtime-instrumentation.
Will return -EBUSY if a VCPU has already been created.

7.7 KVM_CAP_X2APIC_API
----------------------

:Architectures: x86
:Parameters: args[0] - features that should be enabled
:Returns: 0 on success, -EINVAL when args[0] contains invalid features

Valid feature flags in args[0] are::

  #define KVM_X2APIC_API_USE_32BIT_IDS            (1ULL << 0)
  #define KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK  (1ULL << 1)

Enabling KVM_X2APIC_API_USE_32BIT_IDS changes the behavior of
KVM_SET_GSI_ROUTING, KVM_SIGNAL_MSI, KVM_SET_LAPIC, and KVM_GET_LAPIC,
allowing the use of 32-bit APIC IDs.  See KVM_CAP_X2APIC_API in their
respective sections.

KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK must be enabled for x2APIC to work
in logical mode or with more than 255 VCPUs.  Otherwise, KVM treats 0xff
as a broadcast even in x2APIC mode in order to support physical x2APIC
without interrupt remapping.  This is undesirable in logical mode,
where 0xff represents CPUs 0-7 in cluster 0.

7.8 KVM_CAP_S390_USER_INSTR0
----------------------------

:Architectures: s390
:Parameters: none

With this capability enabled, all illegal instructions 0x0000 (2 bytes) will
be intercepted and forwarded to user space. User space can use this
mechanism e.g. to realize 2-byte software breakpoints. The kernel will
not inject an operating exception for these instructions, user space has
to take care of that.

This capability can be enabled dynamically even if VCPUs were already
created and are running.

7.9 KVM_CAP_S390_GS
-------------------

:Architectures: s390
:Parameters: none
:Returns: 0 on success; -EINVAL if the machine does not support
          guarded storage; -EBUSY if a VCPU has already been created.

Allows use of guarded storage for the KVM guest.

7.10 KVM_CAP_S390_AIS
---------------------

:Architectures: s390
:Parameters: none

Allow use of adapter-interruption suppression.
:Returns: 0 on success; -EBUSY if a VCPU has already been created.

7.11 KVM_CAP_PPC_SMT
--------------------

:Architectures: ppc
:Parameters: vsmt_mode, flags

Enabling this capability on a VM provides userspace with a way to set
the desired virtual SMT mode (i.e. the number of virtual CPUs per
virtual core).  The virtual SMT mode, vsmt_mode, must be a power of 2
between 1 and 8.  On POWER8, vsmt_mode must also be no greater than
the number of threads per subcore for the host.  Currently flags must
be 0.  A successful call to enable this capability will result in
vsmt_mode being returned when the KVM_CAP_PPC_SMT capability is
subsequently queried for the VM.  This capability is only supported by
HV KVM, and can only be set before any VCPUs have been created.
The KVM_CAP_PPC_SMT_POSSIBLE capability indicates which virtual SMT
modes are available.

7.12 KVM_CAP_PPC_FWNMI
----------------------

:Architectures: ppc
:Parameters: none

With this capability a machine check exception in the guest address
space will cause KVM to exit the guest with NMI exit reason. This
enables QEMU to build error log and branch to guest kernel registered
machine check handling routine. Without this capability KVM will
branch to guests' 0x200 interrupt vector.

7.13 KVM_CAP_X86_DISABLE_EXITS
------------------------------

:Architectures: x86
:Parameters: args[0] defines which exits are disabled
:Returns: 0 on success, -EINVAL when args[0] contains invalid exits

Valid bits in args[0] are::

  #define KVM_X86_DISABLE_EXITS_MWAIT            (1 << 0)
  #define KVM_X86_DISABLE_EXITS_HLT              (1 << 1)
  #define KVM_X86_DISABLE_EXITS_PAUSE            (1 << 2)
  #define KVM_X86_DISABLE_EXITS_CSTATE           (1 << 3)

Enabling this capability on a VM provides userspace with a way to no
longer intercept some instructions for improved latency in some
workloads, and is suggested when vCPUs are associated to dedicated
physical CPUs.  More bits can be added in the future; userspace can
just pass the KVM_CHECK_EXTENSION result to KVM_ENABLE_CAP to disable
all such vmexits.

Do not enable KVM_FEATURE_PV_UNHALT if you disable HLT exits.

7.14 KVM_CAP_S390_HPAGE_1M
--------------------------

:Architectures: s390
:Parameters: none
:Returns: 0 on success, -EINVAL if hpage module parameter was not set
	  or cmma is enabled, or the VM has the KVM_VM_S390_UCONTROL
	  flag set

With this capability the KVM support for memory backing with 1m pages
through hugetlbfs can be enabled for a VM. After the capability is
enabled, cmma can't be enabled anymore and pfmfi and the storage key
interpretation are disabled. If cmma has already been enabled or the
hpage module parameter is not set to 1, -EINVAL is returned.

While it is generally possible to create a huge page backed VM without
this capability, the VM will not be able to run.

7.15 KVM_CAP_MSR_PLATFORM_INFO
------------------------------

:Architectures: x86
:Parameters: args[0] whether feature should be enabled or not

With this capability, a guest may read the MSR_PLATFORM_INFO MSR. Otherwise,
a #GP would be raised when the guest tries to access. Currently, this
capability does not enable write permissions of this MSR for the guest.

7.16 KVM_CAP_PPC_NESTED_HV
--------------------------

:Architectures: ppc
:Parameters: none
:Returns: 0 on success, -EINVAL when the implementation doesn't support
	  nested-HV virtualization.

HV-KVM on POWER9 and later systems allows for "nested-HV"
virtualization, which provides a way for a guest VM to run guests that
can run using the CPU's supervisor mode (privileged non-hypervisor
state).  Enabling this capability on a VM depends on the CPU having
the necessary functionality and on the facility being enabled with a
kvm-hv module parameter.

7.17 KVM_CAP_EXCEPTION_PAYLOAD
------------------------------

:Architectures: x86
:Parameters: args[0] whether feature should be enabled or not

With this capability enabled, CR2 will not be modified prior to the
emulated VM-exit when L1 intercepts a #PF exception that occurs in
L2. Similarly, for kvm-intel only, DR6 will not be modified prior to
the emulated VM-exit when L1 intercepts a #DB exception that occurs in
L2. As a result, when KVM_GET_VCPU_EVENTS reports a pending #PF (or
#DB) exception for L2, exception.has_payload will be set and the
faulting address (or the new DR6 bits*) will be reported in the
exception_payload field. Similarly, when userspace injects a #PF (or
#DB) into L2 using KVM_SET_VCPU_EVENTS, it is expected to set
exception.has_payload and to put the faulting address - or the new DR6
bits\ [#]_ - in the exception_payload field.

This capability also enables exception.pending in struct
kvm_vcpu_events, which allows userspace to distinguish between pending
and injected exceptions.


.. [#] For the new DR6 bits, note that bit 16 is set iff the #DB exception
       will clear DR6.RTM.

7.18 KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2

:Architectures: x86, arm64, mips
:Parameters: args[0] whether feature should be enabled or not

Valid flags are::

  #define KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE   (1 << 0)
  #define KVM_DIRTY_LOG_INITIALLY_SET           (1 << 1)

With KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE is set, KVM_GET_DIRTY_LOG will not
automatically clear and write-protect all pages that are returned as dirty.
Rather, userspace will have to do this operation separately using
KVM_CLEAR_DIRTY_LOG.

At the cost of a slightly more complicated operation, this provides better
scalability and responsiveness for two reasons.  First,
KVM_CLEAR_DIRTY_LOG ioctl can operate on a 64-page granularity rather
than requiring to sync a full memslot; this ensures that KVM does not
take spinlocks for an extended period of time.  Second, in some cases a
large amount of time can pass between a call to KVM_GET_DIRTY_LOG and
userspace actually using the data in the page.  Pages can be modified
during this time, which is inefficient for both the guest and userspace:
the guest will incur a higher penalty due to write protection faults,
while userspace can see false reports of dirty pages.  Manual reprotection
helps reducing this time, improving guest performance and reducing the
number of dirty log false positives.

With KVM_DIRTY_LOG_INITIALLY_SET set, all the bits of the dirty bitmap
will be initialized to 1 when created.  This also improves performance because
dirty logging can be enabled gradually in small chunks on the first call
to KVM_CLEAR_DIRTY_LOG.  KVM_DIRTY_LOG_INITIALLY_SET depends on
KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE (it is also only available on
x86 and arm64 for now).

KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 was previously available under the name
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT, but the implementation had bugs that make
it hard or impossible to use it correctly.  The availability of
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 signals that those bugs are fixed.
Userspace should not try to use KVM_CAP_MANUAL_DIRTY_LOG_PROTECT.

7.19 KVM_CAP_PPC_SECURE_GUEST
------------------------------

:Architectures: ppc

This capability indicates that KVM is running on a host that has
ultravisor firmware and thus can support a secure guest.  On such a
system, a guest can ask the ultravisor to make it a secure guest,
one whose memory is inaccessible to the host except for pages which
are explicitly requested to be shared with the host.  The ultravisor
notifies KVM when a guest requests to become a secure guest, and KVM
has the opportunity to veto the transition.

If present, this capability can be enabled for a VM, meaning that KVM
will allow the transition to secure guest mode.  Otherwise KVM will
veto the transition.

7.20 KVM_CAP_HALT_POLL
----------------------

:Architectures: all
:Target: VM
:Parameters: args[0] is the maximum poll time in nanoseconds
:Returns: 0 on success; -1 on error

This capability overrides the kvm module parameter halt_poll_ns for the
target VM.

VCPU polling allows a VCPU to poll for wakeup events instead of immediately
scheduling during guest halts. The maximum time a VCPU can spend polling is
controlled by the kvm module parameter halt_poll_ns. This capability allows
the maximum halt time to specified on a per-VM basis, effectively overriding
the module parameter for the target VM.

7.21 KVM_CAP_X86_USER_SPACE_MSR
-------------------------------

:Architectures: x86
:Target: VM
:Parameters: args[0] contains the mask of KVM_MSR_EXIT_REASON_* events to report
:Returns: 0 on success; -1 on error

This capability enables trapping of #GP invoking RDMSR and WRMSR instructions
into user space.

When a guest requests to read or write an MSR, KVM may not implement all MSRs
that are relevant to a respective system. It also does not differentiate by
CPU type.

To allow more fine grained control over MSR handling, user space may enable
this capability. With it enabled, MSR accesses that match the mask specified in
args[0] and trigger a #GP event inside the guest by KVM will instead trigger
KVM_EXIT_X86_RDMSR and KVM_EXIT_X86_WRMSR exit notifications which user space
can then handle to implement model specific MSR handling and/or user notifications
to inform a user that an MSR was not handled.

7.22 KVM_CAP_X86_BUS_LOCK_EXIT
-------------------------------

:Architectures: x86
:Target: VM
:Parameters: args[0] defines the policy used when bus locks detected in guest
:Returns: 0 on success, -EINVAL when args[0] contains invalid bits

Valid bits in args[0] are::

  #define KVM_BUS_LOCK_DETECTION_OFF      (1 << 0)
  #define KVM_BUS_LOCK_DETECTION_EXIT     (1 << 1)

Enabling this capability on a VM provides userspace with a way to select
a policy to handle the bus locks detected in guest. Userspace can obtain
the supported modes from the result of KVM_CHECK_EXTENSION and define it
through the KVM_ENABLE_CAP.

KVM_BUS_LOCK_DETECTION_OFF and KVM_BUS_LOCK_DETECTION_EXIT are supported
currently and mutually exclusive with each other. More bits can be added in
the future.

With KVM_BUS_LOCK_DETECTION_OFF set, bus locks in guest will not cause vm exits
so that no additional actions are needed. This is the default mode.

With KVM_BUS_LOCK_DETECTION_EXIT set, vm exits happen when bus lock detected
in VM. KVM just exits to userspace when handling them. Userspace can enforce
its own throttling or other policy based mitigations.

This capability is aimed to address the thread that VM can exploit bus locks to
degree the performance of the whole system. Once the userspace enable this
capability and select the KVM_BUS_LOCK_DETECTION_EXIT mode, KVM will set the
KVM_RUN_BUS_LOCK flag in vcpu-run->flags field and exit to userspace. Concerning
the bus lock vm exit can be preempted by a higher priority VM exit, the exit
notifications to userspace can be KVM_EXIT_BUS_LOCK or other reasons.
KVM_RUN_BUS_LOCK flag is used to distinguish between them.

7.23 KVM_CAP_PPC_DAWR1
----------------------

:Architectures: ppc
:Parameters: none
:Returns: 0 on success, -EINVAL when CPU doesn't support 2nd DAWR

This capability can be used to check / enable 2nd DAWR feature provided
by POWER10 processor.


7.24 KVM_CAP_VM_COPY_ENC_CONTEXT_FROM
-------------------------------------

Architectures: x86 SEV enabled
Type: vm
Parameters: args[0] is the fd of the source vm
Returns: 0 on success; ENOTTY on error

This capability enables userspace to copy encryption context from the vm
indicated by the fd to the vm this is called on.

This is intended to support in-guest workloads scheduled by the host. This
allows the in-guest workload to maintain its own NPTs and keeps the two vms
from accidentally clobbering each other with interrupts and the like (separate
APIC/MSRs/etc).

7.25 KVM_CAP_SGX_ATTRIBUTE
--------------------------

:Architectures: x86
:Target: VM
:Parameters: args[0] is a file handle of a SGX attribute file in securityfs
:Returns: 0 on success, -EINVAL if the file handle is invalid or if a requested
          attribute is not supported by KVM.

KVM_CAP_SGX_ATTRIBUTE enables a userspace VMM to grant a VM access to one or
more priveleged enclave attributes.  args[0] must hold a file handle to a valid
SGX attribute file corresponding to an attribute that is supported/restricted
by KVM (currently only PROVISIONKEY).

The SGX subsystem restricts access to a subset of enclave attributes to provide
additional security for an uncompromised kernel, e.g. use of the PROVISIONKEY
is restricted to deter malware from using the PROVISIONKEY to obtain a stable
system fingerprint.  To prevent userspace from circumventing such restrictions
by running an enclave in a VM, KVM prevents access to privileged attributes by
default.

See Documentation/x86/sgx.rst for more details.