Skip to content
api.txt 158 KiB
Newer Older
the kernel never checked for flags == 0 and QEMU never pre-zeroed flags and
reserved, these fields can not be used in the future without breaking
compatibility.

If -ENOBUFS is returned the buffer provided was too small and userspace
may retry with a bigger buffer.

4.95 KVM_S390_SET_IRQ_STATE

Capability: KVM_CAP_S390_IRQ_STATE
Architectures: s390
Type: vcpu ioctl
Parameters: struct kvm_s390_irq_state (in)
Returns: 0 on success,
         -EFAULT if the buffer address was invalid,
         -EINVAL for an invalid buffer length (see below),
         -EBUSY if there were already interrupts pending,
         errors occurring when actually injecting the
          interrupt. See KVM_S390_IRQ.

This ioctl allows userspace to set the complete state of all cpu-local
interrupts currently pending for the vcpu. It is intended for restoring
interrupt state after a migration. The input parameter is a userspace buffer
containing a struct kvm_s390_irq_state:

struct kvm_s390_irq_state {
	__u64 buf;
	__u32 flags;        /* will stay unused for compatibility reasons */
	__u32 reserved[4];  /* will stay unused for compatibility reasons */
The restrictions for flags and reserved apply as well.
(see KVM_S390_GET_IRQ_STATE)

The userspace memory referenced by buf contains a struct kvm_s390_irq
for each interrupt to be injected into the guest.
If one of the interrupts could not be injected for some reason the
ioctl aborts.

len must be a multiple of sizeof(struct kvm_s390_irq). It must be > 0
and it must not exceed (max_vcpus + 32) * sizeof(struct kvm_s390_irq),
which is the maximum number of possibly pending cpu-local interrupts.

Capability: KVM_CAP_X86_SMM
Architectures: x86
Type: vcpu ioctl
Parameters: none
Returns: 0 on success, -1 on error

Queues an SMI on the thread's vcpu.

4.97 KVM_CAP_PPC_MULTITCE

Capability: KVM_CAP_PPC_MULTITCE
Architectures: ppc
Type: vm

This capability means the kernel is capable of handling hypercalls
H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
space. This significantly accelerates DMA operations for PPC KVM guests.
User space should expect that its handlers for these hypercalls
are not going to be called if user space previously registered LIOBN
in KVM (via KVM_CREATE_SPAPR_TCE or similar calls).

In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
user space might have to advertise it for the guest. For example,
IBM pSeries (sPAPR) guest starts using them if "hcall-multi-tce" is
present in the "ibm,hypertas-functions" device-tree property.

The hypercalls mentioned above may or may not be processed successfully
in the kernel based fast path. If they can not be handled by the kernel,
they will get passed on to user space. So user space still has to have
an implementation for these despite the in kernel acceleration.

This capability is always enabled.

4.98 KVM_CREATE_SPAPR_TCE_64

Capability: KVM_CAP_SPAPR_TCE_64
Architectures: powerpc
Type: vm ioctl
Parameters: struct kvm_create_spapr_tce_64 (in)
Returns: file descriptor for manipulating the created TCE table

This is an extension for KVM_CAP_SPAPR_TCE which only supports 32bit
windows, described in 4.62 KVM_CREATE_SPAPR_TCE

This capability uses extended struct in ioctl interface:

/* for KVM_CAP_SPAPR_TCE_64 */
struct kvm_create_spapr_tce_64 {
	__u64 liobn;
	__u32 page_shift;
	__u32 flags;
	__u64 offset;	/* in pages */
	__u64 size; 	/* in pages */
};

The aim of extension is to support an additional bigger DMA window with
a variable page size.
KVM_CREATE_SPAPR_TCE_64 receives a 64bit window size, an IOMMU page shift and
a bus offset of the corresponding DMA window, @size and @offset are numbers
of IOMMU pages.

@flags are not used at the moment.

The rest of functionality is identical to KVM_CREATE_SPAPR_TCE.


Capability: KVM_CAP_REINJECT_CONTROL
Architectures: x86
Type: vm ioctl
Parameters: struct kvm_reinject_control (in)
Returns: 0 on success,
         -EFAULT if struct kvm_reinject_control cannot be read,
         -ENXIO if KVM_CREATE_PIT or KVM_CREATE_PIT2 didn't succeed earlier.

i8254 (PIT) has two modes, reinject and !reinject.  The default is reinject,
where KVM queues elapsed i8254 ticks and monitors completion of interrupt from
vector(s) that i8254 injects.  Reinject mode dequeues a tick and injects its
interrupt whenever there isn't a pending interrupt from i8254.
!reinject mode injects an interrupt as soon as a tick arrives.

struct kvm_reinject_control {
	__u8 pit_reinject;
	__u8 reserved[31];
};

pit_reinject = 0 (!reinject mode) is recommended, unless running an old
operating system that uses the PIT for timing (e.g. Linux 2.4.x).

4.100 KVM_PPC_CONFIGURE_V3_MMU

Capability: KVM_CAP_PPC_RADIX_MMU or KVM_CAP_PPC_HASH_MMU_V3
Architectures: ppc
Type: vm ioctl
Parameters: struct kvm_ppc_mmuv3_cfg (in)
Returns: 0 on success,
         -EFAULT if struct kvm_ppc_mmuv3_cfg cannot be read,
         -EINVAL if the configuration is invalid

This ioctl controls whether the guest will use radix or HPT (hashed
page table) translation, and sets the pointer to the process table for
the guest.

struct kvm_ppc_mmuv3_cfg {
	__u64	flags;
	__u64	process_table;
};

There are two bits that can be set in flags; KVM_PPC_MMUV3_RADIX and
KVM_PPC_MMUV3_GTSE.  KVM_PPC_MMUV3_RADIX, if set, configures the guest
to use radix tree translation, and if clear, to use HPT translation.
KVM_PPC_MMUV3_GTSE, if set and if KVM permits it, configures the guest
to be able to use the global TLB and SLB invalidation instructions;
if clear, the guest may not use these instructions.

The process_table field specifies the address and size of the guest
process table, which is in the guest's space.  This field is formatted
as the second doubleword of the partition table entry, as defined in
the Power ISA V3.00, Book III section 5.7.6.1.


Capability: KVM_CAP_PPC_RADIX_MMU
Architectures: ppc
Type: vm ioctl
Parameters: struct kvm_ppc_rmmu_info (out)
Returns: 0 on success,
	 -EFAULT if struct kvm_ppc_rmmu_info cannot be written,
	 -EINVAL if no useful information can be returned

This ioctl returns a structure containing two things: (a) a list
containing supported radix tree geometries, and (b) a list that maps
page sizes to put in the "AP" (actual page size) field for the tlbie
(TLB invalidate entry) instruction.

struct kvm_ppc_rmmu_info {
	struct kvm_ppc_radix_geom {
		__u8	page_shift;
		__u8	level_bits[4];
		__u8	pad[3];
	}	geometries[8];
	__u32	ap_encodings[8];
};

The geometries[] field gives up to 8 supported geometries for the
radix page table, in terms of the log base 2 of the smallest page
size, and the number of bits indexed at each level of the tree, from
the PTE level up to the PGD level in that order.  Any unused entries
will have 0 in the page_shift field.

The ap_encodings gives the supported page sizes and their AP field
encodings, encoded with the AP value in the top 3 bits and the log
base 2 of the page size in the bottom 6 bits.

4.102 KVM_PPC_RESIZE_HPT_PREPARE

Capability: KVM_CAP_SPAPR_RESIZE_HPT
Architectures: powerpc
Type: vm ioctl
Parameters: struct kvm_ppc_resize_hpt (in)
Returns: 0 on successful completion,
	 >0 if a new HPT is being prepared, the value is an estimated
             number of milliseconds until preparation is complete
         -EFAULT if struct kvm_reinject_control cannot be read,
	 -EINVAL if the supplied shift or flags are invalid
	 -ENOMEM if unable to allocate the new HPT
	 -ENOSPC if there was a hash collision when moving existing
                  HPT entries to the new HPT
	 -EIO on other error conditions

Used to implement the PAPR extension for runtime resizing of a guest's
Hashed Page Table (HPT).  Specifically this starts, stops or monitors
the preparation of a new potential HPT for the guest, essentially
implementing the H_RESIZE_HPT_PREPARE hypercall.

If called with shift > 0 when there is no pending HPT for the guest,
this begins preparation of a new pending HPT of size 2^(shift) bytes.
It then returns a positive integer with the estimated number of
milliseconds until preparation is complete.

If called when there is a pending HPT whose size does not match that
requested in the parameters, discards the existing pending HPT and
creates a new one as above.

If called when there is a pending HPT of the size requested, will:
  * If preparation of the pending HPT is already complete, return 0
  * If preparation of the pending HPT has failed, return an error
    code, then discard the pending HPT.
  * If preparation of the pending HPT is still in progress, return an
    estimated number of milliseconds until preparation is complete.

If called with shift == 0, discards any currently pending HPT and
returns 0 (i.e. cancels any in-progress preparation).

flags is reserved for future expansion, currently setting any bits in
flags will result in an -EINVAL.

Normally this will be called repeatedly with the same parameters until
it returns <= 0.  The first call will initiate preparation, subsequent
ones will monitor preparation until it completes or fails.

struct kvm_ppc_resize_hpt {
	__u64 flags;
	__u32 shift;
	__u32 pad;
};

4.103 KVM_PPC_RESIZE_HPT_COMMIT

Capability: KVM_CAP_SPAPR_RESIZE_HPT
Architectures: powerpc
Type: vm ioctl
Parameters: struct kvm_ppc_resize_hpt (in)
Returns: 0 on successful completion,
         -EFAULT if struct kvm_reinject_control cannot be read,
	 -EINVAL if the supplied shift or flags are invalid
	 -ENXIO is there is no pending HPT, or the pending HPT doesn't
                 have the requested size
	 -EBUSY if the pending HPT is not fully prepared
	 -ENOSPC if there was a hash collision when moving existing
                  HPT entries to the new HPT
	 -EIO on other error conditions

Used to implement the PAPR extension for runtime resizing of a guest's
Hashed Page Table (HPT).  Specifically this requests that the guest be
transferred to working with the new HPT, essentially implementing the
H_RESIZE_HPT_COMMIT hypercall.

This should only be called after KVM_PPC_RESIZE_HPT_PREPARE has
returned 0 with the same parameters.  In other cases
KVM_PPC_RESIZE_HPT_COMMIT will return an error (usually -ENXIO or
-EBUSY, though others may be possible if the preparation was started,
but failed).

This will have undefined effects on the guest if it has not already
placed itself in a quiescent state where no vcpu will make MMU enabled
memory accesses.

On succsful completion, the pending HPT will become the guest's active
HPT and the previous HPT will be discarded.

On failure, the guest will still be operating on its previous HPT.

struct kvm_ppc_resize_hpt {
	__u64 flags;
	__u32 shift;
	__u32 pad;
};

4.104 KVM_X86_GET_MCE_CAP_SUPPORTED

Capability: KVM_CAP_MCE
Architectures: x86
Type: system ioctl
Parameters: u64 mce_cap (out)
Returns: 0 on success, -1 on error

Returns supported MCE capabilities. The u64 mce_cap parameter
has the same format as the MSR_IA32_MCG_CAP register. Supported
capabilities will have the corresponding bits set.

4.105 KVM_X86_SETUP_MCE

Capability: KVM_CAP_MCE
Architectures: x86
Type: vcpu ioctl
Parameters: u64 mcg_cap (in)
Returns: 0 on success,
         -EFAULT if u64 mcg_cap cannot be read,
         -EINVAL if the requested number of banks is invalid,
         -EINVAL if requested MCE capability is not supported.

Initializes MCE support for use. The u64 mcg_cap parameter
has the same format as the MSR_IA32_MCG_CAP register and
specifies which capabilities should be enabled. The maximum
supported number of error-reporting banks can be retrieved when
checking for KVM_CAP_MCE. The supported capabilities can be
retrieved with KVM_X86_GET_MCE_CAP_SUPPORTED.

4.106 KVM_X86_SET_MCE

Capability: KVM_CAP_MCE
Architectures: x86
Type: vcpu ioctl
Parameters: struct kvm_x86_mce (in)
Returns: 0 on success,
         -EFAULT if struct kvm_x86_mce cannot be read,
         -EINVAL if the bank number is invalid,
         -EINVAL if VAL bit is not set in status field.

Inject a machine check error (MCE) into the guest. The input
parameter is:

struct kvm_x86_mce {
	__u64 status;
	__u64 addr;
	__u64 misc;
	__u64 mcg_status;
	__u8 bank;
	__u8 pad1[7];
	__u64 pad2[3];
};

If the MCE being reported is an uncorrected error, KVM will
inject it as an MCE exception into the guest. If the guest
MCG_STATUS register reports that an MCE is in progress, KVM
causes an KVM_EXIT_SHUTDOWN vmexit.

Otherwise, if the MCE is a corrected error, KVM will just
store it in the corresponding bank (provided this bank is
not holding a previously reported uncorrected error).

4.107 KVM_S390_GET_CMMA_BITS

Capability: KVM_CAP_S390_CMMA_MIGRATION
Architectures: s390
Type: vm ioctl
Parameters: struct kvm_s390_cmma_log (in, out)
Returns: 0 on success, a negative value on error

This ioctl is used to get the values of the CMMA bits on the s390
architecture. It is meant to be used in two scenarios:
- During live migration to save the CMMA values. Live migration needs
  to be enabled via the KVM_REQ_START_MIGRATION VM property.
- To non-destructively peek at the CMMA values, with the flag
  KVM_S390_CMMA_PEEK set.

The ioctl takes parameters via the kvm_s390_cmma_log struct. The desired
values are written to a buffer whose location is indicated via the "values"
member in the kvm_s390_cmma_log struct.  The values in the input struct are
also updated as needed.
Each CMMA value takes up one byte.

struct kvm_s390_cmma_log {
	__u64 start_gfn;
	__u32 count;
	__u32 flags;
	union {
		__u64 remaining;
		__u64 mask;
	};
	__u64 values;
};

start_gfn is the number of the first guest frame whose CMMA values are
to be retrieved,

count is the length of the buffer in bytes,

values points to the buffer where the result will be written to.

If count is greater than KVM_S390_SKEYS_MAX, then it is considered to be
KVM_S390_SKEYS_MAX. KVM_S390_SKEYS_MAX is re-used for consistency with
other ioctls.

The result is written in the buffer pointed to by the field values, and
the values of the input parameter are updated as follows.

Depending on the flags, different actions are performed. The only
supported flag so far is KVM_S390_CMMA_PEEK.

The default behaviour if KVM_S390_CMMA_PEEK is not set is:
start_gfn will indicate the first page frame whose CMMA bits were dirty.
It is not necessarily the same as the one passed as input, as clean pages
are skipped.

count will indicate the number of bytes actually written in the buffer.
It can (and very often will) be smaller than the input value, since the
buffer is only filled until 16 bytes of clean values are found (which
are then not copied in the buffer). Since a CMMA migration block needs
the base address and the length, for a total of 16 bytes, we will send
back some clean data if there is some dirty data afterwards, as long as
the size of the clean data does not exceed the size of the header. This
allows to minimize the amount of data to be saved or transferred over
the network at the expense of more roundtrips to userspace. The next
invocation of the ioctl will skip over all the clean values, saving
potentially more than just the 16 bytes we found.

If KVM_S390_CMMA_PEEK is set:
the existing storage attributes are read even when not in migration
mode, and no other action is performed;

the output start_gfn will be equal to the input start_gfn,

the output count will be equal to the input count, except if the end of
memory has been reached.

In both cases:
the field "remaining" will indicate the total number of dirty CMMA values
still remaining, or 0 if KVM_S390_CMMA_PEEK is set and migration mode is
not enabled.

mask is unused.

values points to the userspace buffer where the result will be stored.

This ioctl can fail with -ENOMEM if not enough memory can be allocated to
complete the task, with -ENXIO if CMMA is not enabled, with -EINVAL if
KVM_S390_CMMA_PEEK is not set but migration mode was not enabled, with
-EFAULT if the userspace address is invalid or if no page table is
present for the addresses (e.g. when using hugepages).

4.108 KVM_S390_SET_CMMA_BITS

Capability: KVM_CAP_S390_CMMA_MIGRATION
Architectures: s390
Type: vm ioctl
Parameters: struct kvm_s390_cmma_log (in)
Returns: 0 on success, a negative value on error

This ioctl is used to set the values of the CMMA bits on the s390
architecture. It is meant to be used during live migration to restore
the CMMA values, but there are no restrictions on its use.
The ioctl takes parameters via the kvm_s390_cmma_values struct.
Each CMMA value takes up one byte.

struct kvm_s390_cmma_log {
	__u64 start_gfn;
	__u32 count;
	__u32 flags;
	union {
		__u64 remaining;
		__u64 mask;
	};
	__u64 values;
};

start_gfn indicates the starting guest frame number,

count indicates how many values are to be considered in the buffer,

flags is not used and must be 0.

mask indicates which PGSTE bits are to be considered.

remaining is not used.

values points to the buffer in userspace where to store the values.

This ioctl can fail with -ENOMEM if not enough memory can be allocated to
complete the task, with -ENXIO if CMMA is not enabled, with -EINVAL if
the count field is too large (e.g. more than KVM_S390_CMMA_SIZE_MAX) or
if the flags field was not 0, with -EFAULT if the userspace address is
invalid, if invalid pages are written to (e.g. after the end of memory)
or if no page table is present for the addresses (e.g. when using
hugepages).


Capability: KVM_CAP_PPC_GET_CPU_CHAR
Architectures: powerpc
Type: vm ioctl
Parameters: struct kvm_ppc_cpu_char (out)
Returns: 0 on successful completion
	 -EFAULT if struct kvm_ppc_cpu_char cannot be written

This ioctl gives userspace information about certain characteristics
of the CPU relating to speculative execution of instructions and
possible information leakage resulting from speculative execution (see
CVE-2017-5715, CVE-2017-5753 and CVE-2017-5754).  The information is
returned in struct kvm_ppc_cpu_char, which looks like this:

struct kvm_ppc_cpu_char {
	__u64	character;		/* characteristics of the CPU */
	__u64	behaviour;		/* recommended software behaviour */
	__u64	character_mask;		/* valid bits in character */
	__u64	behaviour_mask;		/* valid bits in behaviour */
};

For extensibility, the character_mask and behaviour_mask fields
indicate which bits of character and behaviour have been filled in by
the kernel.  If the set of defined bits is extended in future then
userspace will be able to tell whether it is running on a kernel that
knows about the new bits.

The character field describes attributes of the CPU which can help
with preventing inadvertent information disclosure - specifically,
whether there is an instruction to flash-invalidate the L1 data cache
(ori 30,30,0 or mtspr SPRN_TRIG2,rN), whether the L1 data cache is set
to a mode where entries can only be used by the thread that created
them, whether the bcctr[l] instruction prevents speculation, and
whether a speculation barrier instruction (ori 31,31,0) is provided.

The behaviour field describes actions that software should take to
prevent inadvertent information disclosure, and thus describes which
vulnerabilities the hardware is subject to; specifically whether the
L1 data cache should be flushed when returning to user mode from the
kernel, and whether a speculation barrier should be placed between an
array bounds check and the array access.

These fields use the same bit definitions as the new
H_GET_CPU_CHARACTERISTICS hypercall.


Capability: basic
Architectures: x86
Type: system
Parameters: an opaque platform specific structure (in/out)
Returns: 0 on success; -1 on error

If the platform supports creating encrypted VMs then this ioctl can be used
for issuing platform-specific memory encryption commands to manage those
encrypted VMs.

Currently, this ioctl is used for issuing Secure Encrypted Virtualization
(SEV) commands on AMD Processors. The SEV commands are defined in
Documentation/virtual/kvm/amd-memory-encryption.rst.

Capability: basic
Architectures: x86
Type: system
Parameters: struct kvm_enc_region (in)
Returns: 0 on success; -1 on error

This ioctl can be used to register a guest memory region which may
contain encrypted data (e.g. guest RAM, SMRAM etc).

It is used in the SEV-enabled guest. When encryption is enabled, a guest
memory region may contain encrypted data. The SEV memory encryption
engine uses a tweak such that two identical plaintext pages, each at
different locations will have differing ciphertexts. So swapping or
moving ciphertext of those pages will not result in plaintext being
swapped. So relocating (or migrating) physical backing pages for the SEV
guest will require some additional steps.

Note: The current SEV key management spec does not provide commands to
swap or migrate (move) ciphertext pages. Hence, for now we pin the guest
memory region registered with the ioctl.


Capability: basic
Architectures: x86
Type: system
Parameters: struct kvm_enc_region (in)
Returns: 0 on success; -1 on error

This ioctl can be used to unregister the guest memory region registered
with KVM_MEMORY_ENCRYPT_REG_REGION ioctl above.

4.113 KVM_HYPERV_EVENTFD

Capability: KVM_CAP_HYPERV_EVENTFD
Architectures: x86
Type: vm ioctl
Parameters: struct kvm_hyperv_eventfd (in)

This ioctl (un)registers an eventfd to receive notifications from the guest on
the specified Hyper-V connection id through the SIGNAL_EVENT hypercall, without
causing a user exit.  SIGNAL_EVENT hypercall with non-zero event flag number
(bits 24-31) still triggers a KVM_EXIT_HYPERV_HCALL user exit.

struct kvm_hyperv_eventfd {
	__u32 conn_id;
	__s32 fd;
	__u32 flags;
	__u32 padding[3];
};

The conn_id field should fit within 24 bits:

#define KVM_HYPERV_CONN_ID_MASK		0x00ffffff

The acceptable values for the flags field are:

#define KVM_HYPERV_EVENTFD_DEASSIGN	(1 << 0)

Returns: 0 on success,
	-EINVAL if conn_id or flags is outside the allowed range
	-ENOENT on deassign if the conn_id isn't registered
	-EEXIST on assign if the conn_id is already registered

4.114 KVM_GET_NESTED_STATE

Capability: KVM_CAP_NESTED_STATE
Architectures: x86
Type: vcpu ioctl
Parameters: struct kvm_nested_state (in/out)
Returns: 0 on success, -1 on error
Errors:
  E2BIG:     the total state size (including the fixed-size part of struct
             kvm_nested_state) exceeds the value of 'size' specified by
             the user; the size required will be written into size.

struct kvm_nested_state {
	__u16 flags;
	__u16 format;
	__u32 size;
	union {
		struct kvm_vmx_nested_state vmx;
		struct kvm_svm_nested_state svm;
		__u8 pad[120];
	};
	__u8 data[0];
};

#define KVM_STATE_NESTED_GUEST_MODE	0x00000001
#define KVM_STATE_NESTED_RUN_PENDING	0x00000002

#define KVM_STATE_NESTED_SMM_GUEST_MODE	0x00000001
#define KVM_STATE_NESTED_SMM_VMXON	0x00000002

struct kvm_vmx_nested_state {
	__u64 vmxon_pa;
	__u64 vmcs_pa;

	struct {
		__u16 flags;
	} smm;
};

This ioctl copies the vcpu's nested virtualization state from the kernel to
userspace.

The maximum size of the state, including the fixed-size part of struct
kvm_nested_state, can be retrieved by passing KVM_CAP_NESTED_STATE to
the KVM_CHECK_EXTENSION ioctl().

4.115 KVM_SET_NESTED_STATE

Capability: KVM_CAP_NESTED_STATE
Architectures: x86
Type: vcpu ioctl
Parameters: struct kvm_nested_state (in)
Returns: 0 on success, -1 on error

This copies the vcpu's kvm_nested_state struct from userspace to the kernel.  For
the definition of struct kvm_nested_state, see KVM_GET_NESTED_STATE.
Avi Kivity's avatar
Avi Kivity committed
5. The kvm_run structure
------------------------
Avi Kivity's avatar
Avi Kivity committed

Application code obtains a pointer to the kvm_run structure by
mmap()ing a vcpu fd.  From that point, application code can control
execution by changing fields in kvm_run prior to calling the KVM_RUN
ioctl, and obtain information about the reason KVM_RUN returned by
looking up structure members.

struct kvm_run {
	/* in */
	__u8 request_interrupt_window;

Request that KVM_RUN return when it becomes possible to inject external
interrupts into the guest.  Useful in conjunction with KVM_INTERRUPT.

	__u8 immediate_exit;

This field is polled once when KVM_RUN starts; if non-zero, KVM_RUN
exits immediately, returning -EINTR.  In the common scenario where a
signal is used to "kick" a VCPU out of KVM_RUN, this field can be used
to avoid usage of KVM_SET_SIGNAL_MASK, which has worse scalability.
Rather than blocking the signal outside KVM_RUN, userspace can set up
a signal handler that sets run->immediate_exit to a non-zero value.

This field is ignored if KVM_CAP_IMMEDIATE_EXIT is not available.

	__u8 padding1[6];
Avi Kivity's avatar
Avi Kivity committed

	/* out */
	__u32 exit_reason;

When KVM_RUN has returned successfully (return value 0), this informs
application code why KVM_RUN has returned.  Allowable values for this
field are detailed below.

	__u8 ready_for_interrupt_injection;

If request_interrupt_window has been specified, this field indicates
an interrupt can be injected now with KVM_INTERRUPT.

	__u8 if_flag;

The value of the current interrupt flag.  Only valid if in-kernel
local APIC is not used.

	__u16 flags;

More architecture-specific flags detailing state of the VCPU that may
affect the device's behavior.  The only currently defined flag is
KVM_RUN_X86_SMM, which is valid on x86 machines and is set if the
VCPU is in system management mode.
Avi Kivity's avatar
Avi Kivity committed

	/* in (pre_kvm_run), out (post_kvm_run) */
	__u64 cr8;

The value of the cr8 register.  Only valid if in-kernel local APIC is
not used.  Both input and output.

	__u64 apic_base;

The value of the APIC BASE msr.  Only valid if in-kernel local
APIC is not used.  Both input and output.

	union {
		/* KVM_EXIT_UNKNOWN */
		struct {
			__u64 hardware_exit_reason;
		} hw;

If exit_reason is KVM_EXIT_UNKNOWN, the vcpu has exited due to unknown
reasons.  Further architecture-specific information is available in
hardware_exit_reason.

		/* KVM_EXIT_FAIL_ENTRY */
		struct {
			__u64 hardware_entry_failure_reason;
		} fail_entry;

If exit_reason is KVM_EXIT_FAIL_ENTRY, the vcpu could not be run due
to unknown reasons.  Further architecture-specific information is
available in hardware_entry_failure_reason.

		/* KVM_EXIT_EXCEPTION */
		struct {
			__u32 exception;
			__u32 error_code;
		} ex;

Unused.

		/* KVM_EXIT_IO */
		struct {
#define KVM_EXIT_IO_IN  0
#define KVM_EXIT_IO_OUT 1
			__u8 direction;
			__u8 size; /* bytes */
			__u16 port;
			__u32 count;
			__u64 data_offset; /* relative to kvm_run start */
		} io;

If exit_reason is KVM_EXIT_IO, then the vcpu has
Avi Kivity's avatar
Avi Kivity committed
executed a port I/O instruction which could not be satisfied by kvm.
data_offset describes where the data is located (KVM_EXIT_IO_OUT) or
where kvm expects application code to place the data for the next
KVM_RUN invocation (KVM_EXIT_IO_IN).  Data format is a packed array.
		/* KVM_EXIT_DEBUG */
Avi Kivity's avatar
Avi Kivity committed
		struct {
			struct kvm_debug_exit_arch arch;
		} debug;

If the exit_reason is KVM_EXIT_DEBUG, then a vcpu is processing a debug event
for which architecture specific information is returned.
Avi Kivity's avatar
Avi Kivity committed

		/* KVM_EXIT_MMIO */
		struct {
			__u64 phys_addr;
			__u8  data[8];
			__u32 len;
			__u8  is_write;
		} mmio;

If exit_reason is KVM_EXIT_MMIO, then the vcpu has
Avi Kivity's avatar
Avi Kivity committed
executed a memory-mapped I/O instruction which could not be satisfied
by kvm.  The 'data' member contains the written data if 'is_write' is
true, and should be filled by application code otherwise.

The 'data' member contains, in its first 'len' bytes, the value as it would
appear if the VCPU performed a load or store of the appropriate width directly
to the byte array.

NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_OSI, KVM_EXIT_PAPR and
      KVM_EXIT_EPR the corresponding
operations are complete (and guest state is consistent) only after userspace
has re-entered the kernel with KVM_RUN.  The kernel side will first finish
incomplete operations and then check for pending signals.  Userspace
can re-enter the guest with an unmasked signal pending to complete
pending operations.

Avi Kivity's avatar
Avi Kivity committed
		/* KVM_EXIT_HYPERCALL */
		struct {
			__u64 nr;
			__u64 args[6];
			__u64 ret;
			__u32 longmode;
			__u32 pad;
		} hypercall;

Unused.  This was once used for 'hypercall to userspace'.  To implement
such functionality, use KVM_EXIT_IO (x86) or KVM_EXIT_MMIO (all except s390).
Note KVM_EXIT_IO is significantly faster than KVM_EXIT_MMIO.
Avi Kivity's avatar
Avi Kivity committed

		/* KVM_EXIT_TPR_ACCESS */
		struct {
			__u64 rip;
			__u32 is_write;
			__u32 pad;
		} tpr_access;

To be documented (KVM_TPR_ACCESS_REPORTING).

		/* KVM_EXIT_S390_SIEIC */
		struct {
			__u8 icptcode;
			__u64 mask; /* psw upper half */
			__u64 addr; /* psw lower half */
			__u16 ipa;
			__u32 ipb;
		} s390_sieic;

s390 specific.

		/* KVM_EXIT_S390_RESET */
#define KVM_S390_RESET_POR       1
#define KVM_S390_RESET_CLEAR     2
#define KVM_S390_RESET_SUBSYSTEM 4
#define KVM_S390_RESET_CPU_INIT  8
#define KVM_S390_RESET_IPL       16
		__u64 s390_reset_flags;

s390 specific.

		/* KVM_EXIT_S390_UCONTROL */
		struct {
			__u64 trans_exc_code;
			__u32 pgm_code;
		} s390_ucontrol;

s390 specific. A page fault has occurred for a user controlled virtual
machine (KVM_VM_S390_UNCONTROL) on it's host page table that cannot be
resolved by the kernel.
The program code and the translation exception code that were placed
in the cpu's lowcore are presented here as defined by the z Architecture
Principles of Operation Book in the Chapter for Dynamic Address Translation
(DAT)

Avi Kivity's avatar
Avi Kivity committed
		/* KVM_EXIT_DCR */
		struct {
			__u32 dcrn;
			__u32 data;
			__u8  is_write;
		} dcr;

Deprecated - was used for 440 KVM.
		/* KVM_EXIT_OSI */
		struct {
			__u64 gprs[32];
		} osi;

MOL uses a special hypercall interface it calls 'OSI'. To enable it, we catch
hypercalls and exit with this exit struct that contains all the guest gprs.

If exit_reason is KVM_EXIT_OSI, then the vcpu has triggered such a hypercall.
Userspace can now handle the hypercall and when it's done modify the gprs as
necessary. Upon guest entry all guest GPRs will then be replaced by the values
in this struct.

		/* KVM_EXIT_PAPR_HCALL */
		struct {
			__u64 nr;
			__u64 ret;
			__u64 args[9];
		} papr_hcall;

This is used on 64-bit PowerPC when emulating a pSeries partition,
e.g. with the 'pseries' machine type in qemu.  It occurs when the
guest does a hypercall using the 'sc 1' instruction.  The 'nr' field
contains the hypercall number (from the guest R3), and 'args' contains
the arguments (from the guest R4 - R12).  Userspace should put the
return code in 'ret' and any extra returned values in args[].
The possible hypercalls are defined in the Power Architecture Platform
Requirements (PAPR) document available from www.power.org (free
developer registration required to access it).

		/* KVM_EXIT_S390_TSCH */
		struct {
			__u16 subchannel_id;
			__u16 subchannel_nr;
			__u32 io_int_parm;
			__u32 io_int_word;
			__u32 ipb;
			__u8 dequeued;
		} s390_tsch;

s390 specific. This exit occurs when KVM_CAP_S390_CSS_SUPPORT has been enabled
and TEST SUBCHANNEL was intercepted. If dequeued is set, a pending I/O
interrupt for the target subchannel has been dequeued and subchannel_id,
subchannel_nr, io_int_parm and io_int_word contain the parameters for that
interrupt. ipb is needed for instruction parameter decoding.

		/* KVM_EXIT_EPR */
		struct {
			__u32 epr;
		} epr;

On FSL BookE PowerPC chips, the interrupt controller has a fast patch
interrupt acknowledge path to the core. When the core successfully
delivers an interrupt, it automatically populates the EPR register with
the interrupt vector number and acknowledges the interrupt inside
the interrupt controller.

In case the interrupt controller lives in user space, we need to do
the interrupt acknowledge cycle through it to fetch the next to be
delivered interrupt vector using this exit.

It gets triggered whenever both KVM_CAP_PPC_EPR are enabled and an
external interrupt has just been delivered into the guest. User space
should put the acknowledged interrupt vector into the 'epr' field.

		/* KVM_EXIT_SYSTEM_EVENT */
		struct {
#define KVM_SYSTEM_EVENT_SHUTDOWN       1
#define KVM_SYSTEM_EVENT_RESET          2
#define KVM_SYSTEM_EVENT_CRASH          3
			__u32 type;
			__u64 flags;
		} system_event;

If exit_reason is KVM_EXIT_SYSTEM_EVENT then the vcpu has triggered
a system-level event using some architecture specific mechanism (hypercall
or some special instruction). In case of ARM/ARM64, this is triggered using
HVC instruction based PSCI call from the vcpu. The 'type' field describes
the system-level event type. The 'flags' field describes architecture
specific flags for the system-level event.

Valid values for 'type' are:
  KVM_SYSTEM_EVENT_SHUTDOWN -- the guest has requested a shutdown of the
   VM. Userspace is not obliged to honour this, and if it does honour
   this does not need to destroy the VM synchronously (ie it may call
   KVM_RUN again before shutdown finally occurs).
  KVM_SYSTEM_EVENT_RESET -- the guest has requested a reset of the VM.
   As with SHUTDOWN, userspace can choose to ignore the request, or
   to schedule the reset to occur in the future and may call KVM_RUN again.
  KVM_SYSTEM_EVENT_CRASH -- the guest crash occurred and the guest
   has requested a crash condition maintenance. Userspace can choose
   to ignore the request, or to gather VM memory core dump and/or
   reset/shutdown of the VM.
		/* KVM_EXIT_IOAPIC_EOI */
		struct {
			__u8 vector;
		} eoi;

Indicates that the VCPU's in-kernel local APIC received an EOI for a
level-triggered IOAPIC interrupt.  This exit only triggers when the
IOAPIC is implemented in userspace (i.e. KVM_CAP_SPLIT_IRQCHIP is enabled);
the userspace IOAPIC should process the EOI and retrigger the interrupt if
it is still asserted.  Vector is the LAPIC interrupt vector for which the
EOI was received.

		struct kvm_hyperv_exit {
#define KVM_EXIT_HYPERV_SYNIC          1
#define KVM_EXIT_HYPERV_HCALL          2
			__u32 type;
			union {
				struct {
					__u32 msr;
					__u64 control;
					__u64 evt_page;