Commits · d6b6d166864fa97ca3b1ed1a5c62fd3b53d4606f · jan.koester / Linux

Mar 05, 2012

KVM: s390: ucontrol: export SIE control block to user · 5b1c1493

Carsten Otte authored Jan 04, 2012



This patch exports the s390 SIE hardware control block to userspace
via the mapping of the vcpu file descriptor. In order to do so,
a new arch callback named kvm_arch_vcpu_fault  is introduced for all
architectures. It allows to map architecture specific pages.

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

5b1c1493

KVM: s390: ucontrol: export page faults to user · e168bf8d

Carsten Otte authored Jan 04, 2012



This patch introduces a new exit reason in the kvm_run structure
named KVM_EXIT_S390_UCONTROL. This exit indicates, that a virtual cpu
has regognized a fault on the host page table. The idea is that
userspace can handle this fault by mapping memory at the fault
location into the cpu's address space and then continue to run the
virtual cpu.

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

e168bf8d

KVM: s390: ucontrol: per vcpu address spaces · 27e0393f

Carsten Otte authored Jan 04, 2012



This patch introduces two ioctls for virtual cpus, that are only
valid for kernel virtual machines that are controlled by userspace.
Each virtual cpu has its individual address space in this mode of
operation, and each address space is backed by the gmap
implementation just like the address space for regular KVM guests.
KVM_S390_UCAS_MAP allows to map a part of the user's virtual address
space to the vcpu. Starting offset and length in both the user and
the vcpu address space need to be aligned to 1M.
KVM_S390_UCAS_UNMAP can be used to unmap a range of memory from a
virtual cpu in a similar way.

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

27e0393f

KVM: s390: add parameter for KVM_CREATE_VM · e08b9637

Carsten Otte authored Jan 04, 2012



This patch introduces a new config option for user controlled kernel
virtual machines. It introduces a parameter to KVM_CREATE_VM that
allows to set bits that alter the capabilities of the newly created
virtual machine.
The parameter is passed to kvm_arch_init_vm for all architectures.
The only valid modifier bit for now is KVM_VM_S390_UCONTROL.
This requires CAP_SYS_ADMIN privileges and creates a user controlled
virtual machine on s390 architectures.

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

e08b9637

Jan 27, 2012

lguest: remove reference from Documentation/virtual/00-INDEX · 13289d5f

Rusty Russell authored Jan 28, 2012



We're in tools/lguest now.

Reported-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

13289d5f

Jan 12, 2012

lguest: move the lguest tool to the tools directory · 07fe9977

Davidlohr Bueso authored Jan 12, 2012



This is a better location instead of having it in Documentation.

Signed-off-by: Davidlohr Bueso <dave@gnu.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (fixed compile)

07fe9977

Dec 27, 2011
- KVM: Document KVM_NMI · 3f745f1e
  Avi Kivity authored Dec 07, 2011
  
  Signed-off-by: Avi Kivity <avi@redhat.com>
  3f745f1e
Dec 26, 2011

KVM: Don't automatically expose the TSC deadline timer in cpuid · 4d25a066

Jan Kiszka authored Dec 21, 2011



Unlike all of the other cpuid bits, the TSC deadline timer bit is set
unconditionally, regardless of what userspace wants.

This is broken in several ways:
 - if userspace doesn't use KVM_CREATE_IRQCHIP, and doesn't emulate the TSC
   deadline timer feature, a guest that uses the feature will break
 - live migration to older host kernels that don't support the TSC deadline
   timer will cause the feature to be pulled from under the guest's feet;
   breaking it
 - guests that are broken wrt the feature will fail.

Fix by not enabling the feature automatically; instead report it to userspace.
Because the feature depends on KVM_CREATE_IRQCHIP, which we cannot guarantee
will be called, we expose it via a KVM_CAP_TSC_DEADLINE_TIMER and not
KVM_GET_SUPPORTED_CPUID.

Fixes the Illumos guest kernel, which uses the TSC deadline timer feature.

[avi: add the KVM_CAP + documentation]

Reported-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
Tested-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

4d25a066

Dec 25, 2011

KVM: Device assignment permission checks · 3d27e23b

Alex Williamson authored Dec 20, 2011



Only allow KVM device assignment to attach to devices which:

 - Are not bridges
 - Have BAR resources (assume others are special devices)
 - The user has permissions to use

Assigning a bridge is a configuration error, it's not supported, and
typically doesn't result in the behavior the user is expecting anyway.
Devices without BAR resources are typically chipset components that
also don't have host drivers.  We don't want users to hold such devices
captive or cause system problems by fencing them off into an iommu
domain.  We determine "permission to use" by testing whether the user
has access to the PCI sysfs resource files.  By default a normal user
will not have access to these files, so it provides a good indication
that an administration agent has granted the user access to the device.

[Yang Bai: add missing #include]
[avi: fix comment style]

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Yang Bai <hamo.by@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

3d27e23b

KVM: Remove ability to assign a device without iommu support · 42387373

Alex Williamson authored Dec 20, 2011



This option has no users and it exposes a security hole that we
can allow devices to be assigned without iommu protection.  Make
KVM_DEV_ASSIGN_ENABLE_IOMMU a mandatory option.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

42387373

Nov 02, 2011

UserModeLinux-HOWTO.txt: fix a typo · c039aff6

Jonathan Neuschäfer authored Aug 15, 2011



Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
Signed-off-by: Richard Weinberger <richard@nod.at>

c039aff6

UserModeLinux-HOWTO.txt: remove ^H characters · 8a91db29

Jonathan Neuschäfer authored Aug 12, 2011



If you can't read this patch, please run:

	sed -i -e "s/[^\o10]\o10//g" \
		Documentation/virtual/uml/UserModeLinux-HOWTO.txt

Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
Signed-off-by: Richard Weinberger <richard@nod.at>

8a91db29

Sep 27, 2011

doc: fix broken references · 395cf969

Paul Bolle authored Aug 15, 2011



There are numerous broken references to Documentation files (in other
Documentation files, in comments, etc.). These broken references are
caused by typo's in the references, and by renames or removals of the
Documentation files. Some broken references are simply odd.

Fix these broken references, sometimes by dropping the irrelevant text
they were part of.

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>

395cf969

Sep 25, 2011

KVM: Update documentation to include detailed ENABLE_CAP description · 821246a5

Alexander Graf authored Aug 31, 2011



We have an ioctl that enables capabilities individually, but no description
on what exactly happens when we enable a capability using this ioctl.

This patch adds documentation for capability enabling in a new section
of the API documentation.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

821246a5

KVM: Restore missing powerpc API docs · 36442687

Avi Kivity authored Aug 29, 2011



Commit 371fefd6 lost a doc hunk somehow, restore it.

Signed-off-by: Avi Kivity <avi@redhat.com>

36442687

KVM: x86: Raise the hard VCPU count limit · 8c3ba334

Sasha Levin authored Jul 18, 2011



The patch raises the hard limit of VCPU count to 254.

This will allow developers to easily work on scalability
and will allow users to test high VCPU setups easily without
patching the kernel.

To prevent possible issues with current setups, KVM_CAP_NR_VCPUS
now returns the recommended VCPU limit (which is still 64) - this
should be a safe value for everybody, while a new KVM_CAP_MAX_VCPUS
returns the hard limit which is now 254.

Cc: Avi Kivity <avi@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Suggested-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

8c3ba334

Aug 15, 2011

lguest: allow booting guest with CONFIG_RELOCATABLE=y · e22a5398

Rusty Russell authored Aug 15, 2011



The CONFIG_RELOCATABLE code tries to align the unpack destination to
the value of 'kernel_alignment' in the setup_hdr.  If that's 0, it
tries to unpack to address 0, which in fact causes the gunzip code
to call 'error("Out of memory while allocating output buffer")'.

The bootloader (ie. the lguest Launcher in this case) should be doing
setting this field; the normal bzImage is 16M, we can use the same.

Reported-by: Stefanos Geraggelos <sgerag@cslab.ece.ntua.gr>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: stable@kernel.org

e22a5398

virtio: Add text copy of spec to Documentation/virtual. · c3c53a07
Rusty Russell authored Aug 15, 2011
```
As suggested by Christoph Hellwig.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
```
c3c53a07

Jul 22, 2011

lguest: update comments · 9f54288d

Rusty Russell authored Jul 22, 2011



Also removes a long-unused #define and an extraneous semicolon.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

9f54288d

lguest: Simplify device initialization. · 3c3ed482

Rusty Russell authored Jul 22, 2011



We used to notify the Host every time we updated a device's status.  However,
it only really needs to know when we're resetting the device, or failed to
initialize it, or when we've finished our feature negotiation.

In particular, we used to wait for VIRTIO_CONFIG_S_DRIVER_OK in the
status byte before starting the device service threads.  But this
corresponds to the successful finish of device initialization, which
might (like virtio_blk's partition scanning) use the device.  So we
had a hack, if they used the device before we expected we started the
threads anyway.

Now we hook into the finalize_features hook in the Guest: at that
point we tell the Launcher that it can rely on the features we have
acked.  On the Launcher side, we look at the status at that point, and
start servicing the device.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

3c3ed482

lguest: Do not exit on non-fatal errors · e0377e25

Sakari Ailus authored Jun 26, 2011



Do not exit on some non-fatal errors:

- writev() fails in net_output(). The result is a lost packet or packets.
- writev() fails in console_output(). The result is partially lost console
output.
- readv() fails in net_input(). The result is a lost packet or packets.

Rather than bringing the guest down, this patch ignores e.g. an allocation
failure on the host side. Example:

lguest: page allocation failure. order:4, mode:0x4d0
Pid: 4045, comm: lguest Tainted: G        W   2.6.36 #1
Call Trace:
 [<c138d614>] ? printk+0x18/0x1c
 [<c106a4e2>] __alloc_pages_nodemask+0x4d2/0x570
 [<c1087954>] cache_alloc_refill+0x2a4/0x4d0
 [<c1305149>] ? __netif_receive_skb+0x189/0x270
 [<c1087c5a>] __kmalloc+0xda/0xf0
 [<c12fffa5>] __alloc_skb+0x55/0x100
 [<c1305519>] ? net_rx_action+0x79/0x100
 [<c12fafed>] sock_alloc_send_pskb+0x18d/0x280
 [<c11fda25>] ? _copy_from_user+0x35/0x130
 [<c13010b6>] ? memcpy_fromiovecend+0x56/0x80
 [<c12a74dc>] tun_chr_aio_write+0x1cc/0x500
 [<c108a125>] do_sync_readv_writev+0x95/0xd0
 [<c11fda25>] ? _copy_from_user+0x35/0x130
 [<c1089fa8>] ? rw_copy_check_uvector+0x58/0x100
 [<c108a7bc>] do_readv_writev+0x9c/0x1d0
 [<c12a7310>] ? tun_chr_aio_write+0x0/0x500
 [<c108a93a>] vfs_writev+0x4a/0x60
 [<c108aa21>] sys_writev+0x41/0x80
 [<c138f061>] syscall_call+0x7/0xb
Mem-Info:
DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
Normal per-cpu:
CPU    0: hi:  186, btch:  31 usd:   0
HighMem per-cpu:
CPU    0: hi:  186, btch:  31 usd:   0
active_anon:134651 inactive_anon:50543 isolated_anon:0
 active_file:96881 inactive_file:132007 isolated_file:0
 unevictable:0 dirty:3 writeback:0 unstable:0
 free:91374 slab_reclaimable:6300 slab_unreclaimable:2802
 mapped:2281 shmem:9 pagetables:330 bounce:0
DMA free:3524kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:8kB active_file:8760kB inactive_file:2760kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15868kB mlocked:0kB dirty:0kB writeback:0kB mapped:16kB shmem:0kB slab_reclaimable:88kB slab_unreclaimable:148kB kernel_stack:40kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 865 2016 2016
Normal free:150100kB min:3728kB low:4660kB high:5592kB active_anon:6224kB inactive_anon:15772kB active_file:324084kB inactive_file:325944kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:885944kB mlocked:0kB dirty:12kB writeback:0kB mapped:1520kB shmem:0kB slab_reclaimable:25112kB slab_unreclaimable:11060kB kernel_stack:1888kB pagetables:1320kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 9207 9207
HighMem free:211872kB min:512kB low:1752kB high:2992kB active_anon:532380kB inactive_anon:186392kB active_file:54680kB inactive_file:199324kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1178504kB mlocked:0kB dirty:0kB writeback:0kB mapped:7588kB shmem:36kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA: 3*4kB 65*8kB 35*16kB 18*32kB 11*64kB 9*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3524kB
Normal: 35981*4kB 344*8kB 158*16kB 28*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 150100kB
HighMem: 5732*4kB 5462*8kB 2826*16kB 1598*32kB 84*64kB 10*128kB 7*256kB 1*512kB 1*1024kB 1*2048kB 9*4096kB = 211872kB
231237 total pagecache pages
2340 pages in swap cache
Swap cache stats: add 160060, delete 157720, find 189017/194106
Free swap  = 4179840kB
Total swap = 4194300kB
524271 pages RAM
296946 pages HighMem
5668 pages reserved
867664 pages shared
82155 pages non-shared

Signed-off-by: Sakari Ailus <sakari.ailus@iki.fi>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

e0377e25

Jul 12, 2011

KVM: KVM Steal time guest/host interface · 9ddabbe7

Glauber Costa authored Jul 11, 2011



To implement steal time, we need the hypervisor to pass the guest information
about how much time was spent running other processes outside the VM.
This is per-vcpu, and using the kvmclock structure for that is an abuse
we decided not to make.

In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
holds the memory area address containing information about steal time

This patch contains the headers for it. I am keeping it separate to facilitate
backports to people who wants to backport the kernel part but not the
hypervisor, or the other way around.

Signed-off-by: Glauber Costa <glommer@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Tested-by: Eric B Munson <emunson@mgebm.net>
CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

9ddabbe7

KVM: PPC: Allocate RMAs (Real Mode Areas) at boot for use by guests · aa04b4cc

Paul Mackerras authored Jun 29, 2011

This adds infrastructure which will be needed to allow book3s_hv KVM to
run on older POWER processors, including PPC970, which don't support
the Virtual Real Mode Area (VRMA) facility, but only the Real Mode
Offset (RMO) facility. These processors require a physically
contiguous, aligned area of memory for each guest. When the guest does
an access in real mode (MMU off), the address is compared against a
limit value, and if it is lower, the address is ORed with an offset
value (from the Real Mode Offset Register (RMOR)) and the result becomes
the real address for the access. The size of the RMA has to be one of
a set of supported values, which usually includes 64MB, 128MB, 256MB
and some larger powers of 2.

Since we are unlikely to be able to allocate 64MB or more of physically
contiguous memory after the kernel has been running for a while, we
allocate a pool of RMAs at boot time using the bootmem allocator. The
size and number of the RMAs can be set using the kvm_rma_size=xx and
kvm_rma_count=xx kernel command line options.

KVM exports a new capability, KVM_CAP_PPC_RMA, to signal the availability
of the pool of preallocated RMAs. The capability value is 1 if the
processor can use an RMA but doesn't require one (because it supports
the VRMA facility), or 2 if the processor requires an RMA for each guest.

This adds a new ioctl, KVM_ALLOCATE_RMA, which allocates an RMA from the
pool and returns a file descriptor which can be used to map the RMA. It
also returns the size of the RMA in the argument structure.

Having an RMA means we will get multiple KMV_SET_USER_MEMORY_REGION
ioctl calls from userspace. To cope with this, we now preallocate the
kvm->arch.ram_pginfo array when the VM is created with a size sufficient
for up to 64GB of guest memory. Subsequently we will get rid of this
array and use memory associated with each memslot instead.

This moves most of the code that translates the user addresses into
host pfns (page frame numbers) out of kvmppc_prepare_vrma up one level
to kvmppc_core_prepare_memory_region. Also, instead of having to look
up the VMA for each page in order to check the page size, we now check
that the pages we get are compound pages of 16MB. However, if we are
adding memory that is mapped to an RMA, we don't bother with calling
get_user_pages_fast and instead just offset from the base pfn for the
RMA.

Typically the RMA gets added after vcpus are created, which makes it
inconvenient to have the LPCR (logical partition control register) value
in the vcpu->arch struct, since the LPCR controls whether the processor
uses RMA or VRMA for the guest. This moves the LPCR value into the
kvm->arch struct and arranges for the MER (mediated external request)
bit, which is the only bit that varies between vcpus, to be set in
assembly code when going into the guest if there is a pending external
interrupt request.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>

aa04b4cc

KVM: PPC: Allow book3s_hv guests to use SMT processor modes · 371fefd6

Paul Mackerras authored Jun 29, 2011

This lifts the restriction that book3s_hv guests can only run one
hardware thread per core, and allows them to use up to 4 threads
per core on POWER7. The host still has to run single-threaded.

This capability is advertised to qemu through a new KVM_CAP_PPC_SMT
capability. The return value of the ioctl querying this capability
is the number of vcpus per virtual CPU core (vcore), currently 4.

To use this, the host kernel should be booted with all threads
active, and then all the secondary threads should be offlined.
This will put the secondary threads into nap mode. KVM will then
wake them from nap mode and use them for running guest code (while
they are still offline). To wake the secondary threads, we send
them an IPI using a new xics_wake_cpu() function, implemented in
arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage
we assume that the platform has a XICS interrupt controller and
we are using icp-native.c to drive it. Since the woken thread will
need to acknowledge and clear the IPI, we also export the base
physical address of the XICS registers using kvmppc_set_xics_phys()
for use in the low-level KVM book3s code.

When a vcpu is created, it is assigned to a virtual CPU core.
The vcore number is obtained by dividing the vcpu number by the
number of threads per core in the host. This number is exported
to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes
to run the guest in single-threaded mode, it should make all vcpu
numbers be multiples of the number of threads per core.

We distinguish three states of a vcpu: runnable (i.e., ready to execute
the guest), blocked (that is, idle), and busy in host. We currently
implement a policy that the vcore can run only when all its threads
are runnable or blocked. This way, if a vcpu needs to execute elsewhere
in the kernel or in qemu, it can do so without being starved of CPU
by the other vcpus.

When a vcore starts to run, it executes in the context of one of the
vcpu threads. The other vcpu threads all go to sleep and stay asleep
until something happens requiring the vcpu thread to return to qemu,
or to wake up to run the vcore (this can happen when another vcpu
thread goes from busy in host state to blocked).

It can happen that a vcpu goes from blocked to runnable state (e.g.
because of an interrupt), and the vcore it belongs to is already
running. In that case it can start to run immediately as long as
the none of the vcpus in the vcore have started to exit the guest.
We send the next free thread in the vcore an IPI to get it to start
to execute the guest. It synchronizes with the other threads via
the vcore->entry_exit_count field to make sure that it doesn't go
into the guest if the other vcpus are exiting by the time that it
is ready to actually enter the guest.

Note that there is no fixed relationship between the hardware thread
number and the vcpu number. Hardware threads are assigned to vcpus
as they become runnable, so we will always use the lower-numbered
hardware threads in preference to higher-numbered threads if not all
the vcpus in the vcore are runnable, regardless of which vcpus are
runnable.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>

371fefd6

KVM: PPC: Accelerate H_PUT_TCE by implementing it in real mode · 54738c09

David Gibson authored Jun 29, 2011



This improves I/O performance for guests using the PAPR
paravirtualization interface by making the H_PUT_TCE hcall faster, by
implementing it in real mode.  H_PUT_TCE is used for updating virtual
IOMMU tables, and is used both for virtual I/O and for real I/O in the
PAPR interface.

Since this moves the IOMMU tables into the kernel, we define a new
KVM_CREATE_SPAPR_TCE ioctl to allow qemu to create the tables.  The
ioctl returns a file descriptor which can be used to mmap the newly
created table.  The qemu driver models use them in the same way as
userspace managed tables, but they can be updated directly by the
guest with a real-mode H_PUT_TCE implementation, reducing the number
of host/guest context switches during guest IO.

There are certain circumstances where it is useful for userland qemu
to write to the TCE table even if the kernel H_PUT_TCE path is used
most of the time.  Specifically, allowing this will avoid awkwardness
when we need to reset the table.  More importantly, we will in the
future need to write the table in order to restore its state after a
checkpoint resume or migration.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>

54738c09

KVM: PPC: Add support for Book3S processors in hypervisor mode · de56a948

Paul Mackerras authored Jun 29, 2011

This adds support for KVM running on 64-bit Book 3S processors,
specifically POWER7, in hypervisor mode. Using hypervisor mode means
that the guest can use the processor's supervisor mode. That means
that the guest can execute privileged instructions and access privileged
registers itself without trapping to the host. This gives excellent
performance, but does mean that KVM cannot emulate a processor
architecture other than the one that the hardware implements.

This code assumes that the guest is running paravirtualized using the
PAPR (Power Architecture Platform Requirements) interface, which is the
interface that IBM's PowerVM hypervisor uses. That means that existing
Linux distributions that run on IBM pSeries machines will also run
under KVM without modification. In order to communicate the PAPR
hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
to include/linux/kvm.h.

Currently the choice between book3s_hv support and book3s_pr support
(i.e. the existing code, which runs the guest in user mode) has to be
made at kernel configuration time, so a given kernel binary can only
do one or the other.

This new book3s_hv code doesn't support MMIO emulation at present.
Since we are running paravirtualized guests, this isn't a serious
restriction.

With the guest running in supervisor mode, most exceptions go straight
to the guest. We will never get data or instruction storage or segment
interrupts, alignment interrupts, decrementer interrupts, program
interrupts, single-step interrupts, etc., coming to the hypervisor from
the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
exception entry path so that we don't have to do the KVM test on entry
to those exception handlers.

We do however get hypervisor decrementer, hypervisor data storage,
hypervisor instruction storage, and hypervisor emulation assist
interrupts, so we have to handle those.

In hypervisor mode, real-mode accesses can access all of RAM, not just
a limited amount. Therefore we put all the guest state in the vcpu.arch
and use the shadow_vcpu in the PACA only for temporary scratch space.
We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
We don't have a shared page with the guest, but we still need a
kvm_vcpu_arch_shared struct to store the values of various registers,
so we include one in the vcpu_arch struct.

The POWER7 processor has a restriction that all threads in a core have
to be in the same partition. MMU-on kernel code counts as a partition
(partition 0), so we have to do a partition switch on every entry to and
exit from the guest. At present we require the host and guest to run
in single-thread mode because of this hardware restriction.

This code allocates a hashed page table for the guest and initializes
it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
require that the guest memory is allocated using 16MB huge pages, in
order to simplify the low-level memory management. This also means that
we can get away without tracking paging activity in the host for now,
since huge pages can't be paged or swapped.

This also adds a few new exports needed by the book3s_hv code.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>

de56a948

KVM: PPC: e500: enable magic page · a4cd8b23

Scott Wood authored Jun 14, 2011



This is a shared page used for paravirtualization.  It is always present
in the guest kernel's effective address space at the address indicated
by the hypercall that enables it.

The physical address specified by the hypercall is not used, as
e500 does not have real mode.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>

a4cd8b23

KVM: MMU: Adjust shadow paging to work when SMEP=1 and CR0.WP=0 · 411c588d

Avi Kivity authored Jun 06, 2011



When CR0.WP=0, we sometimes map user pages as kernel pages (to allow
the kernel to write to them).  Unfortunately this also allows the kernel
to fetch from these pages, even if CR4.SMEP is set.

Adjust for this by also setting NX on the spte in these circumstances.

Signed-off-by: Avi Kivity <avi@redhat.com>

411c588d

KVM: Fix KVM_ASSIGN_SET_MSIX_ENTRY documentation · 58f0964e

Jan Kiszka authored Jun 11, 2011



The documented behavior did not match the implemented one (which also
never changed).

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

58f0964e

KVM: Clarify KVM_ASSIGN_PCI_DEVICE documentation · 91e3d71d

Jan Kiszka authored Jun 03, 2011



Neither host_irq nor the guest_msi struct are used anymore today.
Tag the former, drop the latter to avoid confusion.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

91e3d71d

KVM: Fixup documentation section numbering · 7f4382e8

Jan Kiszka authored Jun 02, 2011



Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

7f4382e8

KVM: Document KVM_IOEVENTFD · 55399a02

Sasha Levin authored May 28, 2011



Document KVM_IOEVENTFD that can be used to receive
notifications of PIO/MMIO events without triggering
an exit.

Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

55399a02

KVM: nVMX: Documentation · 823e3965

Nadav Har'El authored May 25, 2011



This patch includes a brief introduction to the nested vmx feature in the
Documentation/kvm directory. The document also includes a copy of the
vmcs12 structure, as requested by Avi Kivity.

[marcelo: move to Documentation/virtual/kvm]

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

823e3965

KVM: Document KVM_GET_LAPIC, KVM_SET_LAPIC ioctl · e7677933

Avi Kivity authored May 11, 2011



Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

e7677933

May 30, 2011

lguest: remove support for VIRTIO_F_NOTIFY_ON_EMPTY. · 990c91f0

Rusty Russell authored May 30, 2011



No virtio device does this any more, so no need to clutter lguest with it.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

990c91f0

lguest: fix up compilation after move · bc805a03

Rusty Russell authored May 30, 2011



ed16648e "Move kvm, uml, and lguest
subdirectories" broke the lguest example launcher.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

bc805a03

May 25, 2011

um: add ucast ethernet transport · 4ff4d8d3

Nolan Leake authored May 24, 2011



The ucast transport is similar to the mcast transport (and, in fact,
shares most of its code), only it uses UDP unicast to move packets.

Obviously this is only useful for point-to-point connections between
virtual ethernet devices.

Signed-off-by: Nolan Leake <nolan@cumulusnetworks.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

4ff4d8d3

May 06, 2011

Correct occurrences of · 61516587

Rob Landley authored May 06, 2011


- Documentation/kvm/ to Documentation/virtual/kvm
- Documentation/uml/ to Documentation/virtual/uml
- Documentation/lguest/ to Documentation/virtual/lguest
throughout the kernel source tree.

Signed-off-by: Rob Landley <rob@landley.net>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>

61516587

Add a 00-INDEX file to Documentation/virtual · 570aa13a

Rob Landley authored May 06, 2011


Remove uml from the top level 00-INDEX file.

Signed-off-by: Rob Landley <rlandley@parallels.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>

570aa13a

Move kvm, uml, and lguest subdirectories under a common "virtual" directory, I.E: · ed16648e

Rob Landley authored May 06, 2011



  cd Documentation
  mkdir virtual
  git mv kvm uml lguest virtual

Signed-off-by: Rob Landley <rlandley@parallels.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>

ed16648e