Until now, the VMM only accepted a single virtio-pmem device
instance. This commit extends the virtio-pmem support by allowing
several devices to be created for a single VM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Add 2 integration tests to validate virtio-pmem works as expected.
One test takes care of checking the ability to read and write to this
persistent memory from the guest, and validates that the data is
carried over the virtualization boundary.
The other test ensures the VM can be booted directly from an image
that would be passed through virtio-pmem.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This patch plumbs the virtio-pmem device to the VMM. By adding a new
command line option "--pmem", we can now expose some persistent memory
to the guest OS, backed by the provided source.
The point of having such support in cloud-hypervisor is to be able to
share some memory between the host and the guest as DAXable.
One interesting use case is to boot directly from an image passed
through virtio-pmem, instead of going through virtio-blk. This can
provide good performance while avoiding the guest cache, which
prevents the VM memory footprint from growing too much.
Fixes #68
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This commit introduces the implementation of the virtio-pmem device
based on the pending proposal of the virtio specification here:
https://lists.oasis-open.org/archives/virtio-dev/201903/msg00083.html
It is also based on the kernel patches coming along with the virtio
proposal: https://lkml.org/lkml/2019/6/12/624
And it is based off of the current crosvm implementation found in
devices/src/virtio/pmem.rs relying on commit
bb340d9a94d48514cbe310d05e1ce539aae31264
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Add some documentation specific to virtio-fs and how to perform
filesystem sharing between host and guest with cloud-hypervisor.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Until now, the VMM only accepted a single virtio-fs device
instance. This commit extends the virtio-fs support by allowing
several devices to be created for a single VM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This commit introduces the testing of the --fs option based on the
virtio-fs implementation. This does not simply add a test, but also
updates the integration script by generating a new kernel embedding
the virtio-fs patches and by downloading the virtiofsd daemon.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In the context of vhost-user, we need the guest RAM to be backed by
a file in order to be accessed by an external process. This patch
adds the new flag "file=" to the "--memory" option so that we can
specify from the command line if the memory needs to be backed, and
by which specific file.
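A minimal sketch of how such an option value might be parsed. Only the
"file=" flag comes from this patch; the "size=" key, the MemoryConfig
struct and parse_memory() are assumptions made for illustration and do
not claim to match the actual parser.

```rust
use std::path::PathBuf;

/// Hypothetical parsed form of the "--memory" option value.
#[derive(Debug, PartialEq)]
pub struct MemoryConfig {
    /// Raw size string; the "size=" key is an assumption of this sketch.
    pub size: Option<String>,
    /// Backing file, present only when "file=" was given.
    pub file: Option<PathBuf>,
}

/// Parse a comma-separated list of key=value pairs.
pub fn parse_memory(arg: &str) -> Result<MemoryConfig, String> {
    let mut cfg = MemoryConfig { size: None, file: None };
    for part in arg.split(',').filter(|p| !p.is_empty()) {
        match part.split_once('=') {
            Some(("size", v)) => cfg.size = Some(v.to_string()),
            Some(("file", v)) => cfg.file = Some(PathBuf::from(v)),
            _ => return Err(format!("unknown memory parameter: {}", part)),
        }
    }
    Ok(cfg)
}
```

When "file=" is absent, the file field stays None and the caller can
fall back to anonymous memory.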
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The user can now share some files and directories with the guest by
providing the corresponding vhost-user socket. The virtiofsd daemon
should be started by the user before starting the VM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The vhost-user-fs or virtio-fs device allows files and directories to
be shared between host and guest. This patch adds the implementation
of this device to the cloud-hypervisor device model.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
To avoid having cloud-hypervisor rely on a pending PR for the empty
crate "vhost", this commit temporarily copies the content of the crate
based on branch jiangliu/v1 18b5081d9199c76eca49da1971c9d1a65e53e5ff.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
BusDevice includes two methods that only apply to PCI devices. They
should be members of the PciDevice trait to keep the high level APIs
clean.
Signed-off-by: Jing Liu <jing2.liu@linux.intel.com>
Based on the newly added code, we expect the split irqchip to be used.
This means we should not see any "timer" or "cascade" components
attached to the IOAPIC since our userspace IOAPIC does not advertise
those.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The previous commit introduced a userspace implementation of an IOAPIC
and this commit aims to plumb it into the cloud-hypervisor VMM.
Here is the list of new things brought by this patch:
- Update the rust-vmm/kvm-ioctls dependency to benefit from latest
patches including the support for split irqchip, and the vector
being returned when a VM exit is caused by an EOI.
- Enable the split irqchip (which means no IOAPIC or PIC is emulated
in kernel). This is done conditionally based on the support of the
TSC_DEADLINE_TIMER from both KVM and the underlying CPU. The
dependency on TSC_DEADLINE_TIMER is related to KVM which does not
support creating the in kernel PIT if it has a split irqchip.
- Rely on callbacks to handle the following use cases:
- in kernel IOAPIC + serial IRQ (pin based)
- in kernel IOAPIC + virtio-pci MSI-X
- in kernel IOAPIC + virtio-pci IRQ (pin based)
- userspace IOAPIC + serial IRQ (pin based)
- userspace IOAPIC + virtio-pci MSI-X
- userspace IOAPIC + virtio-pci IRQ (pin based)
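The condition for enabling the split irqchip can be sketched as
follows. The enum and function names are hypothetical; the two boolean
inputs stand for the KVM capability check and the host CPU feature
check described above.

```rust
/// Which component emulates the IOAPIC (illustrative names).
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum IrqChipMode {
    /// IOAPIC and PIC emulated by KVM.
    InKernel,
    /// Local APIC in KVM, IOAPIC emulated in userspace.
    Split,
}

/// The split irqchip is only selected when both KVM and the CPU support
/// TSC_DEADLINE_TIMER, since the in kernel PIT cannot be created
/// alongside a split irqchip.
pub fn select_irqchip(kvm_has_tsc_deadline: bool, cpu_has_tsc_deadline: bool) -> IrqChipMode {
    if kvm_has_tsc_deadline && cpu_has_tsc_deadline {
        IrqChipMode::Split
    } else {
        IrqChipMode::InKernel
    }
}
```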
Fixes #13
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The goal for cloud-hypervisor is to keep the host safe. With this in
mind, we want to emulate as much as possible in userspace instead of
in kernel directly.
The IOAPIC is a good candidate to move from kernel to userspace, which
is why this commit introduces a userspace implementation of the IOAPIC
82093AA based on the documentation:
https://pdos.csail.mit.edu/6.828/2016/readings/ia32/ioapic.pdf
This code is inspired by the files devices/src/ioapic.rs and
devices/src/split_irqchip_common.rs from the crosvm codebase. The
reference version used is 6c1e23eee3065b3f3d6fc4fb992ac9884dbabf68.
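For readers unfamiliar with the 82093AA, the sketch below illustrates
its indirect register access (IOREGSEL at offset 0x00, IOWIN at offset
0x10). It is not the code added by this commit: the struct and method
names are hypothetical, and read-only bits inside the redirection
entries are not modelled.

```rust
const IOREGSEL_OFF: u64 = 0x00; // register select
const IOWIN_OFF: u64 = 0x10; // data window
const REG_ID: u32 = 0x00;
const REG_VERSION: u32 = 0x01;
const REG_REDIR_BASE: u32 = 0x10;
const NUM_PINS: usize = 24;
const VERSION: u32 = 0x11;

pub struct Ioapic {
    reg_sel: u32,
    id: u32,
    redir: [u64; NUM_PINS],
}

impl Ioapic {
    pub fn new() -> Self {
        // Redirection entries come out of reset with the mask bit (16) set.
        Ioapic { reg_sel: 0, id: 0, redir: [1 << 16; NUM_PINS] }
    }

    pub fn mmio_write(&mut self, offset: u64, value: u32) {
        match offset {
            IOREGSEL_OFF => self.reg_sel = value & 0xff,
            IOWIN_OFF => self.write_reg(value),
            _ => {}
        }
    }

    pub fn mmio_read(&self, offset: u64) -> u32 {
        match offset {
            IOREGSEL_OFF => self.reg_sel,
            IOWIN_OFF => self.read_reg(),
            _ => 0,
        }
    }

    fn read_reg(&self) -> u32 {
        match self.reg_sel {
            REG_ID => self.id,
            // Bits 16-23 hold the index of the last redirection entry.
            REG_VERSION => (((NUM_PINS as u32) - 1) << 16) | VERSION,
            r if (REG_REDIR_BASE..REG_REDIR_BASE + 2 * NUM_PINS as u32).contains(&r) => {
                let idx = ((r - REG_REDIR_BASE) / 2) as usize;
                if (r - REG_REDIR_BASE) % 2 == 0 {
                    self.redir[idx] as u32
                } else {
                    (self.redir[idx] >> 32) as u32
                }
            }
            _ => 0,
        }
    }

    fn write_reg(&mut self, value: u32) {
        match self.reg_sel {
            REG_ID => self.id = value & 0x0f00_0000,
            r if (REG_REDIR_BASE..REG_REDIR_BASE + 2 * NUM_PINS as u32).contains(&r) => {
                let idx = ((r - REG_REDIR_BASE) / 2) as usize;
                if (r - REG_REDIR_BASE) % 2 == 0 {
                    self.redir[idx] = (self.redir[idx] & !0xffff_ffff) | u64::from(value);
                } else {
                    self.redir[idx] = (self.redir[idx] & 0xffff_ffff) | (u64::from(value) << 32);
                }
            }
            _ => {} // version and arbitration registers are read-only
        }
    }
}
```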
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This commit anticipates the future need to support both in kernel and
userspace IOAPIC. The way to signal an interrupt from the serial
device will vary depending on the use case, but this should be
independent from the serial implementation itself.
That's why this patch provides a generic trait for the serial device
to call from, so that it can trigger interrupts independently from the
IOAPIC type chosen (in kernel vs userspace).
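A minimal sketch of the idea, under assumptions: the trait name
Interrupt, its deliver() method and the two implementations are
hypothetical, and atomic counters stand in for the real eventfd write
and userspace IOAPIC call.

```rust
use std::io;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// The only thing the serial device needs to know about interrupts.
pub trait Interrupt: Send {
    fn deliver(&self) -> io::Result<()>;
}

/// Stand-in for the in kernel IOAPIC path, where a real VMM would
/// write to an eventfd registered with KVM.
pub struct KernelIrq {
    pub signaled: Arc<AtomicUsize>,
}

impl Interrupt for KernelIrq {
    fn deliver(&self) -> io::Result<()> {
        self.signaled.fetch_add(1, Ordering::SeqCst);
        Ok(())
    }
}

/// Stand-in for the userspace IOAPIC path, where a real VMM would ask
/// the emulated IOAPIC to service the given pin.
pub struct UserspaceIrq {
    pub pin: usize,
    pub serviced: Arc<AtomicUsize>,
}

impl Interrupt for UserspaceIrq {
    fn deliver(&self) -> io::Result<()> {
        self.serviced.fetch_add(1, Ordering::SeqCst);
        Ok(())
    }
}

/// The serial device only holds a trait object.
pub struct Serial {
    interrupt: Box<dyn Interrupt>,
}

impl Serial {
    pub fn new(interrupt: Box<dyn Interrupt>) -> Self {
        Serial { interrupt }
    }

    /// Queue a byte for the guest and raise the interrupt.
    pub fn queue_input(&mut self, _byte: u8) -> io::Result<()> {
        self.interrupt.deliver()
    }
}
```

The serial code is identical in both cases; only the object passed to
Serial::new() differs.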
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
We need to export the variable DEBIAN_FRONTEND=noninteractive from the
Jenkinsfile if we want to make sure the VM update won't get stuck in
an interactive window.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The VMM may need to load kernel images in different formats to start a
guest. Only ELF loading is currently supported, so add a bzImage
loader for the case where the VMM is given a bzImage.
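As an illustration of how the two formats can be told apart, here is a
hedged sketch. It relies on the ELF magic and on the "HdrS" signature
that the Linux x86 boot protocol places at offset 0x202 of the setup
header. The function and enum names are hypothetical, and the check
does not distinguish a bzImage from an older zImage.

```rust
#[derive(Debug, PartialEq)]
pub enum KernelFormat {
    Elf,
    BzImage,
    Unknown,
}

/// Guess the kernel image format from its first bytes.
pub fn detect_kernel_format(image: &[u8]) -> KernelFormat {
    // ELF files start with 0x7f 'E' 'L' 'F'.
    if image.starts_with(b"\x7fELF") {
        return KernelFormat::Elf;
    }
    // Linux x86 boot protocol: "HdrS" at offset 0x202.
    if image.get(0x202..0x206) == Some(&b"HdrS"[..]) {
        return KernelFormat::BzImage;
    }
    KernelFormat::Unknown
}
```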
Signed-off-by: Cathy Zhang <cathy.zhang@intel.com>
As more CPUID handling and common CpuidPatch code is being added, it's
reasonable to move all the common code to the same place. In the
future we may consider moving it to an individual file when necessary.
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
KVM exposes CPUID 0BH when the host supports it, but the APIC ID that
KVM provides is the host APIC ID, so we need to replace it with ours.
Without this Linux guest reports something like:
[Firmware Bug]: CPU1: APIC id mismatch. Firmware: 1 APIC: 21
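A sketch of such a patch, assuming the guest APIC ID equals the vCPU
index. The CpuidEntry struct and patch_x2apic_id() are hypothetical
stand-ins for the KVM CPUID entry type; the x2APIC ID is reported in
EDX for every sub-leaf of CPUID 0BH.

```rust
/// Simplified stand-in for a KVM CPUID entry.
#[derive(Debug, Clone, Copy, Default, PartialEq)]
pub struct CpuidEntry {
    pub function: u32,
    pub index: u32,
    pub eax: u32,
    pub ebx: u32,
    pub ecx: u32,
    pub edx: u32,
}

/// Overwrite the x2APIC ID in every sub-leaf of CPUID leaf 0BH.
/// Assumption of this sketch: the guest APIC ID is the vCPU index.
pub fn patch_x2apic_id(entries: &mut [CpuidEntry], vcpu_id: u32) {
    for entry in entries.iter_mut().filter(|e| e.function == 0xb) {
        entry.edx = vcpu_id;
    }
}
```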
Fixes #42
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
As mentioned in the KVM documentation, TSC_DEADLINE_TIMER feature
needs some special checks to validate that it is supported as the
cpuid will always report it as disabled.
We need to use the KVM_CHECK_EXTENSION ioctl to request the value
of KVM_CAP_TSC_DEADLINE_TIMER. In case it is supported through
the local APIC emulation provided by the CREATE_IRQCHIP in KVM,
we have to set this feature manually by patching the cpuid.
Quoted from the KVM documentation:
```
The TSC deadline timer feature (CPUID leaf 1, ecx[24]) is always
returned as false, since the feature depends on KVM_CREATE_IRQCHIP
for local APIC support. Instead it is reported via
ioctl(KVM_CHECK_EXTENSION, KVM_CAP_TSC_DEADLINE_TIMER)
if that returns true and you use KVM_CREATE_IRQCHIP, or if you
emulate the feature in userspace, then you can enable the feature
for KVM_SET_CPUID2.
```
This patch implements the behavior described above, and this allows
the VMM to remove the emulated Programmable Interval Timer (PIT) when
the TSC_DEADLINE_TIMER feature can be enabled.
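The patching step can be sketched as below. CpuidLeaf,
enable_tsc_deadline() and needs_pit() are hypothetical names; the
boolean argument stands for the result of
ioctl(KVM_CHECK_EXTENSION, KVM_CAP_TSC_DEADLINE_TIMER).

```rust
/// Simplified stand-in for a CPUID entry, keeping only what is needed.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct CpuidLeaf {
    pub function: u32,
    pub ecx: u32,
}

/// CPUID leaf 1, ecx[24] advertises the TSC deadline timer.
const TSC_DEADLINE_TIMER_ECX_BIT: u32 = 24;

/// Set the feature bit when KVM reports the capability.
/// Returns true if at least one leaf was patched.
pub fn enable_tsc_deadline(leaves: &mut [CpuidLeaf], kvm_cap_present: bool) -> bool {
    if !kvm_cap_present {
        return false;
    }
    let mut patched = false;
    for leaf in leaves.iter_mut().filter(|l| l.function == 1) {
        leaf.ecx |= 1 << TSC_DEADLINE_TIMER_ECX_BIT;
        patched = true;
    }
    patched
}

/// The emulated PIT is only kept when the feature could not be enabled.
pub fn needs_pit(tsc_deadline_enabled: bool) -> bool {
    !tsc_deadline_enabled
}
```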
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Download and build a Linux kernel and use the vmlinux produced as the
kernel used with a direct boot kernel test.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
With slight variations in the kernel the memory size checks can fail so
round down the testing numbers to the nearest multiple of 1000 to make
the tests more stable.
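The rounding itself is a one-liner. A sketch with a hypothetical
helper name:

```rust
/// Round `value` down to the nearest multiple of `step` (step must be
/// non-zero).
pub fn round_down_to(value: u64, step: u64) -> u64 {
    value - value % step
}
```

With a step of 1000, values such as 491234 and 491876 both compare
equal to 491000.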
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Switch the Clear Linux version to a newer release and cache that in an
azure bucket in the same region to improve the CI speed.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Remove some of the kernel configuration options that are not necessary
for manual testing and for testing with the CI in order to reduce the
kernel build time.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
When the KVM capability KVM_CAP_SIGNAL_MSI is not present, the VMM
falls back from MSI-X onto pin based interrupts. Unfortunately, this
was not working as expected because the VirtioPciDevice object was
always creating an MSI-X capability structure in the PCI configuration
space. This was causing the guest drivers to expect MSI-X interrupts
instead of the pin based generated ones.
This patch takes care of avoiding the creation of a dedicated MSI-X
capability structure when MSI is not supported by KVM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
As mentioned in the PCI specification, the Function Mask from the
Message Control Register can be set to prevent a device from injecting
MSI-X messages. This supersedes the vector masking as it interacts at
the device level.
Quoted from the specification:
For MSI and MSI-X, while a vector is masked, the function is prohibited
from sending the associated message, and the function must set the
associated Pending bit whenever the function would otherwise send the
message. When software unmasks a vector whose associated Pending bit is
set, the function must schedule sending the associated message, and
clear the Pending bit as soon as the message has been sent. Note that
clearing the MSI-X Function Mask bit may result in many messages
needing to be sent.
This commit implements the behavior described above by reorganizing
the way the PCI configuration space is being written. It is indeed
important to be able to catch a change in the Message Control
Register without having to implement it for every PciDevice
implementation. Instead, the PciConfiguration has been modified to
take care of handling any update made to this register.
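To make the register layout concrete, here is a hedged sketch of how
a write to the Message Control Register could be decoded. In the
MSI-X capability, bit 15 is MSI-X Enable, bit 14 is Function Mask and
bits 10:0 encode the table size minus one. The type and function
names are hypothetical.

```rust
const MSIX_ENABLE: u16 = 1 << 15;
const FUNCTION_MASK: u16 = 1 << 14;
const TABLE_SIZE_MASK: u16 = 0x07ff;

#[derive(Debug, PartialEq, Clone, Copy)]
pub struct MessageControl {
    pub enabled: bool,
    pub function_masked: bool,
    /// Number of table entries (the register stores N - 1).
    pub table_len: u16,
}

impl MessageControl {
    pub fn from_raw(raw: u16) -> Self {
        MessageControl {
            enabled: raw & MSIX_ENABLE != 0,
            function_masked: raw & FUNCTION_MASK != 0,
            table_len: (raw & TABLE_SIZE_MASK) + 1,
        }
    }

    /// The function may only send messages when MSI-X is enabled and
    /// the Function Mask is clear.
    pub fn can_inject(&self) -> bool {
        self.enabled && !self.function_masked
    }
}

/// True when a register update re-allows injection, in which case the
/// pending vectors need to be looked at.
pub fn must_flush_pending(old_raw: u16, new_raw: u16) -> bool {
    let old = MessageControl::from_raw(old_raw);
    let new = MessageControl::from_raw(new_raw);
    !old.can_inject() && new.can_inject()
}
```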
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The current MSI-X implementation completely ignores the values found
in the Vector Control register related to a specific vector, and never
updates the Pending Bit Array.
According to the PCI specification, MSI-X vectors can be masked
through the Vector Control register on bit 0. If this bit is set,
the device must not inject the MSI message for that vector. Instead,
it must set the bit corresponding to the vector number in the
Pending Bit Array.
Later on, if/when the Vector Control register is updated, and if
bit 0 is flipped from 1 to 0, the device must look into the PBA
to find out if there was a pending interrupt for this specific
vector. If that's the case, an MSI message is injected and the
bit from the PBA is cleared.
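A minimal sketch of this masking logic. It delivers a pending message
when the mask bit is cleared (1 to 0), which is how the PCI
specification defines unmasking. MsixVectors and its methods are
hypothetical names, and the delivered list stands in for the actual
MSI injection.

```rust
pub struct MsixVectors {
    masked: Vec<bool>,
    /// Pending Bit Array, one bit per vector.
    pba: Vec<u64>,
    /// Stand-in for injection: records the vectors that were sent.
    pub delivered: Vec<usize>,
}

impl MsixVectors {
    pub fn new(count: usize) -> Self {
        MsixVectors {
            // Per the specification, vectors are masked after reset.
            masked: vec![true; count],
            pba: vec![0; (count + 63) / 64],
            delivered: Vec::new(),
        }
    }

    pub fn is_pending(&self, vector: usize) -> bool {
        self.pba[vector / 64] & (1u64 << (vector % 64)) != 0
    }

    /// Called when the device wants to raise an interrupt.
    pub fn trigger(&mut self, vector: usize) {
        if self.masked[vector] {
            self.pba[vector / 64] |= 1u64 << (vector % 64);
        } else {
            self.delivered.push(vector);
        }
    }

    /// Called when the guest writes the Vector Control register.
    pub fn write_vector_control(&mut self, vector: usize, value: u32) {
        let was_masked = self.masked[vector];
        self.masked[vector] = value & 1 != 0;
        if was_masked && !self.masked[vector] && self.is_pending(vector) {
            self.pba[vector / 64] &= !(1u64 << (vector % 64));
            self.delivered.push(vector);
        }
    }
}
```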
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
As mentioned in the PCI specification:
If a dedicated Base Address register is not feasible, it is
recommended that a function isolate the MSI-X structures from
the non-MSI-X structures with aligned 8 KB ranges rather than
the mandatory aligned 4 KB ranges.
That's why this patch ensures that each structure present on the
BAR is 8KiB aligned.
It also fixes the MSI-X table and PBA sizes so that they can support
up to 2048 vectors, as specified for MSI-X.
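The arithmetic behind these sizes, as a sketch: each MSI-X table entry
is 16 bytes and the PBA holds one bit per vector. The layout()
helper and its argument are hypothetical and only illustrate the 8KiB
alignment rule.

```rust
const MAX_VECTORS: u64 = 2048;
const TABLE_ENTRY_SIZE: u64 = 16;
const STRUCT_ALIGN: u64 = 0x2000; // 8KiB

/// 2048 entries of 16 bytes: 32KiB.
pub const MSIX_TABLE_SIZE: u64 = MAX_VECTORS * TABLE_ENTRY_SIZE;
/// 2048 pending bits: 256 bytes.
pub const MSIX_PBA_SIZE: u64 = MAX_VECTORS / 8;

/// Round `value` up to a power-of-two alignment.
pub fn align_up(value: u64, align: u64) -> u64 {
    (value + align - 1) & !(align - 1)
}

/// Given the first free BAR offset after the non-MSI-X structures,
/// return (table offset, PBA offset, end of the MSI-X region).
pub fn layout(first_free: u64) -> (u64, u64, u64) {
    let table = align_up(first_free, STRUCT_ALIGN);
    let pba = align_up(table + MSIX_TABLE_SIZE, STRUCT_ALIGN);
    let end = align_up(pba + MSIX_PBA_SIZE, STRUCT_ALIGN);
    (table, pba, end)
}
```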
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
As mentioned in the PCI specification, MSI-X table supports both
DWORD and QWORD accesses:
For all accesses to MSI-X Table and MSI-X PBA fields, software must
use aligned full DWORD or aligned full QWORD transactions; otherwise,
the result is undefined.
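A sketch of a table read accepting both access widths. A table entry
is made of four DWORDs: message address low (0x0), message address
high (0x4), message data (0x8) and vector control (0xc). TableEntry
and read_entry() are hypothetical names, and returning false for
other accesses is a choice of this sketch since the specification
leaves them undefined.

```rust
#[derive(Debug, Default, Clone, Copy)]
pub struct TableEntry {
    pub msg_addr_lo: u32,
    pub msg_addr_hi: u32,
    pub msg_data: u32,
    pub vector_ctl: u32,
}

/// Read from one table entry. `offset` is relative to the entry and
/// `data.len()` gives the access width. Returns false when the access
/// is not an aligned full DWORD or QWORD.
pub fn read_entry(entry: &TableEntry, offset: u64, data: &mut [u8]) -> bool {
    let value: u64 = match (data.len(), offset) {
        (4, 0x0) => u64::from(entry.msg_addr_lo),
        (4, 0x4) => u64::from(entry.msg_addr_hi),
        (4, 0x8) => u64::from(entry.msg_data),
        (4, 0xc) => u64::from(entry.vector_ctl),
        (8, 0x0) => (u64::from(entry.msg_addr_hi) << 32) | u64::from(entry.msg_addr_lo),
        (8, 0x8) => (u64::from(entry.vector_ctl) << 32) | u64::from(entry.msg_data),
        _ => return false,
    };
    let len = data.len();
    data.copy_from_slice(&value.to_le_bytes()[..len]);
    true
}
```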
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In order to reduce the complexity brought by closures, this commit
merges IrqClosure and MsixClosure into a generic InterruptDelivery one.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In order to allow virtio-pci devices to use MSI-X messages instead
of legacy pin based interrupts, this patch implements the MSI-X
support for cloud-hypervisor. The VMM code and virtio-pci bits have
been modified based on the "msix" module previously added to the pci
crate.
Fixes#12
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In order to have access to the newly added signal_msi() function
from the kvm-ioctls crate, this commit updates the version of the
kvm-ioctls to the latest one.
Because set_user_memory_region() has been switched to "unsafe", we
also need to handle this small change in our cloud-hypervisor code
directly.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In order to support MSI-X, this commit adds to the pci crate a new
module called "msix". This module brings all the necessary pieces
to let any PCI device implement MSI-X support.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
We cannot always assume that an irq fd will be the way to send an
IRQ to the guest, which means we cannot assume that every virtio
device implementation should expect an EventFd to trigger an IRQ.
This commit organizes the code related to virtio devices so that it
now expects a Rust closure instead of a known EventFd. This lets the
caller decide what should be done whenever a device needs to trigger
an interrupt to the guest.
The closure will allow for other types of interrupt mechanisms such
as MSI to be implemented. From the device perspective, it could be a
pin based interrupt or an MSI, it does not matter since the device
will simply call into the provided callback, passing the appropriate
Queue as a reference. This design keeps the device model generic.
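A minimal sketch of the callback shape, under assumptions: the Queue
fields, the InterruptCallback alias and the Device type are
hypothetical and only illustrate how a device can stay agnostic of
the interrupt mechanism.

```rust
use std::io;
use std::sync::Arc;

/// Simplified stand-in for a virtqueue.
pub struct Queue {
    pub index: u16,
}

/// The caller decides what triggering an interrupt means.
pub type InterruptCallback = Arc<dyn Fn(&Queue) -> io::Result<()> + Send + Sync>;

pub struct Device {
    queues: Vec<Queue>,
    interrupt_cb: InterruptCallback,
}

impl Device {
    pub fn new(num_queues: u16, interrupt_cb: InterruptCallback) -> Self {
        let queues = (0..num_queues).map(|index| Queue { index }).collect();
        Device { queues, interrupt_cb }
    }

    /// The device does not know whether this ends up as a pin based
    /// interrupt or as an MSI.
    pub fn signal_used_queue(&self, index: usize) -> io::Result<()> {
        (self.interrupt_cb)(&self.queues[index])
    }
}
```

A pin based setup would pass a closure writing to an EventFd, while
an MSI setup could pick a vector based on the Queue it receives.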
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Launch the test binary by command rather than using the vmm layer.
This makes it easier to manage the running VM as you can explicitly kill
it.
Also switch to using credibility for the tests, which catches
assertions, continues with subsequent commands and reports the issues
at the end. This means it is possible to clean up even on failed test
runs.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Add basic integration testing of the hypervisor using a cloud-init to
configure the VM at boot and SSH to control it at runtime.
The initial test just boots the VM, checks some basic resources and
reboots. A second test calls into the first to check that subsequent
tests work correctly.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
When not running on a tty (tested with libc's isatty()) disable stdin
and do not reconfigure the terminal.
This is required to ensure that the VM responds correctly when running
in a headless environment such as Jenkins.
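A sketch of the check, with one substitution stated plainly: it uses
std::io::IsTerminal from the Rust standard library (Rust 1.70 or
later) instead of libc's isatty(). The enum and function names are
hypothetical.

```rust
use std::io::IsTerminal;

#[derive(Debug, PartialEq)]
pub enum ConsoleInput {
    /// stdin is a tty: read from it and reconfigure the terminal.
    Interactive,
    /// stdin is not a tty: leave it alone.
    Disabled,
}

/// Pure decision, kept separate so it can be tested without a tty.
pub fn console_input_for(stdin_is_tty: bool) -> ConsoleInput {
    if stdin_is_tty {
        ConsoleInput::Interactive
    } else {
        ConsoleInput::Disabled
    }
}

/// Query the real stdin of the process.
pub fn detect_console_input() -> ConsoleInput {
    console_input_for(std::io::stdin().is_terminal())
}
```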
Signed-off-by: Rob Bradford <robert.bradford@intel.com>