In case the VM is created from scratch, the devices should be created
after the DeviceManager has been created. But this should not affect the
restore codepath, as in this case the devices should be created as part
of the restore() function.
It's necessary to perform this differentiation as the restore must go
through the following steps:
- Create the DeviceManager
- Restore the DeviceManager with the right state
- Create the devices based on the restored DeviceManager's device tree
- Restore each device based on the restored DeviceManager's device tree
That's why this patch leverages the recent split of the DeviceManager's
creation to achieve what's needed.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This commit performs the split of the DeviceManager's creation into two
separate functions by moving anything related to device's creation after
the DeviceManager structure has been initialized.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
If the current state is paused that means most of the handles got killed by pthread_kill
We need to unpark those threads to make the shutdown worked. Otherwise
The shutdown API hangs and the API is not responding afterwards. So
before the shutdown call we need to resume the VM make it succeed.
Fixes: #817
Signed-off-by: Muminul Islam <muislam@microsoft.com>
Adds DeviceManager method `make_virtio_fs_device` which creates a single
device, and modifies `make_virtio_fs_devices` to use this method.
Implements the new `vm.add-fs route`.
Signed-off-by: Dean Sheather <dean@coder.com>
We can now allow guests that specify an initramfs to boot
using the PVH boot protocol.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
When performing an API boot validate the configuration. For now only
some very basic validation is performed but in subsequent commits
the validation will be extended.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Now that the restore path uses RestoreConfig structure, we add a new
parameter called "prefault" to it. This will give the user the ability
to populate the pages corresponding to the mapped regions backed by the
snapshotted memory files.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The goal here is to move the restore parameters into a dedicated
structure that can be reused from the entire codebase, making the
addition or removal of a parameter easier.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
When CoW can be used, the VM restoration time is reduced, but the pages
are not populated. This can lead to some slowness from the guest when
accessing these pages.
Depending on the use case, we might prefer a slower boot time for better
performances from guest runtime. The way to achieve this is to prefault
the pages in this case, using the MAP_POPULATE flag along with CoW.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Whenever a MemoryManager is restored from a snapshot, the memory regions
associated with it might need to directly back the mapped memory for
increased performances. If that's the case, a list of external regions
is provided and the MemoryManager should simply ignore what's coming
from the MemoryConfig.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The MemoryManager is somehow a special case, as its restore() function
was not implemented as part of the Snapshottable trait. Instead, and
because restoring memory regions rely both on vm.json and every memory
region snapshot file, the memory manager is restored at creation time.
This makes the restore path slightly different from CpuManager, Vcpu,
DeviceManager and Vm, but achieve the correct restoration of the
MemoryManager along with its memory regions filled with the correct
content.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
This is only implementing the send() function in order to store all Vm
states into a file.
This needs to be extended for live migration, by adding more transport
methods, and also the recv() function must be implemented.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
By aggregating snapshots from the CpuManager, the MemoryManager and the
DeviceManager, Vm implements the snapshot() function from the
Snapshottable trait.
And by restoring snapshots from the CpuManager, the MemoryManager and
the DeviceManager, Vm implements the restore() function from the
Snapshottable trait.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Implement the Snapshottable trait for Vcpu, and then implements it for
CpuManager. Note that CpuManager goes through the Snapshottable
implementation of Vcpu for every vCPU in order to implement the
Snapshottable trait for itself.
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
A Snapshottable component can snapshot itself and
provide a MigrationSnapshot payload as a result.
A MigrationSnapshot payload is a map of component IDs to a list of
migration sections (MigrationSection). As component can be made of
several Migratable sub-components (e.g. the DeviceManager and its
device objects), a migration snapshot can be made of multiple snapshot
itself.
A snapshot is a list of migration sections, each section being a
component state snapshot. Having multiple sections allows for easier and
backward compatible migration payload extensions.
Once created, a migratable component snapshot may be transported and this
is what the Transportable trait defines, through 2 methods: send and recv.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Whenever the memory is resized, it's important to retrieve the new
region to pass it down to the device manager, this way it can decide
what to do with it.
Also, there's no need to use a boolean as we can instead use an Option
to carry the information about the region. In case of virtio-mem, there
will be no region since the whole memory has been reserved up front by
the VMM at boot. This means only the ACPI hotplug will return a region
and is the only method that requires the memory to be updated from the
device manager.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Commit 2adddce2 reorganized the crate for a cleaner multi architecture
(x86_64 and aarch64) support.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
For now, the codebase does not support booting from initramfs with PVH
boot protocol, therefore we need to fallback to the legacy boot.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
* load the initramfs File into the guest memory, aligned to page size
* finally setup the initramfs address and its size into the boot params
(in configure_64bit_boot)
Signed-off-by: Damjan Georgievski <gdamjan@gmail.com>
The persistent memory will be hotplugged via DeviceManager and saved in
the config for later use.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This commit adds new option hotplug_method to memory config.
It can set the hotplug method to "acpi" or "virtio-mem".
Signed-off-by: Hui Zhu <teawater@antfin.com>
The persistent memory will be hotplugged via DeviceManager and saved in
the config for later use.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Whenever the VM memory is resized, DeviceManager needs to be notified
so that it can subsequently notify each virtio devices about it.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This feature is stable and there is no need for this to be behind a
flag. This will also reduce the time needed to run the integration test
as we will not be running them all again under the flag.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
I spent a few minutes trying to understand why we were unconditionally
updating the VM config memory size, even if the guest memory resizing
did not happen.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Fill the hvm_start_info and related memory map structures as
specified in the PVH boot protocol. Write the data structures
to guest memory at the GPA that will be stored in %rbx when
the guest starts.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
In order to properly initialize the kvm regs/sregs structs for
the guest, the load_kernel() return type must specify which
boot protocol to use with the entry point address it returns.
Make load_kernel() return an EntryPoint struct containing the
required information. This structure will later be used
in the vCPU configuration methods to setup the appropriate
initial conditions for the guest.
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Add a new id option to the VFIO hotplug command so that it matches the
VFIO coldplug semantic.
This is done by refactoring the existing code for VFIO hotplug, where
VmAddDeviceData structure is replaced by DeviceConfig. This structure is
the one used whenever a VFIO device is coldplugged, which is why it
makes sense to reuse it for the hotplug codepath.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This commit ensures that when a VFIO device is hot-unplugged from the
VM, it is also removed from the VmConfig. This prevents a potential
reboot from creating the device.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This commit introduces the new command "remove-device" that will let a
user hot-unplug a VFIO PCI device from an already running VM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The Vm structure was used to store a strong reference to the IO bus.
This is not needed anymore since the AddressManager is logically the
one holding this strong reference. This has been made possible by the
introduction of Weak references on the Bus structure itself.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The method add_vfio_device() from the DeviceManager needs to be mutable
if we want later to be able to update some internal fields from the
DeviceManager from this same function.
This commit simply takes care of making the necessary changes to change
this function as mutable.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
It's more logical to name the field referring to the DeviceManager as
"device_manager" instead of "devices".
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
By inserting the DeviceManager on the IO bus, we introduced some cyclic
dependency:
DeviceManager ---> AddressManager ---> Bus ---> BusDevice
^ |
| |
+---------------------------------------------+
This cycle needs to be broken by inserting a Weak reference instead of
an Arc (considered as a strong reference).
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Ensures the configuration is updated after a new device has been
hotplugged. In the event of a reboot, this means the new VM will be
started with the new device that had been previously hotplugged.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This commit finalizes the VFIO PCI hotplug support, based on all the
previous commits preparing for it.
One thing to notice, this does not support vIOMMU yet. This means we can
hotplug VFIO PCI devices, but we cannot attach them to an existing or a
new virtio-iommu device.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Whenever the user wants to hotplug a new VFIO PCI device, the VMM will
have to trigger a hotplug notification through the GED device.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This commit introduces the new command "add-device" that will let a user
hotplug a VFIO PCI device to an already running VM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In anticipation of the support for device hotplug, this commit moves the
DeviceManager object into an Arc<Mutex<>> when the DeviceManager is
being created. The reason is, we need the DeviceManager to implement the
BusDevice trait and then provide it to the IO bus, so that IO accesses
related to device hotplug can be handled correctly.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Relying on the latest vm-memory version, including the freshly
introduced structure GuestMemoryAtomic, this patch replaces every
occurrence of Arc<ArcSwap<GuestMemoryMmap> with
GuestMemoryAtomic<GuestMemoryMmap>.
The point is to rely on the common RCU-like implementation from
vm-memory so that we don't have to do it from Cloud-Hypervisor.
Fixes#735
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
It is necessary to do this at the start of the VMM execution rather than
later as it must be done in the main thread in order to satisfy the
checks required by PTRACE_MODE_READ_FSCREDS (see proc(5) and
ptrace(2))
The alternative is to run as CAP_SYS_PTRACE but that has its
disadvantages.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
If the ioctl syscall KVM_CREATE_VM gets interrupted while creating the
VM, it is expected that we should retry since EINTR should not be
considered a standard error.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The build is run against "--all-features", "pci,acpi", "pci" and "mmio"
separately. The clippy validation must be run against the same set of
features in order to validate the code is correct.
Because of these new checks, this commit includes multiple fixes
related to the errors generated when manually running the checks.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Use independent bits for storing whether there is a CPU or memory device
changed when reporting changes via ACPI GED interrupt. This prevents a
later notification squashing an earlier one and ensure that hotplugging
both CPU and memory at the same time succeeds.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
If a new amount of RAM is requested in the VmResize command try and
hotplug if it an increase (MemoryManager::Resize() silently ignores
decreases.)
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Generate and expose the DSDT table entries required to support memory
hotplug. The AML methods call into the MemoryManager via I/O ports
exposed as fields.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
For now the new memory size is only used after a reboot but support for
hotplugging memory will be added in a later commit.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This specifies how much address space should be reserved for hotplugging
of RAM. This space is reserved by adding move the start of the device
area by the desired amount.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
In order to be able to support resizing either vCPUs or memory or both
make the fields in the resize command optional.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This allows us to change the memory map that is being used by the
devices via an atomic swap (by replacing the map with another one). The
ArcSwap provides the mechanism for atomically swapping from to another
whilst still giving good read performace. It is inside an Arc so that we
can use a single ArcSwap for all users.
Not covered by this change is replacing the GuestMemoryMmap itself.
This change also removes some vertical whitespace from use blocks in the
files that this commit also changed. Vertical whitespace was being used
inconsistently and broke rustfmt's behaviour of ordering the imports as
it would only do it within the block.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This removes the need to handle a mutable integer and also centralises
the allocation of these slot numbers.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The memory manager is responsible for setting up the guest memory and in
the long term will also handle addition of guest memory.
In this commit move code for creating the backing memory and populating
the allocator into the new implementation trying to make as minimal
changes to other code as possible.
Follow on commits will further reduce some of the duplicated code.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Since the Snapshotable placeholder and Migratable traits are provided as
well, the DeviceManager object and all its objects are now Migratable.
All Migratable devices are tracked as Arc<Mutex<dyn Migratable>>
references.
Keeping track of all migratable devices allows for implementing the
Migratable trait for the DeviceManager structure, making the whole
device model potentially migratable.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Now that the GED device does not use a hardcoded IRQ number the starting
IRQ number can be restored (needed for the hardcoded serial port IRQ.)
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Move the code for handling the creation of the DSDT entries for devices
into the DeviceManager.
This will make it easier to handle device hotplug and also in the future
remove some hardcoded ACPI constants.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Move the code for generating the MADT (APIC) table and the DSDT
generation for CPU related functionality into the CpuManager.
There is no functional change just code rearrangement.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Previously the device setup code assumed that if no IOAPIC was passed in
then the device should be added to the kernel irqchip. As an earlier
change meant that there was always a userspace IOAPIC this kernel based
code can be removed.
The accessor still returns an Option type to leave scope for
implementing a situation without an IOAPIC (no serial or GED device).
This change does not add support no-IOAPIC mode as the original code did
not either.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Update the configuration after a resize to ensure that after a reboot
the added vCPUs are preserved.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The tty mode remains raw mode when cloud-hypervisor is terminted by
SIGTERM or SIGINT. The terminal is unusable due to echoing is
disabled which is really annoying.
Signed-off-by: Qiu Wenbo <qiuwenbo@phytium.com.cn>
Some physical address bits may become reserved in page table when SME
is enabled on AMD platform. Guest will trigger a reserved bit
violation page fault in this case due to write these reserved bits to 1
in page table. We need reduce the reserved bits to get the right
physical address range.
Signed-off-by: Qiu Wenbo <qiuwenbo@phytium.com.cn>
Add ability to notify via the GED device that there is some new hotplug
activity. This will be used by the CpuManager (and later DeviceManager
itself) to notify of new hotplug activity.
Currently it has a hardcoded IRQ of 5 as the ACPI tables also need to
refer to this IRQ and the IRQ allocation does not permit the allocation
of specific IRQs.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Currently only increasing the number of vCPUs is supported but in the
future it will be extended.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The MADT table contains the details of all the potential vCPUs and
whether they are present at boot (as indicated by the flags field.)
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
When initialising the ACPI tables and configuring the VM use the new
accessor on the CpuManager to get the number of boot vCPUs.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Since the kvm crates now depend on vmm-sys-util, the bump must be
atomic.
The kvm-bindings and ioctls 0.2.0 and 0.4.0 crates come with a few API
changes, one of them being the use of a kvm_ioctls specific error type.
Porting our code to that type makes for a fairly large diff stat.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
In case the VM is started with the flag "--memory mergeable=on", it
means the user expects the guest RAM pages to be marked as mergeable.
This commit relies on the madvise(MADV_MERGEABLE) system call to inform
the host kernel about these pages.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Move CpuManager, Vcpu and related functionality to its own module (and
file) inside the VMM crate
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Pull details of vCPU management (booting, pausing, resuming, shutdown)
into it's own structure. This will ultimately enable this to be moved to
its own file and encapsulate all the vCPU handling for the VMM.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Remove ACPI table creation from arch crate to the vmm crate simplifying
arch::configure_system()
GuestAddress(0) is used to mean no RSDP table rather than adding
complexity with a conditional argument or an Option type as it will
evaluate to a zero value which would be the default anyway.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
We need to rely on the latest kvm-ioctls version to benefit from the
recent addition of unregister_ioevent(), allowing us to detach a
previously registered eventfd to a PIO or MMIO guest address.
Because of this update, we had to modify the current constraint we had
on the vmm-sys-util crate, using ">= 0.1.1" instead of being strictly
tied to "0.2.0".
Once the dependency conflict resolved, this commit took care of fixing
build issues caused by recent modification of kvm-ioctls relying on
EventFd reference instead of RawFd.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In order to reuse the SystemAllocator later at runtime, it is moved into
the new structure AddressManager. The goal is to have a hold onto the
SystemAllocator and both IO and MMIO buses so that we can use them
later.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
We should return an explicit error when the transition from on VM state
to another is invalid.
The valid_transition() routine for the VmState enum essentially
describes the VM state machine.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
In order to pause a VM, we signal all the vCPU threads to get them out
of vmx non-root. Once out, the vCPU thread will check for a an atomic
pause boolean. If it's set to true, then the thread will park until
being resumed.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
So that we don't need to forward an ExitBehaviour up to the VMM thread.
This simplifies the control loop and the VMM thread even further.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
This commit is the glue between the virtio-pci devices attached to the
vIOMMU, and the IORT ACPI table exposing them to the guest as sitting
behind this vIOMMU.
An important thing is the trait implementation provided to the virtio
vrings for each device attached to the vIOMMU, as they need to perform
proper address translation before they can access the buffers.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The virtual IOMMU exposed through virtio-iommu device has a dependency
on ACPI. It needs to expose the device ID of the virtio-iommu device,
and all the other devices attached to this virtual IOMMU. The IDs are
expressed from a PCI bus perspective, based on segment, bus, device and
function.
The guest relies on the topology description provided by the IORT table
to attach devices to the virtio-iommu device.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
We used to have errors definitions spread across vmm, vm, api,
and http.
We now have a cleaner separation: All API routines only return an
ApiResult. All VM operations, including the VMM wrappers, return a
VmResult. This makes it easier to carry errors up to the HTTP caller.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
The linux_loader crate Cmdline struct is not serializable.
Instead of forcing the upstream create to carry a serde dependency, we
simply use a String for the passed command line and build the actual
CmdLine when we need it (in vm::new()).
Also, the cmdline offset is not a configuration knob, so we remove it.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
The kernel path was the only mandatory command line option.
With the addition of the --api-socket option, we can run without a
kernel path and get it later through the API.
Since we can end up with VM configurations that are no longer valid by
default, we need to provide a validation check for it. For now, if the
kernel path is not defined, the VM configuration is invalid.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Using the existing layout module start documenting the major regions of
RAM and those areas that are reserved. Some of the constants have also
been renamed to be more consistent and some functions that returned
constant variables have been replaced.
Future commits will move more constants into this file to make it the
canonical source of information about the memory layout.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
We now start the main VMM thread, which will be listening for VM and IPC
related events.
In order to start the configured VM, we no longer directly call the VM
API but we use the IPC instead, to first create and then start a VM.
Fixes: #303
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
The VMM thread and control loop will be the sole consumer of the
EpollContext and EpollDispatch API, so let's move it to lib.rs.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
As we're going to move the control loop to the VMM thread, the exit and
reset EventFds are no longer going to be owned by the VM.
We pass a copy of them when creating the Vm instead.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
In order to handle the VM STDIN stream from a separate VMM thread
without having to export the DeviceManager, we simply add a console
handling method to the Vm structure.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
In order to transfer the control loop to a separate VMM thread, we want
to shrink the VM control loop to a bare minimum.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Once passed to the VM creation routine, a VmConfig structure is
immutable. We can simply carry a Arc of it instead of a reference.
This also allows us to remove any lifetime bound from our VM.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
The Vmm structure is just a placeholder for the KVM instance. We can
create it directly from the VM creation routine instead.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
We can integrate the kernel loading into the VM start method.
The VM start flow is then: Vm::new() -> vm.start(), which feels more
natural.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Convert Path to PathBuf and remove the associated lifetime.
Now we can remove the VmConfig associated lifetime.
Fixes#298
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Probe for the size of the host physical address range and use that to
establish the address range for the VM. This removes the limitation on
the size of the VM RAM and gives more space for the devices.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
After the 32-bit gap the memory is shared between the devices and the
RAM. Ensure that the ACPI tables correctly indicate where the RAM ends
and the device area starts by patching the precompiled tables. We get
the following valid output now from the PCI bus probing (8GiB guest)
[ 0.317757] pci_bus 0000:00: resource 4 [io 0x0000-0x0cf7 window]
[ 0.319035] pci_bus 0000:00: resource 5 [io 0x0d00-0xffff window]
[ 0.320215] pci_bus 0000:00: resource 6 [mem 0x000a0000-0x000bffff window]
[ 0.321431] pci_bus 0000:00: resource 7 [mem 0xc0000000-0xfebfffff window]
[ 0.322613] pci_bus 0000:00: resource 8 [mem 0x240000000-0xfffffffff window]
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Rather than calling it at the very start of the VM execution (i.e. when
the VCPUs are created) do it as part of the DeviceManager creation.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Rather than sending a signal to the signal handler used for handling
SIGWINCH calls instead use the crate provided termination method. This
also unregisters the signal handler which also means that there won't be
a leaked signal handler remaining.
This leaked signal handler is what was causing a failure to cleanup up
the thread on subsequent requests breaking two reboots in a row.
Fixes: #252
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Refactor out DeviceManager into it's own file. This is part of a bigger
effort to reduce complexity in the vm.rs file but will also allow future
separation to allow making PCI support optional.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
For virtio-fs and virtio-pmem regions of memory are manually mapped into
the address space of the VMM. In order to cleanly reboot we need to
unmap those regions.
Fixes: #223
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Do this by using the same mechanism as the vCPU threads by sending a
signal to the thread. As this is the same mechanism reuse the same code
and rename the "vcpus" member to "threads" to indicate this represents
both the vCPU threads and also the signal handler thread.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Put the ACPI support behind a feature and ensure that the code compiles
without that feature by adding an extra build to Travis.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
As part of the cleanup of the VM shutdown all the vCPU threads. This is
achieved by toggling a shared atomic boolean variable which is checked
in the vCPU loop. To trigger the vCPU code to look at this boolean it is
necessary to send a signal to the vCPU which will interrupt the running
KVM_RUN ioctl.
Fixes: #229
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Sadly only the first few characters of the thread name is preserved so
use a shorter name for the vCPU thread for now. Also give the signal
handling thread a name.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Add an I/O port "device" to handle requests from the kernel to shutdown
or trigger a reboot, borrowing an I/O used for ACPI on the Q35 platform.
The details of this I/O port are included in the FADT
(SLEEP_STATUS_REG/SLEEP_CONTROL_REG/RESET_REG) with the details of the
value to write in the FADT for reset (RESET_VALUE) and in the DSDT for
shutdown (S5 -> 0x05)
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Add a 2nd EventFd to the VM to control resetting (rebooting) the VM this
supplements the EventFd used for managing shutdown of the VM.
The default behaviour on i8042 or triple-fault based reset is currently
unchanged i.e. it will trigger a shutdown.
In order to support restarting the VM it was necessary to make start()
function take a reference to the config.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Only add the ACPI PNP device for the COM1 serial port if it is not
turned off with "--serial off"
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Currently when the VCPU thread exits on an error the VMM continues to
run with no way of shutting down the main thread.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This patch factorizes the existing virtio-fs code by relying onto the
common code part of the vhost_user module in the vm-virtio crate.
In details, it factorizes the vhost-user setup, and reuses the error
types defined by the module instead of defining its own types.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
vhost-user-net introduced a new module vhost_user inside the vm-virtio
crate. Because virtio-fs is actually vhost-user-fs, it belongs to this
new module and needs to be moved there.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The currently directory handling process to open tempfile by
OpenOptions with custom_flags(O_TMPFILE) is workable for tmp
filesystem, but not workable for hugetlbfs, add new directory
handling process which works fine for both tmpfs and hugetlbfs.
Signed-off-by: Cathy Zhang <cathy.zhang@intel.com>
Following the refactoring of the code allowing multiple threads to
access the same instance of the guest memory, this patch goes one step
further by adding RwLock to it. This anticipates the future need for
being able to modify the content of the guest memory at runtime.
The reasons for adding regions to an existing guest memory could be:
- Add virtio-pmem and virtio-fs regions after the guest memory was
created.
- Support future hotplug of devices, memory, or anything that would
require more memory at runtime.
Because most of the time, the lock will be taken as read only, using
RwLock instead of Mutex is the right approach.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The VMM guest memory was cloned (copied) everywhere the code needed to
have ownership of it. In order to clean the code, and in anticipation
for future support of modifying this guest memory instance at runtime,
it is important that every part of the code share the same instance.
Because VirtioDevice implementations need to have access to it from
different threads, that's why Arc must be used in this case.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Latest clippy version complains about our existing code for the
following reasons:
- trait objects without an explicit `dyn` are deprecated
- `...` range patterns are deprecated
- lint `clippy::const_static_lifetime` has been renamed to
`clippy::redundant_static_lifetimes`
- unnecessary `unsafe` block
- unneeded return statement
All these issues have been fixed through this patch, and rustfmt has
been run to cleanup potential formatting errors due to those changes.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
We timestamp the VM creation time, and log the elapsed time between that
instant and the debug ioport events.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
The 0x80 IO port is typically used for BIOS debugging and testing on
bare metal x86 platforms.
We use that port and its dedicated 16 debug codes to time and track the
guest boot process.
Fixes#63
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
When the cache_size parameter from virtio-fs device is not empty, the
VMM creates a dedicated memory region where the shared files will be
memory mapped by the virtio-fs device.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
One of the features of the virtio console device is its size can be
configured and updated. Our first iteration of the console device
implementation is lack of this feature. As a result, it had a
default fixed size which could not be changed. This commit implements
the console config feature and lets us change the console size from
the vmm side.
During the activation of the device, vmm reads the current terminal
size, sets the console configuration accordinly, and lets the driver
know about this configuration by sending an interrupt. Later, if
someone changes the terminal size, the vmm detects the corresponding
event, updates the configuration, and sends interrupt as before. As a
result, the console device driver, in the guest, updates the console
size.
Signed-off-by: A K M Fazla Mehrab <fazla.mehrab.akm@intel.com>
Poor performance was observed when booting kernels with "console=ttyS0"
and the serial port disabled.
This change introduces a "null" console output mode and makes it the
default for the serial console. In this case the serial port
is advertised as per other output modes but there is no input and any
output is dropped.
Fixes: #163
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The structure of the vmm-sys-util crate has changed with lots of code
moving to submodules.
This change adjusts the use of the imported structs to reference the
submodules.
Fixes: #145
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The existing code taking care of the epoll loop was too restrictive as
it was propagating the error returned from the epoll_wait() syscall, no
matter what was the error. This causes the epoll loop to be broken,
leading to the VM termination.
This patch enforces the parsing of the returned error and prevent from
the error propagation in case it is EINTR, which stands for Interrupted.
In case the epoll loop is interrupted, it is appropriate to retry.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In order to fix the clippy error complaining about the number of
arguments passed to a function exceeding the maximum of 7 arguments,
this patch factorizes those parameters into a more global one called
VmInfo.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Since virtio-pmem uses a KVM user memory region, it needs to increment
the slot index in use to prevent from any conflict with further VFIO
allocations (used for mapping mappable memory BARs).
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
We add a Reserved region type at the end of the memory hole to prevent
32-bit devices allocations to overlap with architectural address ranges
like IOAPIC, TSS or APIC ones.
Eventually we should remove that reserved range by allocating all the
architectural ranges before letting 32-bit devices use the memory hole.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
We want to be able to differentiate between memory regions that must be
managed separately from the main address space (e.g. the 32-bit memory
hole) and ones that are reserved (i.e. from which we don't want to allow
the VMM to allocate address ranges.
We are going to use a reserved memory region for restricting the 32-bit
memory hole from expanding beyond the IOAPIC and TSS addresses.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
In order to correctly support multiple VFIO devices, we need to
increment the memory slot index every time it is being used to set some
user memory region through KVM. That's why the mem_slot parameter is
made mutable.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The memory slot index provided to the DeviceManager was wrong since
only the RAM memory regions are set as user memory regions to KVM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
KVM does not support multiple KVM VFIO devices to be created when
trying to support multiple VFIO devices. This commit creates one
global KVM VFIO device being shared with every VFIO device, which
makes possible the support for passing several devices through the
VM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
With the VFIO crate, we can now support directly assigned PCI devices
into cloud-hypervisor guests.
We support assigning multiple host devices, through the --device command
line parameter. This parameter takes the host device sysfs path.
Fixes: #60
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Our DeviceManager::new() routine is reaching north of 250 lines.
For simplicity and readbility sake, extract all virtio devices creation
code into their own routines.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
To use the implemented virtio console device, the users can select one
of the three options ("off", "tty" or "file=/path/to/the/file") with
the command line argument "--console". By default, the console is
enabled as a device named "hvc0" (option: tty). When "off" option is
used, the console device is not added to the VM configuration at all.
Signed-off-by: A K M Fazla Mehrab <fazla.mehrab.akm@intel.com>
With this new AddressAllocator as part of the SystemAllocator, the
VMM can now decide with finer granularity where to place memory.
By allocating the RAM and the hole into the MMIO address space, we
ensure that no memory will be allocated by accident where the RAM or
where the hole is.
And by creating the new MMIO hole address space, we create a subset
of the entire MMIO address space where we can place 32 bits BARs for
example.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
GSI (Global System Interrupt) is an extension of just a linear array of
IRQs. It takes IOAPICs into account for example.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
There is alignment support for AddressAllocator but there are occations
that the alignment is known only when we call allocate(). One example
is PCI BAR which is natually aligned, means for which we have to align
the base address to its size.
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
This is only for allocating the port IO address range.
If a platform does not have PIO devices at all, the address
range will simply be unused.
So, simplify the vm-allocator data structure by making both
MMIO and PIO mandatory.
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
This patch adds the support for both IO and Memory BARs by expecting
the function allocate_bars() to identify the type of each BAR.
Based on the type, register_mapping() insert the address range on the
appropriate bus (PIO or MMIO).
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Add a "--serial" command line that takes as input either "off", "tty"
(default and current behaviour) and "file=/path/to/file".
When "--serial off" is used the serial device is not added to the VM
configuration at all.
Integration tests added that check for interrupts present (or not) and
that when sending to a file the file contains the expected serial
output.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Now that cloud-hypervisor VMM supports virtio-pmem, it can directly
boot a VM from an image exposed as a persistent memory block device.
That's why there is no need to force the --disk option as being
mandatory.
Fixes#90
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Until now, the VMM was only accepting a single instance of virtio-net
device. This commit extends the virtio-net support by allowing several
devices to be created for a single VM.
Fixes#71
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
For every parameter dealing with a size as option, such as memory or
virtio-pmem, the CLI can now parse sizes with the suffixes K, M or G.
Fixes#70
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
PciConfigIo is a legacy pci bus dispatcher, which manages all pci
devices including a pci root bridge. However, it is unnecessary to
design a complex hierarchy which redirects every access by PciRoot.
Since pci root bridge is also a pci device instance, and only contains
easy config space read/write, and PciConfigIo actually acts as a pci bus
to dispatch resource based resolving when VMExit, we re-arrange to make
the pci hierarchy clean.
Signed-off-by: Jing Liu <jing2.liu@linux.intel.com>
Until now, the VMM was only accepting a single instance of virtio-pmem
device. This commit extend the virtio-pmem support by allowing several
devices to be created for a single VM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This patch plumbs the virtio-pmem device to the VMM. By adding a new
command line option "--pmem", we can now expose some persistent memory
to the guest OS, backed by the provided source.
The point of having such support in cloud-hypervisor is to be able to
share some memory between the host and the guest as DAXable.
One interesting use case is to boot directly from an image passed
through virtio-pmem, instead of going through virtio-blk. This can
allow good performances while avoiding the guest cache, which would
prevent the VM memory footprint from growing too much.
Fixes#68
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Until now, the VMM was only accepting a single instance of a virtio-fs
device. This commit extend the virtio-fs support by allowing several
devices to be created for a single VM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In the context of vhost-user, we need the guest RAM to be backed by
a file in order to be accessed by an external process. This patch
adds the new flag "file=" to the "--memory" option so that we can
specify from the command line if the memory needs to be backed, and
by which specific file.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The user can now share some files and directories with the guest by
providing the corresponding vhost-user socket. The virtiofsd daemon
should be started by the user before to start the VM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
BusDevice includes two methods which are only for PCI devices, which should
be as members of PciDevice trait for a better clean high level APIs.
Signed-off-by: Jing Liu <jing2.liu@linux.intel.com>
The previous commit introduced a userspace implementation of an IOAPIC
and this commits aims to plumb it into the cloud-hypervisor VMM.
Here is the list of new things brought by this patch:
- Update the rust-vmm/kvm-ioctls dependency to benefit from latest
patches including the support for split irqchip, and the vector
being returned when a VM exit is caused by an EOI.
- Enable the split irqchip (which means no IOAPIC or PIC is emulated
in kernel). This is done conditionally based on the support of the
TSC_DEADLINE_TIMER from both KVM and the underlying CPU. The
dependency on TSC_DEADLINE_TIMER is related to KVM which does not
support creating the in kernel PIT if it has a split irqchip.
- Rely on callbacks to handle the following use cases:
- in kernel IOAPIC + serial IRQ (pin based)
- in kernel IOAPIC + virtio-pci MSI-X
- in kernel IOAPIC + virtio-pci IRQ (pin based)
- userspace IOAPIC + serial IRQ (pin based)
- userspace IOAPIC + virtio-pci MSI-X
- userspace IOAPIC + virtio-pci IRQ (pin based)
Fixes#13
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This commit anticipate the future need from having support for both
in kernel and userspace IOAPIC. The way to signal an interrupt from
the serial device will vary depending on the use case, but this should
be independent from the serial implementation itself.
That's why this patch provides a generic trait for the serial device
to call from, so that it can trigger interrupts independently from the
IOAPIC type chosen (in kernel vs userspace).
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>