Add config parameter to --disk called "_disable_io_uring" (the
underscore prefix indicating it is not for public consumpion.) Use this
option to disable io_uring if it would otherwise be used.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
In order to do this we must extend the MemoryManager API to add the
ability to specify the tracking of the dirty pages when creating the
userspace mappings and also keep track of the userspace mappings that
have been created for RAM regions.
Currently the dirty pages are collected into ranges based on a block
level of 64 pages. The algorithm could be tweaked to create smaller
ranges but for now if any page in the block of 64 is dirty the whole
block is added to the range.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This also removes the need to lookup up the "exe" symlink for finding
the VMM executable path.
Fixes: #1925
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Now that we have a new dedicated way of asking for a balloon through the
CLI and the REST API, we can move all the balloon code to the device
manager. This allows us to simplify the memory manager, which is already
quite complex.
It also simplifies the behavior of the balloon resizing command. Instead
of providing the expected size for the RAM, which is complex when memory
zones are involved, it now expects the balloon size. This is a much more
straightforward behavior as it really resizes the balloon to the desired
size. Additionally to the simplication, the benefit of this approach is
that it does not need to be tied to the memory manager at all.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The watchdog device is created through the "--watchdog" parameter. At
most a single watchdog can be created per VM.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
When shutting down a VM using VFIO, the following error has been
detected:
vfio-ioctls/src/vfio_device.rs:312 -- Could not delete VFIO group:
KvmSetDeviceAttr(Error(9))
After some investigation, it appears the KVM device file descriptor used
for removing a VFIO group was already closed. This is coming from the
Rust sequence of Drop, from the DeviceManager all the way down to
VfioDevice.
Because the DeviceManager owns passthrough_device, which is effectively
a KVM device file descriptor, when the DeviceManager is dropped, the
passthrough_device follows, with the effect of closing the KVM device
file descriptor. Problem is, VfioDevice has not been dropped yet and it
still needs a valid KVM device file descriptor.
That's why the simple way to fix this issue coming from Rust dropping
all resources is to make Linux accountable for it by duplicating the
file descriptor. This way, even when the passthrough_device is dropped,
the KVM file descriptor is closed, but a duplicated instance is still
valid and owned by the VfioContainer.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Small patch creating a dedicated `block_io_uring_is_supported()`
function for the non-io_uring case, so that we can simplify the
code in the DeviceManager.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Because of the PCI refactoring that happened in the previous commit
d793cc4da3, the ability to fully remove a
PCI device was altered.
The refactoring was correct, but the usage of a generic function to pass
the same reference for both BusDevice, PciDevice and Any + Send + Sync
causes the Arc::ptr_eq() function to behave differently than expected,
as it does not match the references later in the code. That means we
were not able to remove the device reference from the MMIO and/or PIO
buses, which was leading to some bus range overlapping error once we
were trying to add a device again to the previous range that should have
been removed.
Fixes#1802
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The PMEM support has an option called "discard_writes" which when true
will prevent changes to the device from hitting the backing file. This
is trying to be the equivalent of "readonly" support of the block
device.
Previously the memory of the device was marked as KVM_READONLY. This
resulted in a trap when the guest attempted to write to it resulting a
VM exit (and recently a warning). This has a very detrimental effect on
the performance so instead make "discard_writes" truly CoW by mapping
the memory as `PROT_READ | PROT_WRITE` and using `MAP_PRIVATE` to
establish the CoW mapping.
Fixes: #1795
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Unlike x86_64, the "interrupt_controller" in the device manager
for AArch64 is only a `Gic` object that implements the
`InterruptController` to provide the interrupt delivery service.
This is not the real GIC device so that we do not need to save
its states. Also, we do not need to insert it to the device_tree.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
The value of GIC register `GICR_TYPER` is needed in restoring
the GIC states. This commit adds a field in the GIC device struct
and a method to construct its value.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
In AArch64 systems, the state of GIC device can only be
retrieved from `KVM_GET_DEVICE_ATTR` ioctl. Therefore to implement
saving/restoring the GIC states, we need to make sure that the
GIC object (either the file descriptor or the device itself) can
be extracted after the VM is started.
This commit refactors the code of GIC creation by adding a new
field `gic_device_entity` in device manager and methods to set/get
this field. The GIC object can be therefore saved in the device
manager after calling `arch::configure_system`.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
If after the creation of the self-spawned backend, the VMM cannot create
the corresponding vhost-user frontend, the VMM must kill the freshly
spawned process in order to ensure the error propagation can happen.
In case the child process would still be around, the VMM cannot return
the error as it waits onto the child to terminate.
This should help us identify when self-spawned failures are caused by a
connection being refused between the VMM and the backend.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Adding a new field to VirtioMemZone structure, as it lets us associate
with a particular virtio-mem region the amount of memory that should be
plugged in at boot.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This patch simplifies the code as we have one single Option for the
VirtioMemZone. This also prepares for storing additional information
related to the virtio-mem region.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This commit gives the possibility to create a virtio-mem device with
some memory already plugged into it. This is preliminary work to be
able to reboot a VM with the virtio-mem region being already resized.
Signed-off-by: Hui Zhu <teawater@antfin.com>
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Now that virtio-mem device accept a guest NUMA node as parameter, we
retrieve this information from the list of NUMA nodes. Based on the
memory zone associated with the virtio-mem device, we obtain the NUMA
node identifier, which we provide to the virtio-mem device.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Implement support for associating a virtio-mem device with a specific
guest NUMA node, based on the ACPI proximity domain identifier.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
For more consistency and help reading the code better, this commit
renames all 'virtiomem*' variables into 'virtio_mem*'.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Both MemoryManager and DeviceManager are updated through this commit to
handle the creation of multiple virtio-mem devices if needed. For now,
only the framework is in place, but the behavior remains the same, which
means only the memory zone created from '--memory' generates a
virtio-mem region that can be used for resize.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Extract common code for adding devices to the PCI bus into its own
function from the VFIO and VIRTIO code paths.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This removes the dependency of the pci crate on the devices crate which
now only contains the device implementations themselves.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
There will be some cases where the implementation of the snapshot()
function from the Snapshottable trait will require to modify some
internal data, therefore we make this possible by updating the trait
definition with snapshot(&mut self).
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
It is otherwise seems to be able to cause resource conflicts with
Windows APCI_HAL. The OS might do a better job on assigning resources
to this device, withouth them to be requested explicitly. 0xcf8 and
0xcfc are only what is certainly needed for the PCI device enumeration.
Signed-off-by: Anatol Belski <anatol.belski@microsoft.com>
Some OS might check for duplicates and bail out, if it can't create a
distinct mapping. According to ACPI 5.0 section 6.1.12, while _UID is
optional, it becomes required when there are multiple devices with the
same _HID.
Signed-off-by: Anatol Belski <ab@php.net>
This patch added the seccomp_filter module to the virtio-devices crate
by taking reference code from the vmm crate. This patch also adds
allowed-list for the virtio-block worker thread.
Partially fixes: #925
Signed-off-by: Bo Chen <chen.bo@intel.com>
By adding a new io_uring feature gate, we let the user the possibility
to choose if he wants to enable the io_uring improvements or not.
Since the io_uring feature depends on the availability on recent host
kernels, it's better if we leave it off for now.
As soon as our CI will have support for a kernel 5.6 with all the
features needed from io_uring, we'll enable this feature gate
permanently.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In case the host supports io_uring and the specific io_uring options
needed, the VMM will choose the asynchronous version of virtio-blk.
This will enable better I/O performances compared to the default
synchronous version.
This is also important to note the VMM won't be able to use the
asynchronous version if the backend image is in QCOW format.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Cloud Hypervisor allows either the serial or virtio console to output to
TTY, but TTY input is pushed to both.
This is not correct. When Linux guest is configured to spawn TTYs on
both ttyS0 and hvc0, the user effectively issues the same commands twice
in different TTYs.
Fix this by only direct input to the one choice that is using host side
TTY.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
This is a counter exposed via an I/O port that runs at 3.579545MHz. Here
we use a hardcoded I/O and expose the details through the FADT table.
TEST=Boot Linux kernel and see the following in dmesg:
[ 0.506198] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
Signed-off-by: Rob Bradford <robert.bradford@intel.com>