This reverts commit f160572f9d.
There has been increased flakiness around the live migration tests since
this was merged. Speculatively reverting to see if there is increased
stability.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
In order to ensure that the virtio device thread is spawned from the vmm
thread we use an asynchronous activation mechanism for the virtio
devices. This change optimises that code so that we do not need to
iterate through all virtio devices on the platform in order to find the
one that requires activation. We solve this by creating a separate,
short-lived VirtioPciDeviceActivator that holds the required state for the
activation (e.g. the clones of the queues); this can then be stored onto
the device manager ready for asynchronous activation.
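A rough sketch of the idea, with simplified stand-in types rather than the
real virtio ones (the field layout here is an assumption):

    use std::sync::{Arc, Barrier, Mutex};

    // Stand-ins for the real virtio types, only to illustrate the shape.
    struct Queue;
    trait VirtioDevice: Send {
        fn activate(&mut self, queues: Vec<Queue>);
    }

    // Short-lived holder of everything needed to activate one device, stored
    // on the device manager so the vmm thread can activate it without
    // iterating over every virtio device on the platform.
    struct VirtioPciDeviceActivator {
        device: Arc<Mutex<dyn VirtioDevice>>,
        queues: Vec<Queue>,            // clones of the configured queues
        barrier: Option<Arc<Barrier>>, // released once activation is done
    }

    impl VirtioPciDeviceActivator {
        fn activate(&mut self) {
            self.device
                .lock()
                .unwrap()
                .activate(std::mem::take(&mut self.queues));
            if let Some(barrier) = self.barrier.take() {
                barrier.wait();
            }
        }
    }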
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Latest cargo beta version raises warnings about unused macro rules.
Simply remove them to fix the beta build.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
There is no need to include serde_derive separately,
as it can be specified as a serde feature instead.
Signed-off-by: Maksym Pavlenko <pavlenko.maksym@gmail.com>
Explicitly re-export types from the hypervisor specific modules. This
makes it much clearer what common functionality is exposed.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
And thus only export what is necessary through a `pub use`. This is
consistent with some of the other modules and makes it easier to
understand what the external interface of the hypervisor crate is.
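Illustratively (module and type names here are only examples, not the real
crate layout):

    // The hypervisor specific module stays private...
    mod kvm {
        pub struct KvmVm;
        pub struct IrqRoutingEntry;
    }

    // ...and only the intended external interface is re-exported.
    pub use kvm::{IrqRoutingEntry, KvmVm};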
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
By taking advantage of the fact that IrqRoutingEntry is exported by the
hypervisor crate (that is typedef'ed to the hypervisor specific version)
then the code for handling the MsiInterruptManager can be simplified.
This is particularly useful if in the future it is not a typedef but
rather a wrapper type.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This removes the requirement to leak as many data structures from the
hypervisor crate into the vmm crate.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The trait and functionality is about operations on the VM rather than
the VMM so should be named appropriately. This clashed with the
existing struct for the concrete implementation, which was renamed
accordingly.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Whenever going through the codepath of loading a RAW firmware, we always
add an extra RAM region to the guest memory through the memory manager.
But we must be careful to use the updated guest memory rather than a
previous reference that didn't contain the new region, as this can
lead to the following error:
VmBoot(FirmwareLoad(InvalidGuestAddress(GuestAddress(4290772992))))
This is fixed by the current patch, getting the latest reference onto
the guest memory from the memory manager right after the new region has
been added.
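A minimal sketch of the ordering fix, with hypothetical types and method
names standing in for the memory manager API:

    // Hypothetical stand-ins, only to illustrate the ordering.
    struct GuestMemory;
    struct MemoryManager;
    impl MemoryManager {
        fn add_ram_region(&mut self, _start: u64, _size: usize) {}
        fn guest_memory(&self) -> GuestMemory {
            GuestMemory
        }
    }

    fn load_raw_firmware(mm: &mut MemoryManager, firmware: &[u8]) {
        // Add the extra RAM region first (0xffc0_0000 is the address from
        // the error above)...
        mm.add_ram_region(0xffc0_0000, firmware.len());
        // ...then take a *fresh* reference to the guest memory. A reference
        // taken before add_ram_region() would not contain the new region and
        // loading the firmware would fail with InvalidGuestAddress.
        let _guest_memory = mm.guest_memory();
        // ...load the firmware into _guest_memory at the firmware address.
    }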
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This is required when hot-removing a vfio-user device. Details code path
below:
Thread 6 "vcpu0" received signal SIGSYS, Bad system call.
[Switching to Thread 0x7f8196889700 (LWP 2358305)]
0x00007f8196dae7ab in shutdown () at ../sysdeps/unix/syscall-template.S:78
78 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
(gdb) bt
0x00007f8196dae7ab in shutdown () at ../sysdeps/unix/syscall-template.S:78
0x000056189240737d in std::sys::unix::net::Socket::shutdown ()
at library/std/src/sys/unix/net.rs:383
std::os::unix::net::stream::UnixStream::shutdown () at library/std/src/os/unix/net/stream.rs:479
0x000056189210e23d in vfio_user::Client::shutdown (self=0x7f8190014300)
at vfio_user/src/lib.rs:787
0x00005618920b9d02 in <pci::vfio_user::VfioUserPciDevice as core::ops::drop::Drop>::drop (
self=0x7f819002d7c0) at pci/src/vfio_user.rs:551
0x00005618920b8787 in core::ptr::drop_in_place<pci::vfio_user::VfioUserPciDevice> ()
at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/ptr/mod.rs:188
0x00005618920b92e3 in core::ptr::drop_in_place<core::cell::UnsafeCell<dyn pci::device::PciDevice>>
() at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/ptr/mod.rs:188
0x00005618920b9362 in core::ptr::drop_in_place<std::sync::mutex::Mutex<dyn pci::device::PciDevice>> () at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/ptr/mod.rs:188
0x00005618920d8a3e in alloc::sync::Arc<T>::drop_slow (self=0x7f81968852b8)
at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/alloc/src/sync.rs:1092
0x00005618920ba273 in <alloc::sync::Arc<T> as core::ops::drop::Drop>::drop (self=0x7f81968852b8)
at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/alloc/src/sync.rs:1688
0x00005618920b76fb in core::ptr::drop_in_place<alloc::sync::Arc<std::sync::mutex::Mutex<dyn pci::device::PciDevice>>> ()
at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/ptr/mod.rs:188
0x0000561891b5e47d in vmm::device_manager::DeviceManager::eject_device (self=0x7f8190009600,
pci_segment_id=0, device_id=3) at vmm/src/device_manager.rs:4000
0x0000561891b674bc in <vmm::device_manager::DeviceManager as vm_device::bus::BusDevice>::write (
self=0x7f8190009600, base=70368744108032, offset=8, data=&[u8](size=4) = {...})
at vmm/src/device_manager.rs:4625
0x00005618921927d5 in vm_device::bus::Bus::write (self=0x7f8190006e00, addr=70368744108040,
data=&[u8](size=4) = {...}) at vm-device/src/bus.rs:235
0x0000561891b72e10 in <vmm::vm::VmOps as hypervisor::vm::VmmOps>::mmio_write (
self=0x7f81900097b0, gpa=70368744108040, data=&[u8](size=4) = {...}) at vmm/src/vm.rs:378
0x0000561892133ae2 in <hypervisor::kvm::KvmVcpu as hypervisor::cpu::Vcpu>::run (
self=0x7f8190013c90) at hypervisor/src/kvm/mod.rs:1114
0x0000561891914e85 in vmm::cpu::Vcpu::run (self=0x7f819001b230) at vmm/src/cpu.rs:348
0x000056189189f2cb in vmm::cpu::CpuManager::start_vcpu::{{closure}}::{{closure}} ()
at vmm/src/cpu.rs:953
Signed-off-by: Bo Chen <chen.bo@intel.com>
Since both Net and vhost_user::Net implement the Migratable trait, we
can factorize the common part to simplify the code related to the net
creation.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Since both Block and vhost_user::Blk implement the Migratable trait, we
can factorize the common part to simplify the code related to the disk
creation.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Extend the validate() function for both DiskConfig and NetConfig so that
we return an error if a vhost-user-block or vhost-user-net device is
expected to be placed behind the virtual IOMMU. Since these devices
don't support this feature, we can't allow iommu to be set to true in
these cases.
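A sketch of the kind of check added (field and error names are assumptions,
not the exact ones used):

    // Hypothetical config/error types illustrating the validation.
    struct DiskConfig {
        vhost_user: bool,
        iommu: bool,
    }

    #[derive(Debug)]
    enum ValidationError {
        IommuNotSupported,
    }

    impl DiskConfig {
        fn validate(&self) -> Result<(), ValidationError> {
            // vhost-user-block devices cannot sit behind the virtual IOMMU.
            if self.vhost_user && self.iommu {
                return Err(ValidationError::IommuNotSupported);
            }
            Ok(())
        }
    }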
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This is a cleaner approach to handling the I/O port write to 0x80.
Whilst doing this also generate the timestamp at the start of the VM
creation. For consistency use the same timestamp for the ARM equivalent.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
We don't use the VmmOps trait directly for manipulating memory in the
core of the VMM as it's really designed for the MSHV crate to handle
instruction decoding. As I plan to make this trait MSHV specific to
allow reduced locking for MMIO and PIO handling when running on KVM this
use should be removed.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
To correctly map MMIO regions to the guest, we will need to wait for valid
MMIO region information which is generated from 'PciDevice::allocate_bars()'
(as a part of 'DeviceManager::add_pci_device()').
Signed-off-by: Bo Chen <chen.bo@intel.com>
For devices that cannot be named by the user use the "__" prefix to
identify them as internal devices. Check that any identifiers provided
in the config do not clash with those internal names. This prevents the
user from creating a disk such as "__serial" which would then cause a
failure in an unpredictable manner.
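A sketch of the identifier check (names are illustrative):

    const INTERNAL_PREFIX: &str = "__";

    #[derive(Debug)]
    enum ValidationError {
        InvalidIdentifier(String),
    }

    // Reject user supplied identifiers that clash with the reserved internal
    // namespace, e.g. a disk named "__serial".
    fn validate_identifier(id: &str) -> Result<(), ValidationError> {
        if id.starts_with(INTERNAL_PREFIX) {
            return Err(ValidationError::InvalidIdentifier(id.to_owned()));
        }
        Ok(())
    }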
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Whenever a device (virtio, vfio, vfio-user or vdpa) is hotplugged, we
must verify the provided identifier is unique, otherwise we must return
an error.
Particularly, this will prevent issues with identifiers for serial,
console, IOAPIC, balloon, rng, watchdog, iommu and gpio since all of
these are hardcoded by the VMM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
All hotpluggable devices were properly removed from the VmConfig when a
remove-device command was issued, except for the "fs" type. Fix this
lack of support as it is causing the integration tests to fail with the
recent addition of verifying that identifiers are unique.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The device identifiers generated from the DeviceManager were not
guaranteed to be unique since they were not taking into account the list of
identifiers provided through the configuration.
By returning the list of unique identifiers from the configuration, and
by providing it to the DeviceManager, the generation of new identifiers
can rely both on the DeviceTree and the list of IDs from the
configuration.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The socket will be safely deleted on shutdown and so it is not necessary to
delete the API socket when starting the HTTP server.
Fixes: #4026
Signed-off-by: LiHui <andrewli@kubesphere.io>
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Start loading the kernel into the VM as early as possible, in a separate thread.
Whilst it is loading other work can be carried out such as initialising
the devices.
The biggest performance improvement is seen with a more complex set of
devices. If using e.g. four virtio-net devices then the time to start the
kernel improves by 20-30ms. With the simplest configuration the
improvement was of the order of 2-3ms.
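A simplified sketch of the approach (the real code loads into guest memory;
the details are elided here):

    use std::thread;

    // Stand-ins for the real types.
    struct EntryPoint;
    fn load_kernel() -> EntryPoint {
        // ...parse the kernel image and copy it into guest memory...
        EntryPoint
    }

    fn boot() {
        // Kick off the kernel load early, on its own thread...
        let kernel_load = thread::spawn(load_kernel);

        // ...carry out other work (e.g. initialising the devices) in
        // parallel, and only join when the entry point is actually needed.
        let _entry_point = kernel_load.join().unwrap();
    }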
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This allows the same code for generating the kernel command line to be
used on both aarch64 and x86_64 when the latter starts loading the
kernel asynchronously.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This is not required for x86_64 and maintains a tight coupling between
kernel loading and the DeviceManager.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This reverts commit 87eed369cd.
The reason we're reverting this is that OpenAPI Specification[0] doesn't
know how to deal with unsigned types. :-/
Right now the best thing to do is keep it as it is, as an int64, and try to fix
OpenAPI, or even switch to swagger, as the latter knows how to properly
deal with those. However, switching to swagger is far from being a 1:1
transition and will require time to experiment, thus reverting this for
now seems the best approach.
[0]: https://github.com/OAI/OpenAPI-Specification/blob/main/versions/3.1.0.md#data-types
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
The Token Bucket fields are, on the Cloud Hypervisor side, u64.
However, we expose those as int64 in the OpenAPI YAML file.
With that in mind, let's adjust the yaml file to expose those as uint64.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This means that the automatic enabling of the virtio-iommu will also be
applied to VMs created via the API as well as the CLI.
Fixes: #4016
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
If using ACPI based hotplug, memory can only be added, so if the
hotplug RAM size is the same as the boot RAM size then do not include
the memory manager DSDT entries.
Also: this change simplifies the code marginally by making the
HotplugMethod enum Copyable.
This was identified from the following perf output:
1.78% 0.00% vmm cloud-hypervisor [.] <vmm::memory_manager::MemorySlots as acpi_tables::aml::Aml>::append_aml_bytes
|
---<vmm::memory_manager::MemorySlots as acpi_tables::aml::Aml>::append_aml_bytes
<vmm::memory_manager::MemorySlot as acpi_tables::aml::Aml>::append_aml_bytes
acpi_tables::aml::Name::new
<acpi_tables::aml::Path as acpi_tables::aml::Aml>::append_aml_bytes
__libc_malloc
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
No further changes are necessary other than adding a #[derive(Error)] as there
is a manual implementation of Display.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Extend VfioCommon structure to own the MSI interrupt manager. This will
be useful for implementing the restore code path.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This carries a string that is exposed via DMI/SMBIOS and is particularly
useful for cloud-init initialisation.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Use a single enum member for representing errors from the internal API.
This avoids the ugly duplication of the API call name in the error
message:
e.g.
$ target/debug/ch-remote --api-socket /tmp/api resize --cpus 2
Error running command: Server responded with an error: InternalServerError: VmResize(VmResize(CpuManager(DesiredVCpuCountExceedsMax)))
Becomes:
$ target/debug/ch-remote --api-socket /tmp/api resize --cpus 2
Error running command: Server responded with an error: InternalServerError: ApiError(VmResize(CpuManager(DesiredVCpuCountExceedsMax)))
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Instead of defining some very generic resources as PioAddressRange or
MmioAddressRange for each PCI BAR, let's move to the new Resource type
PciBar in order to make things clearer. This allows the code to be
more readable, but also removes the need for hard assumptions about the
MMIO and PIO ranges. PioAddressRange and MmioAddressRange types can be
used to describe everything except PCI BARs. BARs are very special as
they can be relocated and have special information we want to carry
along with them.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In order to make the code more consistent and easier to read, we remove
the former tuple that was used to describe a BAR, replacing it with the
existing structure PciBarConfiguration.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
By factorizing the algorithm untangling TDVF sections from guest RAM
into a dedicated function, we can write some unit tests to validate it
properly achieves what we expect.
Adding the "tdx" feature to the unit tests, otherwise it wouldn't get
tested.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
By adding a new method id() to the PciDevice trait, we allow the caller
to retrieve a unique identifier. This is used in the context of BAR
relocation to identify the device being relocated, so that we can update
the DeviceTree resources for all PCI devices (and not only
VirtioPciDevice).
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
By returning the new PCI resources from add_pci_device(), we allow the
factorization of the code translating the BARs into resources. This
allows VIRTIO, VFIO and vfio-user to add the resources to the DeviceTree
node.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Relying on the function introduced recently to get the PCI resources and
handle the restore case, both VFIO and vfio-user device creation paths
now have access to PCI resources, which can be provided to the function
add_pci_device().
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Create a dedicated function for getting the PCI segment, b/d/f and
optional resources. This is meant for handling the potential case of a
restore.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Updating the way of restoring BAR addresses for virtio-pci by providing
a more generic approach that will be reused for other PciDevice
implementations (i.e. VfioPciDevice and VfioUserPciDevice).
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The calls to these functions are always preceded by a call to
InterruptSourceGroup::update(). By adding a masked boolean to that
function call, it is possible to remove 50% of the calls to the
KVM_SET_GSI_ROUTING ioctl as the update will correctly handle the
masked or unmasked case.
This causes the ioctl to disappear from the perf report for a boot of
the VM.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
EDK2 execution requires a flash device at address 0.
The newly added device is not a fully functional flash. It doesn't
implement any spec of a flash device. Instead, a piece of memory is
simply used to simulate the flash.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
Rust 2021 edition has a few improvements over the 2018 edition. Migrate
the project to 2021 edition by following recommended migration steps.
Luckily, the code itself doesn't require fixing.
Bump MSRV to 1.56 as it is required by the 2021 edition. Also fix the
clap build dependency to make Cloud Hypervisor build again.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
This is a refactoring commit to simplify source code.
Removed some functions that only return a layout const.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
Some addresses defined in `layout.rs` were of type `GuestAddress`, while
others were `u64`. Now align the types of all the `*_START` definitions to
`GuestAddress`.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
The reserved space is for devices.
Some devices (like TPM) require arbitrary addresses close to 4GiB.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
`RAM_64BIT_START` was set to 1 GiB, not a real 64-bit address. Now
rename it `RAM_START` to avoid confusion.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
Add a new iommu parameter to VdpaConfig in order to place the vDPA
device behind a virtual IOMMU.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The list of memory resources provided through the HOB wasn't accurate
because of the broken logic. The fix provides correct ranges to the
firmware.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Based on latest QEMU patches from branch tdx-qemu-2022.03.29-v7.0.0-rc1
we should only report as memory resources the TempMem sections from TDVF
sections.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The introduction of an error if live resizing is not possible is a
regression compared to the original behaviour where the new size would
be stored in the config and reflected in the next boot. This behaviour
was also inconsistent with the effect of resizing with no VM booted.
Instead of generating an error allow the code to go ahead and update the
config so that the new size will be available upon the reboot.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Similarly to the previous commit restricting the cpu resizing error only
to the situations where the vcpu amount has changed, let's do the same
with the memory and be consistent throughout our code base.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
188078467d made clear that resize should
only happen when dealing with a "dynamic" CpuManager. Although this is
very much correct, it causes a regression on Kata Containers (and on any
other consumer of Cloud Hypervisor) in cases where a resize would be
triggered but the vCPUs values wouldn't be changed.
There's no doubt Kata Containers could do better and not call a
resize in such situations, and that's something that should **also** be
solved there. However, we should also work around this on the Cloud
Hypervisor side as it introduces a regression with the current Kata
Containers code.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
By enabling the VIRTIO feature VIRTIO_F_IOMMU_PLATFORM for all
vhost-user devices when needed, we force the guest to use the DMA API,
making these devices compatible with TDX. By using DMA API, the guest
triggers the TDX codepath to share some of the guest memory, in
particular the virtqueues and associated buffers so that the VMM and
vhost-user backends/processes can access this memory.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
If EFI reset fails on the Linux kernel then it will fall through to CMOS
reset. Implement this as one of our reset solutions.
Fixes: #3912
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Compile this feature in by default as it's well supported on both
aarch64 and x86_64 and we only officially support using it (no non-acpi
binaries are available.)
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
AMX is an x86 extension adding hardware units for matrix
operations (int and float dot products). The goal of the extension is
to provide performance enhancements for these common operations.
On Linux, AMX requires requesting the permission from the kernel prior
to use. Guests wanting to make use of the feature need to have the
request made prior to starting the VM.
This change then adds the first --cpus features option, amx, which when
passed will enable AMX usage for guests (needs a 5.17+ kernel) or
exit with failure.
The activation is done in the CpuManager of the VMM thread as it
allows migration and snapshot/restore to work fairly painlessly for
AMX enabled workloads.
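The permission request boils down to an arch_prctl(2) call made from the VMM
before the vCPUs are created; roughly (a sketch, with the constant values
taken from the Linux UAPI):

    // Request AMX (XTILEDATA) guest permission from the kernel (5.17+).
    const ARCH_REQ_XCOMP_GUEST_PERM: libc::c_ulong = 0x1025;
    const XFEATURE_XTILEDATA: libc::c_ulong = 18;

    fn request_amx_guest_permission() -> std::io::Result<()> {
        // SAFETY: plain syscall taking only integer arguments.
        let ret = unsafe {
            libc::syscall(
                libc::SYS_arch_prctl,
                ARCH_REQ_XCOMP_GUEST_PERM,
                XFEATURE_XTILEDATA,
            )
        };
        if ret != 0 {
            return Err(std::io::Error::last_os_error());
        }
        Ok(())
    }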
Signed-off-by: William Douglas <william.douglas@intel.com>
Disable the DAX feature from the virtio-fs implementation as the feature
is still not stable. The feature is deprecated, meaning the 'dax'
parameter will be removed in about 2 release cycles.
In the meantime, the parameter value is ignored and forced to be
disabled.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
When running non-dynamic or with virtio-mem for hotplug the ACPI
functionality should not be included in the DSDT, nor does the
MemoryManager need to be placed on the MMIO bus.
Fixes: #3883
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This is now consistent with not supplying the _CRS for the device when
CpuManager is not dynamic.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Rather than just printing a message return an error back through the API
if the user attempts to hotplug a device that supports being behind an
IOMMU where that device isn't placed on an IOMMU segment.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Ensure devices that are specified to be on a PCI segment that is behind
the IOMMU are IOMMU enabled if possible or error out for those devices
that do not support it.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Previously it was not possible to enable vIOMMU for a virtio device.
However with the ability to place an entire PCI segment behind the
IOMMU the IOMMU mapping needs to be set up for the virtio device if it is
behind the IOMMU.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This can already be calculated by summing the tables reported by the
Linux kernel but this is more convenient.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Separate the destruction and cleanup of original VM and the creation of
the new one. In particular have a clear hand off point for resources
(e.g. reset EventFd) used by the new VM from the original. In the
situation where vm.shutdown() generates an error this also avoids the
Vmm's reference to the Vm (self.vm) being maintained.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Based on the newly added Vdpa device along with the new vdpa parameter,
this patch enables the support for vDPA devices.
It's important to note this is the only virtio device for which we provide
an ExternalDmaMapping instance. This will allow for the right DMA ranges
to be mapped/unmapped.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Introduce a new --vdpa parameter associated with a VdpaConfig for the
future creation of a Vdpa device.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This will significantly reduce the size of the DSDT and the effort
required to parse it if there is no requirement to support
hotplug/unplug.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
If the CpuManager is dynamic then vCPUs can be
hotplugged/unplugged.
Since TDX does not support CPU hotplug this is currently the only
determinant of whether the CpuManager is dynamic.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
vmm.ping/vm.info will hang for the PUT method, and vm.create/vmm.shutdown will hang for the GET method.
This is because these four APIs do not write the response body when the HTTP method does not match.
Signed-off-by: LiHui <andrewli@kubesphere.io>
In case the virtio device which requires DMA mapping is placed behind a
virtual IOMMU, we shouldn't map/unmap any region manually. Instead, we
provide the DMA handler to the virtio-iommu device so that it can
trigger the proper mappings.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
If a virtio device is associated with a DMA handler, the DMA mapping and
unmapping is performed from the device manager through the handler.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Given that some virtio devices might need DMA handling, we provide a
way to store this through the VirtioPciDevice layer, so that it can be
accessed when the PCI device is removed.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In anticipation for handling potential DMA mapping/unmapping operations for a
virtio device, we extend the MetaVirtioDevice with an additional field
that holds an optional DMA handler.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The tuple of information related to each virtio device is too big, and
it's better to factorize it through a dedicated structure.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
When masking an MSI IRQ, we set entry.masked to true, so the KVM
hypervisor will not pass the GSI to the kernel through the KVM_SET_GSI_ROUTING
ioctl which updates kvm->irq_routing. This will trigger a kernel
panic on the AMD platform when the GSI is the largest one in the kernel's
kvm->irqfds.items:
crash> bt
PID: 22218 TASK: ffff951a6ad74980 CPU: 73 COMMAND: "vcpu8"
#0 [ffffb1ba6707fa40] machine_kexec at ffffffff8565b397
#1 [ffffb1ba6707fa90] __crash_kexec at ffffffff85788a6d
#2 [ffffb1ba6707fb58] crash_kexec at ffffffff8578995d
#3 [ffffb1ba6707fb70] oops_end at ffffffff85623c0d
#4 [ffffb1ba6707fb90] no_context at ffffffff856692c9
#5 [ffffb1ba6707fbf8] exc_page_fault at ffffffff85f95b51
#6 [ffffb1ba6707fc50] asm_exc_page_fault at ffffffff86000ace
[exception RIP: svm_update_pi_irte+227]
RIP: ffffffffc0761b53 RSP: ffffb1ba6707fd08 RFLAGS: 00010086
RAX: ffffb1ba6707fd78 RBX: ffffb1ba66d91000 RCX: 0000000000000001
RDX: 00003c803f63f1c0 RSI: 000000000000019a RDI: ffffb1ba66db2ab8
RBP: 000000000000019a R8: 0000000000000040 R9: ffff94ca41b82200
R10: ffffffffffffffcf R11: 0000000000000001 R12: 0000000000000001
R13: 0000000000000001 R14: ffffffffffffffcf R15: 000000000000005f
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffffb1ba6707fdb8] kvm_irq_routing_update at ffffffffc09f19a1 [kvm]
#8 [ffffb1ba6707fde0] kvm_set_irq_routing at ffffffffc09f2133 [kvm]
#9 [ffffb1ba6707fe18] kvm_vm_ioctl at ffffffffc09ef544 [kvm]
RIP: 00007f143c36488b RSP: 00007f143a4e04b8 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 00007f05780041d0 RCX: 00007f143c36488b
RDX: 00007f05780041d0 RSI: 000000004008ae6a RDI: 0000000000000020
RBP: 00000000000004e8 R8: 0000000000000008 R9: 00007f05780041e0
R10: 00007f0578004560 R11: 0000000000000246 R12: 00000000000004e0
R13: 000000000000001a R14: 00007f1424001c60 R15: 00007f0578003bc0
ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b
To solve this problem, move route.disable() before set_gsi_routes() to
remove the gsi from irqfds.items first.
This problem only exists on the AMD platform, because on the Intel platform
the kernel just returns when updating the IRTE, while it only prints a
warning on AMD.
Also, this patch adjusts the order of enable() and set_gsi_routes() in
unmask(), which should do no harm.
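A sketch of the resulting ordering in mask()/unmask() (stand-in types; only
the call ordering matters here):

    // Stand-ins to illustrate the call ordering only.
    struct Route;
    impl Route {
        fn enable(&self) {}
        fn disable(&self) {}
    }
    struct MsiInterruptGroup {
        route: Route,
    }
    impl MsiInterruptGroup {
        fn set_gsi_routes(&self) {}

        fn mask(&self) {
            // Unregister the irqfd *before* updating the routing so the GSI
            // is removed from kvm->irqfds.items first.
            self.route.disable();
            self.set_gsi_routes();
        }

        fn unmask(&self) {
            // Mirror ordering: update the routing first, then enable the
            // irqfd.
            self.set_gsi_routes();
            self.route.enable();
        }
    }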
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
When masking an MSI IRQ, we set entry.masked to true, so the KVM
hypervisor will not pass the GSI to the kernel through the KVM_SET_GSI_ROUTING
ioctl which updates kvm->irq_routing. This will trigger a kernel
panic on the AMD platform when the GSI is the largest one in the kernel's
kvm->irqfds.items:
crash> bt
PID: 22218 TASK: ffff951a6ad74980 CPU: 73 COMMAND: "vcpu8"
#0 [ffffb1ba6707fa40] machine_kexec at ffffffff8565b397
#1 [ffffb1ba6707fa90] __crash_kexec at ffffffff85788a6d
#2 [ffffb1ba6707fb58] crash_kexec at ffffffff8578995d
#3 [ffffb1ba6707fb70] oops_end at ffffffff85623c0d
#4 [ffffb1ba6707fb90] no_context at ffffffff856692c9
#5 [ffffb1ba6707fbf8] exc_page_fault at ffffffff85f95b51
#6 [ffffb1ba6707fc50] asm_exc_page_fault at ffffffff86000ace
[exception RIP: svm_update_pi_irte+227]
RIP: ffffffffc0761b53 RSP: ffffb1ba6707fd08 RFLAGS: 00010086
RAX: ffffb1ba6707fd78 RBX: ffffb1ba66d91000 RCX: 0000000000000001
RDX: 00003c803f63f1c0 RSI: 000000000000019a RDI: ffffb1ba66db2ab8
RBP: 000000000000019a R8: 0000000000000040 R9: ffff94ca41b82200
R10: ffffffffffffffcf R11: 0000000000000001 R12: 0000000000000001
R13: 0000000000000001 R14: ffffffffffffffcf R15: 000000000000005f
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffffb1ba6707fdb8] kvm_irq_routing_update at ffffffffc09f19a1 [kvm]
#8 [ffffb1ba6707fde0] kvm_set_irq_routing at ffffffffc09f2133 [kvm]
#9 [ffffb1ba6707fe18] kvm_vm_ioctl at ffffffffc09ef544 [kvm]
RIP: 00007f143c36488b RSP: 00007f143a4e04b8 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: 00007f05780041d0 RCX: 00007f143c36488b
RDX: 00007f05780041d0 RSI: 000000004008ae6a RDI: 0000000000000020
RBP: 00000000000004e8 R8: 0000000000000008 R9: 00007f05780041e0
R10: 00007f0578004560 R11: 0000000000000246 R12: 00000000000004e0
R13: 000000000000001a R14: 00007f1424001c60 R15: 00007f0578003bc0
ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b
To solve this problem, move route.disable() before set_gsi_routes() to
remove the gsi from irqfds.items first.
This problem only exists on the AMD platform, because on the Intel platform
the kernel just returns when updating the IRTE, while it only prints a
warning on AMD.
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Move to release version v0.2.0 for both vm-virtio and vhost-user-backend
crates rather than relying on their main branch, as they might be
subject to breaking changes.
Fixes #3800
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Add a field for its length and fix up users.
Things work just because all hardcoded values agree with each other.
This is prone to breakage.
No functional change.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
This commit adds event fds and the event handler to send/receive
requests and responses from the GDB thread. It also adds `--gdb` flag to
enable GDB stub feature.
Signed-off-by: Akira Moroo <retrage01@gmail.com>
This commit adds `stop_on_boot` to `Vm` so that the VM stops before
starting on boot requested. This change is required to keep the target
VM stopped before a debugger attached as the user expected.
Signed-off-by: Akira Moroo <retrage01@gmail.com>
This commit adds `Vm::debug_request` to handle `GdbRequestPayload`,
which will be sent from the GDB thread.
Signed-off-by: Akira Moroo <retrage01@gmail.com>
This commit adds initial gdb.rs implementation for `Debuggable` trait to
describe a debuggable component. Some parts of the trait bound
implementations are based on the crosvm GDB stub code [1].
[1] https://github.com/google/crosvm/blob/main/src/gdb.rs
Signed-off-by: Akira Moroo <retrage01@gmail.com>
This commit adds `KVM_SET_GUEST_DEBUG` and `KVM_TRANSLATE` ioctls to
seccomp filter to enable guest debugging without `--seccomp=false`.
Signed-off-by: Akira Moroo <retrage01@gmail.com>
This commit adds `VmState::BreakPoint` to handle hardware breakpoint.
The VM will enter this state when a breakpoint hits or a debugger
interrupts the execution.
Signed-off-by: Akira Moroo <retrage01@gmail.com>
42b5d4a2f7 has changed how the PciBdf
field of a DeviceNode is represented (from an int32 to its own struct).
To avoid marshalling / demarshalling issues for the projects relying on
the openapi auto generated code, let's propagate the change, updating
the yaml file accordingly.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
The `Dies per package` setting of the vCPU topology does not apply on AArch64.
Now we only accept a value of `1`. This way we can make the `dies` field
transparent, avoiding it from impacting the topology setting.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
Based on the helpers from the hypervisor crate, the VMM can identify
what type of hypercall has been issued through the KVM_EXIT_TDX reason.
For now, we only log warnings and set the status to INVALID_OPERAND
since these hypercalls aren't supported. The proper handling will be
implemented later.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Since the object returned from CpuManager.create_vcpu() is never used,
we can avoid the cloning of this object.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
By having the DeviceNode storing a PciBdf, we simplify the internal code
as well as allow for custom Serialize/Deserialize implementation for the
PciBdf structure. These custom implementations let us display the PCI
s/b/d/f in a human readable format.
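For instance, a hand-written Serialize impl along these lines (a sketch; the
real field layout may differ) renders the segment/bus/device/function in a
readable form such as "0000:00:04.0":

    use serde::{Serialize, Serializer};
    use std::fmt;

    struct PciBdf {
        segment: u16,
        bus: u8,
        device: u8,
        function: u8,
    }

    impl fmt::Display for PciBdf {
        fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
            write!(
                f,
                "{:04x}:{:02x}:{:02x}.{:01x}",
                self.segment, self.bus, self.device, self.function
            )
        }
    }

    impl Serialize for PciBdf {
        fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
            serializer.serialize_str(&self.to_string())
        }
    }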
Fixes #3711
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
As we've added support for cold adding devices to a VM that was created
but not already started, we should propagate the `204` response
generated on those cases to the yaml file, so openapi-generator can
produce the correct client code on the go side, to handle both `200` and
`204` successful results.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Instead of erroring out when trying to change the configuration of the
VM when the VM has been created but not yet booted, let's allow
users to change that without any issue, as long as the VM has already
been created.
Fixes: #3639
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Let's add very basic unit tests for the vm_add_$device() functions, so we can
easily expand those when changing its behaviour in the coming commits.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Instead of doing the validation of the configuration change as part of
the vm, let's do this in the upper layer, in the Vmm.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Let's move add_to_config to config.rs so it can be used from both inside
and outside of the vm.rs file.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
TDX support has been present in the project for quite some time, but
the TDX configuration was not yet exposed to the ones using CH via the
OpenAPI auto-generated code.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Since the devices behind the IOMMU cannot be changed at runtime we offer
the ability to place all devices on user chosen segments behind the
IOMMU. This allows the hotplugging of devices behind the IOMMU provided
that they are assigned to a segment that is located behind the iommu.
Fixes: #911
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Adding a new parameter free_page_reporting=on|off to the balloon device
so that we can enable the corresponding feature from virtio-balloon.
Running a VM with a balloon device where this feature is enabled allows
the guest to report pages that are free from guest's perspective. This
information is used by the VMM to release the corresponding pages on the
host.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In order to allow for human readable output for the VM configuration, we
pull it out of the snapshot, which becomes effectively the list of
states from the VM. The configuration is stored through a dedicated file
in JSON format (not including any binary output).
Having the ability to read and modify the VM configuration manually
between the snapshot and restore phases makes debugging easier, as well
as empowering users to extend the use cases relying on the
snapshot/restore feature.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
As per this kernel documentation:
For KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_OSI, KVM_EXIT_PAPR, KVM_EXIT_XEN,
KVM_EXIT_EPR, KVM_EXIT_X86_RDMSR and KVM_EXIT_X86_WRMSR the corresponding
operations are complete (and guest state is consistent) only after userspace
has re-entered the kernel with KVM_RUN. The kernel side will first finish
incomplete operations and then check for pending signals.
The pending state of the operation is not preserved in state which is
visible to userspace, thus userspace should ensure that the operation is
completed before performing a live migration. Userspace can re-enter the
guest with an unmasked signal pending or with the immediate_exit field set
to complete pending operations without allowing any further instructions
to be executed.
Since we capture the state as part of the pause and override it as part
of the resume we must ensure the state is consistent otherwise we will
lose the results of the MMIO or PIO operation that caused the exit from
which we paused.
Fixes: #3658
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
If a payload is found in the TDVF section, and after it's been copied to
the guest memory, make sure to create the corresponding TdPayload
structure and insert it through the HOB.
Signed-off-by: Jiaqi Gao <jiaqi.gao@intel.com>
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In case of TDX, if a kernel and/or a command line are provided by the
user, they can't be treated the same way as for the non-TDX case. That
is why this patch ensures the function load_kernel() is only invoked for
the non-TDX case.
For the TDX case, whenever TDVF contains a Payload and/or PayloadParam
sections, the file provided through --kernel and the parameters provided
through --cmdline are copied at the locations specified by each TDVF
section.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The TDVF specification has been updated with the ability to provide a
specific payload, which means we will be able to achieve direct kernel
boot.
For that reason, let's not prevent the user from using --kernel
parameter when running with TDX.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Make sure Cloud Hypervisor relies on upstream and actively maintained
vfio-ioctls crate from the rust-vmm/vfio repository instead of the
deprecated version coming from rust-vmm/vfio-ioctls repository.
Fixes #3673
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Now that we introduced a separate method to indicate when the migration
is started, both start_dirty_log() and stop_dirty_log() don't have to
carry an implicit meaning as they can focus entirely on the dirty log
being started or stopped.
For that reason, we can now safely move stop_dirty_log() to the code
section performing non-local migration. It only makes sense to stop
logging dirty pages if this has been started before.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In order to clearly decouple when the migration is started compared to
when the dirty logging is started, we introduce a new method to the
Migratable trait. This clarifies the semantics as we don't end up using
start_dirty_log() for identifying when the migration has been started.
And similarly, we rely on the already existing complete_migration()
method to know when the migration has been ended.
A bug was reported when running a local migration with a vhost-user-net
device in server mode. The reason was that the migration_started
variable was never set to "true", since the start_dirty_log() function
was never invoked.
Signed-off-by: lizhaoxin1 <Lxiaoyouling@163.com>
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
While cloud-hypervisor does support receiving the file descriptors of a
tuntap device, advertising the fds structure via the openAPI can lead to
misinterpretations of what can and what should be done.
An unadvised consumer will think that they could rather just set the
file descriptors there directly, or even pass them as a byte array.
However, the proper way to go in those cases would be actually sending
those via send_msg(), together with the request.
As hacking the openAPI auto-generated code to properly do this is not
*that* trivial, and as doing so during a `create VM` request is not
supported, we'd better not advertise those.
Please, for more details, also check:
https://github.com/cloud-hypervisor/cloud-hypervisor/pull/3607#issuecomment-1020935523
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Now that all the preliminary work has been merged to make Cloud
Hypervisor work with the upstream crate virtio-queue from
rust-vmm/vm-virtio repository, we can move the whole codebase and remove
the local copy of the virtio-queue crate.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Based on the latest code from the micro-http crate, this patch adds the
support for multiple file descriptors to be sent along with the add-net
request. This means we can now hotplug multiqueue network interface to
the VM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Moving the whole codebase to rely on the AccessPlatform definition from
vm-virtio so that we can fully remove it from virtio-queue crate.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
If a PMU is enabled in a VM, we also need to initialize the PMU
when the VM is restored. Otherwise, vCPUs cannot be started after
the VM is restored.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
When enabling the PMU on arm64, the ioctl with group KVM_HAS_DEVICE_ATTR will
be blocked by seccomp, so add it to the authorized list.
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
Relying on helpers for creating the ACPI tables and to add each table to
the HOB, this patch connects the dots to provide the set of ACPI tables
to the TD firmware.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The way to create ACPI tables for TDX is different as each table must be
passed through the HOB. This means the XSDT table is not required since
the firmware will take care of creating it. Same for RSDP, this is
firmware responsibility to provide it to the guest.
That's why this patch creates a TDX dedicated function, returning a list
of Sdt objects, which will let the calling code copy the content of each
table through the HOB.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In case of TDX, we don't want to create the ACPI tables the same way we
do for all the other use cases. That's because the ACPI tables don't
need to be written to guest memory at a specific address, instead they
are passed directly through the HOB.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
It's been decided the ACPI tables will be passed to the firmware in a
different way, rather than using TD_VMM_DATA. Since TD_VMM_DATA was
introduced for this purpose, there's no reason to keep it in our
codebase.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This table is listed as required in the ARM Base Boot Requirements
document. The particular need arises to make the serial debugging of
a Windows guest functional.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Ensure all pending virtio activations (as triggered by MMIO write on the
vCPU threads leading to a barrier wait) are completed before pausing the
vCPUs as otherwise there will be a deadlock with the VMM waiting for the
vCPU to acknowledge its pause and the vCPU waiting for the VMM to
activate the device and release the barrier.
Fixes: #3585
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
These are the leftovers from the commit 8155be2:
arch: aarch64: vm_memory is not required when configuring vcpu
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
When in local migration mode send the FDs for the guest memory over the
socket along with the slot that the FD is associated with. This removes
the requirement for copying the guest RAM and gives significantly faster
live migration performance (of the order of 3s to 60ms).
Fixes: #3566
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Create the VM using the FDs (wrapped in Files) that have been received
during the migration process.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
If this FD (wrapped in a File) is supplied when the RAM region is being
created use that over creating a new one.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This function is used to open an FD (wrapped in a File) that points to
guest memory from memfd_create() or backed on the filesystem.
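A rough sketch of such a helper, assuming the anonymous case uses
memfd_create(2) and the file-backed case simply opens the path:

    use std::ffi::CString;
    use std::fs::{File, OpenOptions};
    use std::os::unix::io::FromRawFd;

    // Return a File backing a guest RAM region, either anonymous (memfd) or
    // on the filesystem.
    fn open_backing_file(path: Option<&str>, size: u64) -> std::io::Result<File> {
        match path {
            Some(p) => OpenOptions::new().read(true).write(true).open(p),
            None => {
                let name = CString::new("guest_ram").unwrap();
                // SAFETY: memfd_create returns a new, owned fd on success.
                let fd = unsafe { libc::memfd_create(name.as_ptr(), libc::MFD_CLOEXEC) };
                if fd < 0 {
                    return Err(std::io::Error::last_os_error());
                }
                // SAFETY: fd is valid and owned by us.
                let file = unsafe { File::from_raw_fd(fd) };
                file.set_len(size)?;
                Ok(file)
            }
        }
    }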
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Drop the unused parameter throughout the code base.
Also take the chance to drop a needless clone.
No functional change intended.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
We've been currently using "fd" as the field name, but it should be
called "fds" since 6664e5a6e7 introduced
the name change on the structure field.
Fixes: #3560
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
`tap` has its default value set to `None`, but in the openapi yaml file
we've been setting it to `""`.
When using this code on the Kata Containers side we'd be hit by an
unexpected behaviour of cloud-hypervisor, as even when using a different
method to initialise the `tuntap` device the code would be treated as if
using `--net tap` (which is a valid use-case).
Related: #3554
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
warning: unnecessary use of `to_string`
--> vmm/src/config.rs:2199:38
|
2199 | ... .get(&memory_zone.to_string())
| ^^^^^^^^^^^^^^^^^^^^^^^^ help: use: `memory_zone`
|
= note: `#[warn(clippy::unnecessary_to_owned)]` on by default
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#unnecessary_to_owned
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
warning: unneeded late initialization
--> vmm/src/acpi.rs:525:5
|
525 | let mut prev_tbl_len: u64;
| ^^^^^^^^^^^^^^^^^^^^^^^^^^
|
= note: `#[warn(clippy::needless_late_init)]` on by default
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#needless_late_init
help: declare `prev_tbl_len` here
|
552 | let mut prev_tbl_len: u64 = madt.len() as u64;
| ~~~~~~~~~~~~~~~~~~~~~~~~~
warning: unneeded late initialization
--> vmm/src/acpi.rs:526:5
|
526 | let mut prev_tbl_off: GuestAddress;
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#needless_late_init
help: declare `prev_tbl_off` here
|
553 | let mut prev_tbl_off: GuestAddress = madt_offset;
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This crate was used in the integration tests to allow the tests to
continue and clean up after a failure. This isn't necessary in the unit
tests and adds a large build dependency chain including an unmaintained
crate.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
If the underlying kernel is old, PTY resize is disabled and this is
represented by the use of None in the provided Option<File> type. In the
virtio-console PTY path don't blindly unwrap() the value that will be
preserved across a reboot.
Fixes: #3496
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
It's important to maintain the ability to run a Cloud Hypervisor binary
with the 'tdx' feature enabled in a non-TDX environment.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Setting up the SIGWINCH handler requires at least Linux 5.7. However
this functionality is not required for basic PTY operation.
Fixes: #3456
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Advertise the PCI MMIO config spaces here so that the MMIO config space
is correctly recognised.
Tested by: with --platform num_pci_segments=1 or 16, hotplugging an NVMe
vfio-user device works correctly with hypervisor-fw & OVMF and direct kernel boot.
Fixes: #3432
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
When restoring replace the internal value of the device tree rather than
replacing the Arc<Mutex<DeviceTree>> itself. This fixes an issue
where the AddressManager has a copy of the original
Arc<Mutex<DeviceTree>> from when the DeviceManager was created. The
original restore path only replaced the DeviceManager's version of the
Arc<Mutex<DeviceTree>>. Instead replace the contents of the
Arc<Mutex<DeviceTree>> so all users see the updated version.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Saving the KVM clock and restoring it is key for correct behaviour of
the VM when doing snapshot/restore or live migration. The clock is
restored to the KVM state as part of the Vm::resume() method; prior to
that it must be extracted from the state object and stored for later use
by this method. This change simplifies the extraction and storage part
so that it is done in the same way for both snapshot/restore and live
migration.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The constant `PCI_MMIO_CONFIG_SIZE` defined in `vmm/pci_segment.rs`
describes the MMIO configuration size for each PCI segment. However,
this name conflicts with the `PCI_MMCONFIG_SIZE` defined in `layout.rs`
in the `arch` crate, which describes the memory size of the PCI MMIO
configuration region.
Therefore, this commit renames the `PCI_MMIO_CONFIG_SIZE` to
`PCI_MMIO_CONFIG_SIZE_PER_SEGMENT` and moves this constant from `vmm`
crate to `arch` crate.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
Currently, a tuple containing PCI space start address and PCI space
size is used to pass the PCI space information to the FDT creator.
In order to support the multiple PCI segment for FDT, more information
such as the PCI segment ID should be passed to the FDT creator. If we
still use a tuple to store this information, the code flexibility and
readability will be harmed.
To address this issue, this commit replaces the tuple containing the
PCI space information to a structure `PciSpaceInfo` and uses a vector
of `PciSpaceInfo` to store PCI space information for each segment, so
that multiple PCI segment information can be passed to the FDT together.
Note that the scope of this commit will only contain the refactor of
original code, the actual multiple PCI segments support will be in
following series, and for now `--platform num_pci_segments` should only
be 1.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
In order to avoid the identity map region to conflict with a possible
firmware being placed in the last 4MiB of the 4GiB range, we must set
the address to a chosen location. And it makes the most sense to have
this region placed right after the TSS region.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Place the 3 page TSS at an explicit location in the 32-bit address space
to avoid conflicting with the loaded raw firmware.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Added fields:
- `Memory address size limit`: the absence of this field triggered
warnings in the guest kernel
- `Node ID`
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
After introducing multiple PCI segments, the `devid` value in
`kvm_irq_routing_entry` exceeds the maximum supported range on AArch64.
This commit restricted the `devid` to the allowed range.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
If the provided binary isn't an ELF binary assume that it is a firmware
to be loaded in directly. In this case we shouldn't program any of the
registers as KVM starts in that state.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
That function call can return -1 when it fails. Wrapping -1 into File
causes the code to panic when the File is dropped.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
I encountered some trouble trying to use a virtio-console hooked up to
a PTY. Reading from the PTY would produce stuff like this
"\n\nsh-5.1# \n\nsh-5.1# " (where I'm just pressing enter at a shell
prompt), and a terminal would render that like this:
----------------------------------------------------------------
sh-5.1#
sh-5.1#
----------------------------------------------------------------
This was because we weren't disabling the ICRNL termios iflag, which
turns carriage returns (\r) into line feeds (\n). Other raw mode
implementations (like QEMU's) set this flag, and don't have this
problem.
Instead of fixing our raw mode implementation to just disable ICRNL,
or copy the flags from QEMU's, though, here I've changed it to use the
raw mode implementation in libc. It seems to work correctly in my
testing, and means we don't have to worry about what exactly raw mode
looks like under the hood any more.
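A sketch of what leaning on libc's raw mode looks like (error handling
trimmed):

    use std::io;
    use std::os::unix::io::RawFd;

    // Put the PTY fd into raw mode using libc's own definition of "raw",
    // which among other things clears ICRNL.
    fn set_raw_mode(fd: RawFd) -> io::Result<()> {
        // SAFETY: termios is a plain C struct and fd is a valid descriptor.
        unsafe {
            let mut termios: libc::termios = std::mem::zeroed();
            if libc::tcgetattr(fd, &mut termios) < 0 {
                return Err(io::Error::last_os_error());
            }
            libc::cfmakeraw(&mut termios);
            if libc::tcsetattr(fd, libc::TCSANOW, &termios) < 0 {
                return Err(io::Error::last_os_error());
            }
        }
        Ok(())
    }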
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Fix seccomp violation when trying to add the out FD to the epoll loop
when the serial buffer needs to be flushed.
0x00007ffff7dc093e in epoll_ctl () at ../sysdeps/unix/syscall-template.S:120
0x0000555555db9b6d in epoll::ctl (epfd=56, op=epoll::ControlOptions::EPOLL_CTL_MOD, fd=55, event=...)
at /home/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/epoll-4.3.1/src/lib.rs:155
0x00005555556f5127 in vmm::serial_buffer::SerialBuffer::add_out_poll (self=0x7fffe800b5d0) at vmm/src/serial_buffer.rs:101
0x00005555556f583d in vmm::serial_buffer::{impl#1}::write (self=0x7fffe800b5d0, buf=...) at vmm/src/serial_buffer.rs:139
0x0000555555a30b10 in std::io::Write::write_all<vmm::serial_buffer::SerialBuffer> (self=0x7fffe800b5d0, buf=...)
at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/io/mod.rs:1527
0x0000555555ab82fb in devices::legacy::serial::Serial::handle_write (self=0x7fffe800b520, offset=0, v=13) at devices/src/legacy/serial.rs:217
0x0000555555ab897f in devices::legacy::serial::{impl#2}::write (self=0x7fffe800b520, _base=1016, offset=0, data=...) at devices/src/legacy/serial.rs:295
0x0000555555f30e95 in vm_device::bus::Bus::write (self=0x7fffe8006ce0, addr=1016, data=...) at vm-device/src/bus.rs:235
0x00005555559406d4 in vmm::vm::{impl#4}::pio_write (self=0x7fffe8009640, port=1016, data=...) at vmm/src/vm.rs:459
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
When running with `--serial pty --console pty --seccomp=false` the
SIGWINCH listener thread would panic as the seccomp filter was empty.
Adopt the mechanism used in the rest of the code and check for non-empty
filter before trying to apply it.
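The guard is essentially the following (a sketch, assuming the seccompiler
crate's apply_filter() and an empty BpfProgram when seccomp is disabled):

    use seccompiler::{apply_filter, BpfProgram};

    fn sigwinch_listener_main(seccomp_filter: BpfProgram) {
        // An empty filter means seccomp is disabled (--seccomp=false);
        // applying an empty program is an error, so skip it in that case.
        if !seccomp_filter.is_empty() {
            apply_filter(&seccomp_filter).expect("applying seccomp filter");
        }
        // ...listener loop...
    }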
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
With the introduction of a new option `affinity` to the `cpus`
parameter, Cloud Hypervisor can now let the user choose the set
of host CPUs where to run each vCPU.
This is useful when trying to achieve CPU pinning, as well as making
sure the VM runs on a specific NUMA node.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Give the option parser the ability to handle tuples with inner brackets
containing list of integers. The following example can now be handled
correctly "option=[key@[v1-v2,v3,v4]]" which means the option is
assigned a tuple with a key associated with a list of integers between
the range v1 - v2, as well as v3 and v4.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Because anyhow version 1.0.46 has been yanked, let's move back to the
previous version 1.0.45.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Always properly initialize vectors so that we don't run into undefined
behavior when the vector gets dropped.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Creates a new generic type Tuple so that the same implementation of
FromStr trait can be reused for both parsing a list of two integers and
parsing a list of one integer associated with a list of integers.
This anticipates the need for retrieving sublists, which will be needed
when trying to describe the host CPU affinity for every vCPU.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The elements of a list should be using commas as the correct delimiter
now that it is supported. Deprecate use of colons as delimiter.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This allocator allocates 64-bit MMIO addresses for use with platform
devices e.g. ACPI control devices and ensures there is no overlap with
PCI address space ranges which can cause issues with PCI device
remapping.
Use this allocator for the ACPI platform devices.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Rather than use the system MMIO allocator for RAM use an allocator that
covers the full RAM range.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This is because the SGX region will be placed between the end of ram and
the start of the device area.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
With the segment id now encoded in the bdf it is not necessary to have
the separate field for it.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
In particular use the accessor for getting the device id from the bdf.
As a side effect the VIOT table is now segment aware.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Since each segment must have a non-overlapping memory range associated
with it the device memory must be equally divided amongst all segments.
A new allocator is used for each segment to ensure that BARs are
allocated from the correct address ranges. This requires changes to
PciDevice::allocate/free_bars to take that allocator and when
reallocating BARs the correct allocator must be identified from the
ranges.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
For all the devices that support being hotplugged (disk, net, pmem, fs
and vsock) add "pci_segment" option and propagate that through to the
addition onto the PCI busses.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Move the decision on whether to use a 64-bit bar up to the DeviceManager
so that it can use both the device type (e.g. block) and the PCI segment
ID to decide what size bar should be used.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Generate a set of 8 IRQs and round-robin distribute those over all the
slots for a bus. This same set of IRQs is then used for all PCI
segments.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The platform config may specify a number of PCI segments to use; if this
is greater than 1 then we add supplemental PCI segments as well as the
default segment.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This currently contains only the number of PCI segments to create.
This is limited to 16 at the moment, which should allow 496 user-specified
PCI devices.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
For the bus scanning the GED AML code now calls into a PSCN method that
scans all buses. This approach was chosen since it correctly handles the
case where one GED interrupt is serviced for two hotplugs on
distinct segments.
The PCIU and PCID field values are now determined by the PSEG field that
is used to select which segment those values should be used for.
Similarly _EJ0 will notify based on the value of _SEG.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Replace the hardcoded zero PCI segment id when adding devices to the bus
and extend the DeviceTree to hold the PCI segment id.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Since each segment must have disjoint address spaces only advertise
address space in the 32-bit range and the PIO address space on the
default (zero) segment.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Split PciSegment::new_default_segment() into a separate
PciSegment::new() and those parts required only for the default segment
(PIO PCI config device.)
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
For now this still contains just one segment but is expanding in
preparation for more segments.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This commit moves the code that generates the DSDT data for the PCI bus
into PciSegment making no functional changes to the generated AML.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Variables that start with underscore are used to silence rustc.
Normally those variables are not used in code.
This patch drops the underscore from variables that are used. This is
less confusing to readers.
No functional change intended.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
This makes it much easier to use since the info!() level produces far
fewer messages and thus has less overhead.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Relying on the vm-virtio/virtio-queue crate from rust-vmm which has been
copied inside the Cloud Hypervisor tree, the entire codebase is moved to
the new definition of a Queue and other related structures.
The reason for this move is to follow the upstream until we get some
agreement for the patches that we need on top of that to make it
properly work with Cloud Hypervisor.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This was causing some issues because of the use of 2 different versions
of the vm-memory crate. We'll wait for all dependencies to be properly
resolved before we move to 0.7.0.
This reverts commit 76b6c62d07.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
When allocating PCI MMIO BARs they should always be naturally aligned
(i.e. aligned to the size of the BAR itself.)
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The allocate_bars method has a side effect which collates the BARs used
for the device and stores them internally. Ensure that any use of this
internal state is after the state is created otherwise no MMIO regions
will be seen and so none will be mapped.
Fixes: #3237
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Instead of creating a MemoryManager from scratch, let's reuse the same
code path used by snapshot/restore, so that memory regions are created
identically to what they were on the source VM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Now that all the pieces are in place, we can restore a VM with the new
codepath that restores properly all memory regions, allowing for ACPI
memory hotplug to work properly with snapshot/restore feature.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Extending the MemoryManager::new() function to be able to create a
MemoryManager from data that has been previously stored, instead of
always creating everything from scratch.
This change brings real added value as it allows a VM to be restored
respecting the proper memory layout instead of hoping the regions will
be created the way they were before.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Storing multiple pieces of data coming from the MemoryManager in order to
be able to restore without creating everything from scratch.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This new function will be able to restore memory regions and memory
zones based on the GuestMemoryMapping list that will be provided through
snapshot/restore and migration phases.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This can help identify which zone relates to which memory range.
This is going to be useful when recreating GuestMemory regions from
the previous layout instead of having to recreate everything from
scratch.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Create a dedicated function to factorize the allocation of the memory
ranges, helping to simplify the MemoryManager::new() function.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
By updating the list of GuestMemory regions with the virtio-mem ones
before the creation of the MemoryManager, we know the GuestMemory is up
to date and the allocation of memory ranges is simplified afterwards.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In order to simplify the MemoryManager::new() function, let's move the
memory configuration validation to its own function.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Move the PciSegment struct and the associated code to a new file. This
will allow some clearer separation between the core DeviceManager and
PCI handling.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Move the PCI related state from the DeviceManager struct to a PciSegment
struct inside the DeviceManager. This is in preparation for multiple
segment support. Currently this state is just the bus itself, the MMIO
and PIO config devices and hotplug related state.
The main change that this required is using the Arc<Mutex<PciBus>> in
the device addition logic in order to ensure that
the bus could be created earlier.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
When using PVH for booting (which we use for all firmwares and direct
kernel boot) the Linux kernel does not configure LA57 correctly. As such
we need to limit the address space to the maximum 4-level paging address
space.
If the user knows that their guest image can take advantage of the
5-level addressing and they need it for their workload then they can
increase the physical address space appropriately.
This PR removes the TDX specific handling as the new address space limit
is below the one that that code specified.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Whenever running TDX, we must pass the ACPI tables to the TDVF firmware
running in the guest. The proper way to do this is by adding the tables
to the TdHob as a TdVmmData type, so that TDVF will know how to access
these tables and expose them to the guest OS.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Instead of having the ACPI tables being created both in x86_64 and
aarch64 implementations of configure_system(), we can remove the
duplicated code by moving the ACPI tables creation in vm.rs inside the
boot() function.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The argument `prefault` is provided in MemoryManager, but it can
only be used by SGX and restore.
With prefault (MAP_POPULATE) set, subsequent page faults decrease
at runtime, although it makes boot slower.
This commit adds `prefault` in MemoryConfig and MemoryZoneConfig.
To resolve the conflict between memory and restore, the argument
`prefault` has been changed from `bool` to `Option<bool>`: when
its value is None, the config from memory will be used, otherwise
the argument in the Option will be used.
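A minimal sketch of how the `Option<bool>` override could resolve against the
memory config (illustrative only, not the actual MemoryManager code):
```
// The per-call `prefault` override falls back to the memory config when None.
fn effective_prefault(override_prefault: Option<bool>, config_prefault: bool) -> bool {
    override_prefault.unwrap_or(config_prefault)
}

// The resolved flag then translates into MAP_POPULATE when mmap'ing the
// backing memory, trading a slower boot for fewer page faults at runtime.
```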
Signed-off-by: Yu Li <liyu.yukiteru@bytedance.com>
By using a single file for storing the memory ranges, we simplify the
way snapshot/restore works by avoiding multiple files, but the more
important point is that we now have a way to save only the ranges
that matter. In particular, the ranges related to virtio-mem regions are
not always fully hotplugged, meaning we don't want to save the entire
region. That's where the usage of memory ranges is interesting as it
lets us optimize the snapshot/restore process when one or multiple
virtio-mem regions are involved.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The function memory_range_table() will be reused by the MemoryManager in
a following patch to describe all the ranges that we should snapshot.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Copy only the memory ranges that have been plugged through virtio-mem,
allowing for an interesting optimization regarding the time it takes to
migrate a large virtio-mem device. Even if the hotpluggable space is
very large (say 64GiB), if only 1GiB has been previously added to the
VM, only 1GiB will be sent to the destination VM, avoiding the transfer
of the remaining 63GiB which are unused.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
By creating the BlocksState object in the MemoryManager, we can directly
provide it to the virtio-mem device when being created. This will allow
the MemoryManager through each VirtioMemZone to have a handle onto the
blocks that are plugged at any point in time.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This will be helpful to support the creation of a MemoryRangeTable from
virtio-mem, as it uses 2M pages.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Adding the snapshot/restore support along with migration as well,
allowing a VM with virtio-mem devices attached to be properly
migrated.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The amount of memory plugged in the virtio-mem region should always be
kept up to date in the hotplugged_size field from VirtioMemZone.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
There's no need to duplicate the GuestMemory for snapshot purpose, as we
always have a handle onto the GuestMemory through the guest_memory
field.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Since we only support a single PCI bus right now advertise only a single
bus in the ACPI tables. This reduces the number of VM exits from probing
substantially.
Number of PCI config I/O port exits: 17871 -> 1551 (91% reduction) with
direct kernel boot.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Use a simpler method for extracting the affected slot on the eject
command. Also update the terminology to reflect that this is a slot
rather than a bdf (which is what device id refers to elsewhere.)
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Refactor the serial buffer handling in order to write the serial
buffer's output to a PTY connected after the serial device stops being
written to by the guest.
This change moves the serial buffer initialization inside the serial
manager. That is done to allow the serial buffer to be made aware of
the PTY and epoll fds needed in order to modify the
EpollDispatch::File trigger. These are then used by the serial buffer
to trigger an epoll event when the PTY fd is writable and the buffer
has content in it. They are also used to remove the trigger when the
buffer is emptied in order to avoid unnecessary wake-ups.
Signed-off-by: William Douglas <william.douglas@intel.com>
Both the read_exact_from() and write_all_to() functions from the GuestMemory
trait implementation in vm-memory are buggy. They should retry until
they have written or read the expected amount of data, but instead
they simply return an error on a partial read or write. This causes the
migration to fail when trying to send large amounts of data through the
migration socket, due to large memory regions.
This should be eventually fixed in vm-memory, and here is the link to
follow up on the issue: https://github.com/rust-vmm/vm-memory/issues/174
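For illustration, the kind of retry loop the migration code needs looks
roughly like this (generic over any writer, not the vm-memory API itself):
```
use std::io::Write;

// Keep writing until the whole buffer has been sent, instead of giving up
// on a short write.
fn write_all_retry<W: Write>(writer: &mut W, mut buf: &[u8]) -> std::io::Result<()> {
    while !buf.is_empty() {
        match writer.write(buf) {
            Ok(0) => {
                return Err(std::io::Error::new(
                    std::io::ErrorKind::WriteZero,
                    "failed to write whole buffer",
                ))
            }
            Ok(n) => buf = &buf[n..],
            Err(ref e) if e.kind() == std::io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(())
}
```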
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Implement the infrastructure that lets a virtio-mem device map the guest
memory into the device. This is necessary since with virtio-mem zones
memory can be added or removed and the vfio-user device must be
informed.
Fixes: #3025
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
By moving this from the VfioUserPciDevice to DeviceManager the client
can be reused for handling DMA mapping behind an IOMMU.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
For vfio-user the mapping handler is per device and needs to be removed
when the device is unplugged.
For VFIO the mapping handler is for the default VFIO container (used
when no vIOMMU is used - using a vIOMMU does not require mappings with
virtio-mem)
To represent these two use cases use an enum for the handlers that are
stored.
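A hypothetical sketch of that enum (the trait name below is a placeholder,
not the actual vm-device trait):
```
use std::sync::Arc;

trait ExternalDmaMapping: Send + Sync {}

enum DmaMappingHandler {
    // One handler per vfio-user device, removed when the device is unplugged.
    VfioUser(Arc<dyn ExternalDmaMapping>),
    // A single handler for the default VFIO container (no vIOMMU case).
    VfioContainer(Arc<dyn ExternalDmaMapping>),
}
```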
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Looking up devices on the port I/O bus is time consuming during the
boot as there is an O(lg n) tree lookup and the overhead from taking a
lock on the bus contents.
Avoid this by adding a fast path that uses the hardcoded port address and
size and directs PCI config requests straight to the device.
Command line:
target/release/cloud-hypervisor --kernel ~/src/linux/vmlinux --cmdline "root=/dev/vda1 console=ttyS0" --serial tty --console off --disk path=~/workloads/focal-server-cloudimg-amd64-custom-20210609-0.raw --api-socket /tmp/api
PIO exit: 17913
PCI fast path: 17871
Percentage on fast path: 99.8%
perf before:
marvin:~/src/cloud-hypervisor (main *)$ perf report -g | grep resolve
6.20% 6.20% vcpu0 cloud-hypervisor [.] vm_device::bus::Bus::resolve
perf after:
marvin:~/src/cloud-hypervisor (2021-09-17-ioapic-fast-path *)$ perf report -g | grep resolve
0.08% 0.08% vcpu0 cloud-hypervisor [.] vm_device::bus::Bus::resolve
The compromise required to implement this fast path is bringing the
creation of the PciConfigIo device into the DeviceManager::new() so that
it can be used in the VmmOps struct which is created before
DeviceManager::create_devices() is called.
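A simplified sketch of the fast-path idea (illustrative names, not the actual
VmmOps implementation):
```
// PCI config accesses on x86 go through ports 0xcf8-0xcff, so they can be
// routed straight to the PciConfigIo device without taking the bus lock and
// doing the tree lookup.
const PCI_CONFIG_IO_PORT: u64 = 0xcf8;
const PCI_CONFIG_IO_PORT_SIZE: u64 = 0x8;

fn pio_write(port: u64, data: &[u8]) {
    if (PCI_CONFIG_IO_PORT..PCI_CONFIG_IO_PORT + PCI_CONFIG_IO_PORT_SIZE).contains(&port) {
        // Fast path: dispatch directly to the PCI config device.
        pci_config_io_write(port - PCI_CONFIG_IO_PORT, data);
    } else {
        // Slow path: resolve the device on the PIO bus.
        io_bus_write(port, data);
    }
}

fn pci_config_io_write(_offset: u64, _data: &[u8]) { /* ... */ }
fn io_bus_write(_addr: u64, _data: &[u8]) { /* ... */ }
```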
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The MSI IOVA address differs between X86 and AArch64.
This commit refactors the code to receive the MSI IOVA address and size
from device_manager, which provides the actual IOVA space data for both
architectures.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
Add a virtio-iommu node into FDT if iommu option is turned on. Now we
support only one virtio-iommu device.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
This change switches from handling serial input in the VMM thread to
its own thread controlled by the SerialManager.
The motivation for this change is to avoid the VMM thread being unable
to process events while serial input is happening and vice versa.
The change also makes future work flushing the serial buffer on PTY
connections easier.
Signed-off-by: William Douglas <william.douglas@intel.com>
This change adds a SerialManager with its own epoll handling that
should be created and run by the DeviceManager when creating an
appropriately configured console (serial tty or pty).
Both stdin and pty input are handled by the SerialManager. The stdin
and pty specific methods used by the VMM should be removed in a future
commit.
Signed-off-by: William Douglas <william.douglas@intel.com>
The clone method for PtyPair should have been an impl of the Clone
trait but the method ended up not being used. Future work will make
use of the trait however so correct the missing trait implementation.
Signed-off-by: William Douglas <william.douglas@intel.com>
For most use cases, there is no need to create multiple VFIO containers
as it causes unwanted behaviors. Especially when passing multiple
devices from the same IOMMU group, we need to use the same container so
that it can properly list the groups that have been already opened. The
correct logic was already there in vfio-ioctls, but it was incorrectly
used from our VMM implementation.
For the special case where we put a VFIO device behind a vIOMMU, we must
create one container per device, as we need to control the DMA mappings
per device, which is performed at the container level. Because we must
keep one container per device, the vIOMMU use case prevents multiple
devices attached to the same IOMMU group to be passed through the VM.
But this is a limitation that we are fine with, especially since the
vIOMMU doesn't let us group multiple devices in the same group from a
guest perspective.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
When a pty is resized (using the TIOCSWINSZ ioctl -- see ioctl_tty(2)),
the kernel will send a SIGWINCH signal to the pty's foreground process
group to notify it of the resize. This is the only way to be notified
by the kernel of a pty resize.
We can't just make the cloud-hypervisor process's process group the
foreground process group though, because a process can only set the
foreground process group of its controlling terminal, and
cloud-hypervisor's controlling terminal will often be the terminal the
user is running it in. To work around this, we fork a subprocess in a
new process group, and set its process group to be the foreground
process group of the pty. The subprocess additionally must be running
in a new session so that it can have a different controlling
terminal. This subprocess writes a byte to a pipe every time the pty
is resized, and the virtio-console device can listen for this in its
epoll loop.
Alternatives I considered were to have the subprocess just send
SIGWINCH to its parent, and to use an eventfd instead of a pipe.
I decided against the signal approach because re-purposing a signal
that has a very specific meaning (even if this use was only slightly
different to its normal meaning) felt unclean, and because it would
have required using pidfds to avoid race conditions if
cloud-hypervisor had terminated, which added complexity. I decided
against using an eventfd because using a pipe instead allows the child
to be notified (via poll(2)) when nothing is reading from the pipe any
more, meaning it can be reliably notified of parent death and
terminate itself immediately.
I used clone3(2) instead of fork(2) because without
CLONE_CLEAR_SIGHAND the subprocess would inherit signal-hook's signal
handlers, and there's no other straightforward way to restore all signal
handlers to their defaults in the child process. The only way to do
it would be to iterate through all possible signals, or maintain a
global list of monitored signals ourselves (vmm::vm::HANDLED_SIGNALS is
insufficient because it doesn't take into account e.g. the SIGSYS
signal handler that catches seccomp violations).
Signed-off-by: Alyssa Ross <hi@alyssa.is>
This prepares us to be able to handle console resizes in the console
device's epoll loop, which we'll have to do if the output is a pty,
since we won't get SIGWINCH from it.
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Musl often uses mmap to allocate memory where Glibc would use brk.
This has caused seccomp violations for me on the API and signal
handling threads.
Signed-off-by: Alyssa Ross <hi@alyssa.is>
error: all if blocks contain the same code at the end
--> vmm/src/memory_manager.rs:884:9
|
884 | / Ok(mm)
885 | | }
| |_________^
Signed-off-by: Bo Chen <chen.bo@intel.com>
This concept ends up being broken with multiple types of input connected,
e.g. console on TTY and serial on PTY. The code for injecting input into
the serial device already checks that the serial is configured.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Introduce a common solution for spawning the virtio threads which will
make it easier to add the panic handling.
During this effort I discovered that there were no seccomp filters
registered for the vhost-user-net thread nor the vhost-user-block
thread. This change also incorporates basic seccomp filters for those as
part of the refactoring.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The current AArch64 power button only works for device tree boot, using a
PL061 GPIO controller device. Since AArch64 now supports ACPI, this
commit extends the power button support on AArch64 to:
- Using GED for ACPI+UEFI boot.
- Using PL061 for device tree boot.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
These statements are useful for understanding the cause of reset or
shutdown of the VM and are not spammy so should be included at info!()
level.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Despite setting up a dedicated thread for signal handling, we weren't
making sure that the signals we were listening for there were actually
dispatched to the right thread. While the signal-hook provides an
iterator API, so we can know that we're only processing the signals
coming out of the iterator on our signal handling thread, the actual
signal handling code from signal-hook, which pushes the signals onto
the iterator, can run on any thread. This can lead to seccomp
violations when the signal-hook signal handler does something that
isn't allowed on that thread by our seccomp policy.
To reproduce, resize a terminal running cloud-hypervisor continuously
for a few minutes. Eventually, the kernel will deliver a SIGWINCH to
a thread with a restrictive seccomp policy, and a seccomp violation
will trigger.
As part of this change, it's also necessary to allow rt_sigreturn(2)
on the signal handling thread, so signal handlers are actually allowed
to run on it. The fact that this didn't seem to be needed before
makes me think that signal handlers were almost _never_ actually
running on the signal handling thread.
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Move the processing of the input from stdin, PTY or file from the VMM
thread to the existing virtio-console thread. The handling of the resize
of a virtio-console has not changed but the name of the struct used to
support that has been renamed to reflect its usage.
Fixes: #3060
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Downcasting the GicDevice trait object might fail. Therefore we try to
downcast it first, and only if the downcasting succeeds do we
then use the object to call methods. Otherwise, do nothing and
log the failure.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
This commit implements the GIC (including both GICv3 and GICv3ITS)
Pausable trait. The pause of device manager will trigger a "pause"
of GIC, where we flush GIC pending tables and ITS tables to the
guest RAM.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
This prevents the boot of the guest kernel from being blocked by
blocking I/O on the serial output since the data will be buffered into
the SerialBuffer.
Fixes: #3004
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Introduce a dynamic buffer for storing output from the serial port. The
SerialBuffer implements std::io::Write and can be used in place of the
direct output for the serial device.
Internally the buffer is a vector that grows dynamically based on
demand up to a fixed size, at which point old data will be overwritten.
Currently the buffer is only flushed upon writes.
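A rough sketch of that buffering behaviour (the size limit and eviction
policy here are illustrative only):
```
use std::io::Write;

const MAX_BUFFERED_BYTES: usize = 1 << 20;

struct SerialBuffer {
    buffer: Vec<u8>,
}

impl Write for SerialBuffer {
    fn write(&mut self, buf: &[u8]) -> std::io::Result<usize> {
        self.buffer.extend_from_slice(buf);
        if self.buffer.len() > MAX_BUFFERED_BYTES {
            // Overwrite old data by dropping it from the front of the buffer.
            let excess = self.buffer.len() - MAX_BUFFERED_BYTES;
            self.buffer.drain(..excess);
        }
        Ok(buf.len())
    }

    fn flush(&mut self) -> std::io::Result<()> {
        // The real implementation flushes to the serial output when possible.
        Ok(())
    }
}
```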
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The rust-vmm crates we're pulling from git have renamed their main
branches. We need to update the branch names we're giving to Cargo,
or people who don't have these dependencies cached will get errors
like this when trying to build:
error: failed to get `vm-fdt` as a dependency of package `arch v0.1.0 (/home/src/cloud-hypervisor/arch)`
Caused by:
failed to load source for dependency `vm-fdt`
Caused by:
Unable to update https://github.com/rust-vmm/vm-fdt?branch=master#031572a6
Caused by:
object not found - no match for id (031572a6edc2f566a7278f1e17088fc5308d27ab); class=Odb (9); code=NotFound (-3)
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Remove the indirection of a dispatch table and simply use the enum as
the event data for the events.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Use two separate events for the console and serial PTY and then drive
the handling of the inputs on the PTY separately. This results in the
correct behaviour when both console and serial are attached to the PTY
as they are triggered separately on the epoll so events are not lost.
Fixes: #3012
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Check the config to find out which device is attached to the tty and
then send the input from the user into that device (serial or
virtio-console.)
Fixes: #3005
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
vhdx_sync.rs in block_util implements traits to represent the vhdx
crate as a supported block device in Cloud Hypervisor. The vhdx format
is added to the block device list in device_manager.rs in the vmm
crate so that it can automatically detect a vhdx disk and invoke the
corresponding crate.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Signed-off-by: Fazla Mehrab <akm.fazla.mehrab@intel.com>
We are relying on applying empty 'seccomp' filters to support the
'--seccomp false' option, which will be treated as an error with the
updated 'seccompiler' crate. This patch fixes this issue by explicitly
checking whether the 'seccomp' filter is empty before applying the
filter.
Signed-off-by: Bo Chen <chen.bo@intel.com>
It is forbidden that the same memory zone belongs to more than one
NUMA node. This commit adds related validation to the `--numa`
parameter to prevent the user from specifying such configuration.
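An illustrative sketch of the validation, with simplified placeholder types
standing in for the real NumaConfig structures:
```
use std::collections::HashSet;

struct NumaConfig {
    guest_numa_id: u32,
    memory_zones: Option<Vec<String>>,
}

// Every memory zone id may appear in at most one NUMA node.
fn validate_numa(configs: &[NumaConfig]) -> Result<(), String> {
    let mut seen = HashSet::new();
    for config in configs {
        if let Some(zones) = &config.memory_zones {
            for zone in zones {
                if !seen.insert(zone.clone()) {
                    return Err(format!(
                        "memory zone '{}' assigned to more than one NUMA node (node {})",
                        zone, config.guest_numa_id
                    ));
                }
            }
        }
    }
    Ok(())
}
```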
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
The optional device tree node distance-map describes the relative
distance (memory latency) between all NUMA nodes.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
This is to make sure the NUMA node data structures can be accessed
both from the `vmm` crate and `arch` crate.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
The AArch64 platform provides a NUMA binding for the device tree,
which means on AArch64 platform, the NUMA setup can be extended to
more than the ACPI feature.
Based on above, this commit extends the NUMA setup and data
structures to following scenarios:
- All AArch64 platform
- x86_64 platform with ACPI feature enabled
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
Signed-off-by: Michael Zhao <Michael.Zhao@arm.com>
Instead of panicking with an expect() function, the QcowDiskSync::new
function now propagates the error properly. This ensures the VMM will
not panic, which might be the source of weird errors if only one thread
exits while the VMM continues to run.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
We cannot let vhost-user devices connect to the backend when the Block,
Fs or Net object is being created during a restore/migration. The reason
is we can't have two VMs (source and destination) connected to the same
backend at the same time. That's why we must delay the connection with
the vhost-user backend until the restoration is performed.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The code wasn't doing what it was expected to. The '?' was simply
returning the error to the top level function, meaning the Err() case in
the match was never hit. Moving the whole logic to a dedicated function
allows us to identify when something goes wrong without propagating the
error to the calling function, so that we can still stop the dirty logging
and unpause the VM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In case the migration succeeds, the destination VM will be correctly
running, with potential vhost-user backends attached to it. We can't let
the source VM try to reconnect to the same backends, which is why
it's safer to shut down the source VM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In anticipation for creating vhost-user devices in a different way when
being restored compared to a fresh start, this commit introduces a new
boolean created by the Vm depending on the use case, and passed down to
the DeviceManager. In the future, the DeviceManager will use this flag
to assess how vhost-user devices should be created.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Correct operation of user devices (vfio-user) requires shared memory so
flag this to prevent it from failing in strange ways.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Create the vfio-user / user devices from the config. Currently hotplug
of the devices is not supported nor can they be placed behind the
(virt-)iommu.
Removal of the coldplugged device is however supported.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This allows the user to specify devices that are running in a different
userspace process and are communicated with over vfio-user.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Allow vsocks to connect to Unix sockets on the host when running
cloud-hypervisor with seccomp enabled.
Reported-by: Philippe Schaaf <philippe.schaaf@secunet.com>
Tested-by: Franz Girlich <franz.girlich@tu-ilmenau.de>
Signed-off-by: Markus Theil <markus.theil@tu-ilmenau.de>
This doesn't really affect the build as we ship a Cargo.lock with fixed
versions in. However for clarity it makes sense to use fixed versions
throughout and let dependabot update them.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The optional Processor Properties Topology Table (PPTT) table is
used to describe the topological structure of processors controlled
by the OSPM, and their shared resources, such as caches. The table
can also describe additional information such as which nodes in the
processor topology constitute a physical package.
The ACPI PPTT table supports topology descriptions for ACPI guests.
Therefore, this commit adds the PPTT table for AArch64 to enable
CPU topology feature for ACPI.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
In an Arm system, the hierarchy of CPUs is defined through three
entities that are used to describe the layout of physical CPUs in
the system:
- cluster
- core
- thread
All three of these entities have their own FDT node field. Therefore,
this commit adds an AArch64-specific helper to pass the config from
the Cloud Hypervisor command line to `configure_system`, where
eventually `create_fdt` is called.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
Make sure the DeviceManager is triggered for all migration operations.
The dirty pages from MemoryManager and DeviceManager are merged before
being sent up to the Vmm in lib.rs.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Now that Migratable provides the methods for starting, stopping and
retrieving the dirty pages, we move the existing code to these new
functions.
No functional change intended.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This patch connects the dots between the vm.rs code and each Migratable
device, in order to make sure Migratable methods are correctly invoked
when migration happens.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In anticipation for supporting the merge of multiple dirty pages coming
from multiple devices, this patch factorizes the creation of a
MemoryRangeTable from a bitmap, as well as providing a simple method for
merging the dirty pages regions under a single MemoryRangeTable.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This patch adds all the seccomp rules missing for MSHV.
With this patch MSFT internal CI runs with seccomp enabled.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
This patch adds a fallback path for sending live migration, where it
ensures the following behavior of source VM post live-migration:
1. The source VM will be paused only when the migration is completed
successfully, or otherwise it will keep running;
2. The source VM will always stop dirty pages logging.
Fixes: #2895
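A simplified sketch of that control flow, assuming placeholder helper
functions:
```
// The source VM is paused only on success, while dirty pages logging is
// stopped on every exit path.
fn send_migration() -> Result<(), String> {
    start_dirty_log()?;

    let result = do_send_migration();

    // Always stop dirty pages logging, whether the migration succeeded or not.
    stop_dirty_log()?;

    match result {
        // Only pause the source VM once the migration has completed.
        Ok(()) => pause_vm(),
        Err(e) => Err(e),
    }
}

fn start_dirty_log() -> Result<(), String> { Ok(()) }
fn stop_dirty_log() -> Result<(), String> { Ok(()) }
fn do_send_migration() -> Result<(), String> { Ok(()) }
fn pause_vm() -> Result<(), String> { Ok(()) }
```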
Signed-off-by: Bo Chen <chen.bo@intel.com>
This rule is needed to boot a Windows guest.
This bug was introduced while we tried to boot
a Windows guest on MSHV.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
This patch modifies the existing live migration code
to support MSHV. It adds a couple of new functions to enable
and disable dirty page tracking, and adds the missing IOCTL
to the seccomp rules for live migration.
It also adds the necessary flags for MSHV.
These changes don't affect KVM functionality at all.
In order to get better performance it is good to
enable dirty page tracking when we start live migration
and disable it when the migration is done.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
Right now, the get_dirty_log API has two parameters,
slot and memory_size.
MSHV needs the GPA to retrieve the page states, as
MSHV returns the state based on PFN.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
It's totally acceptable to snapshot and restore a virtio-fs device that
has no cache region, since this is a valid mode of functioning for
virtio-fs itself.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
With the new beta version, clippy complains about redundant allocation
when using Arc<Box<dyn T>>, and suggests replacing it simply with
Arc<dyn T>.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
As we are now using a global control to start/stop dirty pages log from
the `hypervisor` crate, we need to explicitly tell the hypervisor (KVM)
whether a region needs dirty page tracking when it is created.
This reverts commit f063346de3.
Signed-off-by: Bo Chen <chen.bo@intel.com>
Following KVM interfaces, the `hypervisor` crate now provides interfaces
to start/stop the dirty pages logging on a per-region basis, and asks
its users (e.g. the `vmm` crate) to iterate over the regions that need
dirty pages logging. MSHV only has a global control to start/stop dirty
pages logging on all regions at once.
This patch refactors the related APIs from the `hypervisor` crate to provide
a global control to start/stop dirty pages logging (following MSHV's
behavior), and keeps track of the regions that need dirty pages logging
for KVM. It avoids leaking hypervisor-specific behaviors out of the
`hypervisor` crate.
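A hypothetical sketch of the refactored shape; method and type names only
approximate the real hypervisor crate interface:
```
trait Vm {
    // Global controls exposed to the vmm crate.
    fn start_dirty_log(&self) -> Result<(), String>;
    fn stop_dirty_log(&self) -> Result<(), String>;
}

struct KvmVm {
    // Regions that need dirty pages logging, tracked at creation time.
    dirty_log_slots: Vec<(u32 /* slot */, u64 /* gpa */, u64 /* size */)>,
}

impl Vm for KvmVm {
    fn start_dirty_log(&self) -> Result<(), String> {
        for (_slot, _gpa, _size) in &self.dirty_log_slots {
            // Re-register each slot with dirty logging enabled.
        }
        Ok(())
    }

    fn stop_dirty_log(&self) -> Result<(), String> {
        for (_slot, _gpa, _size) in &self.dirty_log_slots {
            // Re-register each slot with dirty logging disabled.
        }
        Ok(())
    }
}
```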
Signed-off-by: Bo Chen <chen.bo@intel.com>
This patch adds a common function "Vmm::vm_check_cpuid_compatibility()"
to be shared by both live-migration and snapshot/restore.
Signed-off-by: Bo Chen <chen.bo@intel.com>
We now send not only the 'VmConfig' at the 'Command::Config' step of
live migration, but also send the 'common CPUID'. In this way, we can
check the compatibility of CPUID features between the source and
destination VMs, and abort live migration early if needed.
Signed-off-by: Bo Chen <chen.bo@intel.com>
With the support of dynamically turning on/off dirty-pages-log during
live-migration (only for guest RAM regions), we now can create guest
memory regions without dirty-pages-log by default both for guest RAM
regions and other regions backed by file/device.
Signed-off-by: Bo Chen <chen.bo@intel.com>
This patch extends slightly the current live-migration code path with
the ability to dynamically start and stop logging dirty-pages, which
relies on two new methods added to the `hypervisor::vm::Vm` Trait. This
patch also contains a complete implementation of the two new methods
based on `kvm` and placeholders for `mshv` in the `hypervisor` crate.
Fixes: #2858
Signed-off-by: Bo Chen <chen.bo@intel.com>
Whenever a file descriptor is sent through the control message, it
requires fcntl() syscall to handle it, meaning we must allow it through
the list of syscalls authorized for the HTTP thread.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
When running TDX guest, the Guest Physical Address space is limited by
a shared bit that is located on bit 47 for 4 level paging, and on bit 51
for 5 level paging (when the GPAW bit is 1). In order to keep things simple,
and since a 47-bit address space is 128 TiB large, we limit
the physical addressable space to 47 bits when running TDX.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
When running a TDX guest, we need the virtio drivers to use the DMA API
to share specific memory pages with the VMM on the host. The point is to
let the VMM get access to the pages related to the buffers pointed by
the virtqueues.
The way to force the virtio drivers to use the DMA API is by exposing
the virtio devices with the feature VIRTIO_F_IOMMU_PLATFORM. This is a
feature indicating the device will require some address translation, as
it will not deal directly with physical addresses.
Cloud Hypervisor takes care of this requirement by adding a generic
parameter called "force_iommu". This parameter value is decided based on
the "tdx" feature gate, and then passed to the DeviceManager. It's up to
the DeviceManager to use this parameter on every virtio device creation,
which will imply setting the VIRTIO_F_IOMMU_PLATFORM feature.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This refactoring ensures all CPUID related operations are centralized in
`arch::x86_64` module, and exposes only two related public functions to
the vmm crate, e.g. `generate_common_cpuid` and `configure_vcpu`.
Signed-off-by: Bo Chen <chen.bo@intel.com>
In order to let a separate process open a TAP device and pass the file
descriptor through the control message mechanism, this patch adds the
support for sending a file descriptor over to the Cloud Hypervisor
process along with the add-net HTTP API command.
The implementation uses the NetConfig structure mutably to update the
list of fds with the one passed through control message. The list should
always be empty prior to this, as it makes no sense to provide a list of
fds once the Cloud Hypervisor process has already been started.
It is important to note that reboot is supported since the file
descriptor is duplicated upon receipt, letting the VM use only the
duplicated one. The original file descriptor is kept open in order to
support a potential reboot.
Fixes #2525
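A sketch of the duplication step (assuming the fd arrived via SCM_RIGHTS; not
the actual NetConfig handling code):
```
use std::os::unix::io::RawFd;

// Duplicate the received TAP fd; the VM uses the duplicate while the
// original stays open so the device can be reused across a reboot.
fn duplicate_received_fd(fd: RawFd) -> std::io::Result<RawFd> {
    // SAFETY: `fd` is a valid file descriptor received over the control message.
    let dup_fd = unsafe { libc::dup(fd) };
    if dup_fd < 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(dup_fd)
}
```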
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
There are some seccomp rules needed for MSHV
in virtio-devices but not for KVM. We only want to
add those rules based on MSHV feature guard.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
The micro-http crate now uses recvmsg() syscall in order to receive file
descriptors through control messages. This means the syscall must be
part of the authorized list in the seccomp filters.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The vfio-ioctls crate now contains a KVM feature gate. Make use of it in
Cloud Hypervisor.
That crate has two users. For the vmm crate it is straightforward. For
the vm-device crate, we introduce a KVM feature gate as well so that the
vmm crate can pass on the configuration.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
This new option allows the user to define a list of SGX EPC sections
attached to a specific NUMA node.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In order to uniquely identify each SGX EPC section, we introduce a
mandatory option `id` to the `--sgx-epc` parameter.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The guest can see that SGX supports provisioning as it is exposed
through the CPUID. This patch enables the proper backing of this
feature by having the host open the provisioning device and enable
this capability through the hypervisor.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This patch fixes a few things to support TDVF correctly.
The HOB memory resources must contain EFI_RESOURCE_ATTRIBUTE_ENCRYPTED
attribute.
Any section with a base address within the already allocated guest RAM
must not be allocated.
The list of TD_HOB memory resources should contain both TempMem and
TdHob sections as well.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Previously the same function was used to both create and remove regions.
This worked on KVM because it uses size 0 to indicate removal.
MSHV has two calls -- one for creation and one for removal. It also
requires having the size field available because it is not slot based.
Split set_user_memory_region to {create/remove}_user_memory_region. For
KVM they still use set_user_memory_region underneath, but for MSHV they
map to different functions.
This fixes user memory region removal on MSHV.
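An illustrative sketch of the split; the region type and method signatures
only approximate the hypervisor crate's interface:
```
struct UserMemoryRegion {
    slot: u32,
    guest_phys_addr: u64,
    memory_size: u64,
    userspace_addr: u64,
    flags: u32,
}

trait Vm {
    fn create_user_memory_region(&self, region: UserMemoryRegion) -> Result<(), String>;
    fn remove_user_memory_region(&self, region: UserMemoryRegion) -> Result<(), String>;
}

// For KVM both methods can remain thin wrappers around
// KVM_SET_USER_MEMORY_REGION (removal passes a size of 0), whereas MSHV maps
// them to its distinct map/unmap calls.
```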
Signed-off-by: Wei Liu <liuwe@microsoft.com>
The to-be-introduced MSHV rules don't need to contain KVM rules and vice
versa.
Put KVM constants into a module. This avoids the warnings about
dead code in the future.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
This commit introduces the `ProcessorGiccAffinity` struct for the
AArch64 platform. This struct will be created and included into
the SRAT table to enable AArch64 NUMA setup.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
It ensures all handlers for `ApiRequest` in `control_loop` are
consistent and minimal, and read better.
No functional changes.
Signed-off-by: Bo Chen <chen.bo@intel.com>
It simplifies a bit the `Vmm::control_loop` and reads better to be
consistent with other `ApiRequest` handlers. Also, it removes the
repetitive `ApiError::VmAlreadyCreated` and makes `ApiError::VmCreate`
useful.
No functional changes.
Signed-off-by: Bo Chen <chen.bo@intel.com>
We have been building Cloud Hypervisor with command like:
`cargo build --no-default-features --features ...`.
After implementing ACPI, we do not have to specify all features
explicitly. The default build command `cargo build` now works.
This commit fixes some build warnings with the default build options and
changes the GitHub workflow correspondingly.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
error: avoid using `collect()` when not needed
--> vmm/src/vm.rs:630:86
|
630 | let node_id_list: Vec<u32> = configs.iter().map(|cfg| cfg.guest_numa_id).collect();
| ^^^^^^^
...
664 | if !node_id_list.contains(&dest) {
| ---------------------------- the iterator could be used here instead
|
= note: `-D clippy::needless-collect` implied by `-D warnings`
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#needless_collect
Signed-off-by: Bo Chen <chen.bo@intel.com>
Issue from beta version of clippy:
Error: --> vm-virtio/src/queue.rs:700:59
|
700 | if let Some(used_event) = self.get_used_event(&mem) {
| ^^^^ help: change this to: `mem`
|
= note: `-D clippy::needless-borrow` implied by `-D warnings`
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#needless_borrow
Signed-off-by: Bo Chen <chen.bo@intel.com>
Issue from beta version of clippy:
error: field is never read: `type`
--> vmm/src/cpu.rs:235:5
|
235 | pub r#type: u8,
| ^^^^^^^^^^^^^^
|
= note: `-D dead-code` implied by `-D warnings`
Signed-off-by: Bo Chen <chen.bo@intel.com>
The Linux kernel expects that any PCI devices that advertise I/O BARs
use an address that is within the range advertised by the bus
itself. Unfortunately we were not advertising any I/O ports associated
with the PCI bus in the ACPI tables.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
In order to allow a hotplugged vCPU to be assigned to the correct NUMA
node in the guest, the DSDT table must expose the _PXM method for each
vCPU. This method defines the proximity domain to which each vCPU should
be attached.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The _PXM method always returns 0, which is wrong since the SRAT might
tell differently. The point of the _PXM method is to be evaluated by the
guest OS when some new memory slot is being plugged, but this will never
happen for Cloud Hypervisor since using NUMA nodes along with memory
hotplug only works for virtio-mem.
Memory hotplug through ACPI will only happen when there's only one NUMA
node exposed to the guest, which means the _PXM method won't be needed
at all.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Make sure the unique PCI bus is tied to the default NUMA node 0, and
update the documentation to let the users know about this special case.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Sometimes we need the balloon to deflate automatically to give memory
back to the guest, especially for some low priority guest processes
under memory pressure. Enable deflate_on_oom to support this.
Usage: --balloon "size=0,deflate_on_oom=on"
Signed-off-by: Fei Li <lifei.shirley@bytedance.com>
Since using the VIRTIO configuration to expose the virtual IOMMU
topology has been deprecated, the virtio-iommu implementation must be
updated.
In order to follow the latest patchset that is about to be merged in the
upstream Linux kernel, it must rely on ACPI, and in particular the newly
introduced VIOT table to expose the information about the list of PCI
devices attached to the virtual IOMMU.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Implemented an architecture-specific function for loading a UEFI binary.
Changed the logic of loading the kernel image:
1. First try to load the image as a kernel in PE format;
2. If that fails, try again to load it as a formatless UEFI binary.
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
_DSM (Device Specific Method) is a control method that enables devices
to provide device specific control functions. The Linux kernel will
evaluate this method and then initialize preserve_config during ACPI PCI
initialization.
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
Live migration currently handles guest memory writes from the guest
through the KVM dirty page tracking and sends those dirty pages to the
destination. This patch augments the live migration support with dirty
page tracking of writes from the VMM to the guest memory (e.g. virtio
devices).
Fixes: #2458
Signed-off-by: Bo Chen <chen.bo@intel.com>
Function "GuestMemory::with_regions(_mut)" were mainly temporary methods
to access the regions in `GuestMemory` as the lack of iterator-based
access, and hence they are deprecated in the upstream vm-memory crate [1].
[1] https://github.com/rust-vmm/vm-memory/issues/133
Signed-off-by: Bo Chen <chen.bo@intel.com>
As the first step to complete live-migration with tracking dirty-pages
written by the VMM, this commit patches the dependent vm-memory crate to
the upstream version with the dirty-page-tracking capability. Most
changes are due to the updated `GuestMemoryMmap`, `GuestRegionMmap`, and
`MmapRegion` structs which are taking an additional generic type
parameter to specify what 'bitmap backend' is used.
The above changes should be transparent to the rest of the code base,
e.g. all unit/integration tests should pass without additional changes.
Signed-off-by: Bo Chen <chen.bo@intel.com>
After adding "get_interrupt_controller()" function in DeviceManager,
"enable_interrupt_controller()" became redundant, because the latter
one is the a simple wrapper on the interrupt controller.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
The function used to calculate the "gicr-typer" value has nothing to do
with DeviceManager. Now it is moved to AArch64-specific files.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
We thought we could move the control queue to the backend as it was
making some good sense. Unfortunately, doing so was a wrong design
decision as it broke the compatibility with OVS-DPDK backend.
This is why this commit moves the control queue back to the VMM side,
meaning an additional thread is being run for handling the communication
with the guest.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
With FDT, the VMM can allocate IRQs from 0 for devices.
But with ACPI, the lowest range below 32 has to be avoided.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
This commit enables the PSCI (Power State Coordination Interface)
for the AArch64 platform, which allows the VMM to manage the power
status of the guest. Also, multiple vCPUs can be brought up using
PSCI.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
This commit implements the IO Remapping Table (IORT) for AArch64.
The IORT is one of the required ACPI tables for AArch64, since
it describes the GICv3ITS node.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
This commit implements an AArch64-required ACPI table: Serial
Port Console Redirection Table (SPCR). The table provides
information about the configuration and use of the serial port
or non-legacy UART interface.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
This commit implements an AArch64-specific ACPI table: Generic
Timer Description Table (GTDT). The GTDT provides OSPM with
information about a system’s Generic Timers configuration.
The Generic Timer (GT) is a standard timer interface implemented
on ARM processor-based systems.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
Added the final PCI bus number in the MCFG table. This field is mandatory on
AArch64. On X86 it is optional.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
Simplified the definition block of CPUs on AArch64. It is not complete yet.
The guest boots, but more is to be done in the future:
- Fix the error in ACPI definition blocks (seen in boot messages)
- Implement CPU hot-plug controller
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
In migration, the vm object is created by new_from_migration with a
NULL kvm clock, so vm.set_clock will not be called during vm resume.
If the guest is using kvm-clock, the ticks will be stopped after migration.
As the clock was already saved to the snapshot, add a method to restore it
before vm resume in migration. After that, the guest's kvm-clock works well.
Signed-off-by: Ren Lei <ren.lei4@zte.com.cn>
Connecting to a restored KVM clock vm will take a long time, as the clock
is NOT restored immediately after the vm resumes from snapshot.
This is because 9ce6c3b incorrectly removed vm_snapshot.clock, and
always passed None to new_from_memory_manager, which results in
kvm_set_clock() never being called during restore from snapshot.
Fixes: 9ce6c3b
Signed-off-by: Ren Lei <ren.lei4@zte.com.cn>
Now that the control queue is correctly handled by the backend, there's
no need to handle it as well from the VMM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Before this change, the FDT was loaded at the end of RAM. The address of
the FDT was not fixed, while UEFI (edk2 for now) requires a fixed address
to find the FDT and RSDP.
Now the FDT is moved to the beginning of RAM, which is a fixed address.
The RSDP is written 2 MiB after the FDT, also a fixed address.
The kernel comes 2 MiB after the RSDP.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
These messages are predominantly during the boot process but will also
occur during events such as hotplug.
These cover all the significant steps of the boot and can be helpful for
diagnosing performance and functionality issues during the boot.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Separate the population of the memory and the HOB from the TDX
initialisation of the memory so that the latter can happen after the CPU
is initialised.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Now all crates use edition = "2018" then the majority of the "extern
crate" statements can be removed. Only those for importing macros need
to remain.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Adding the support for an OVS vhost-user backend to connect as the
vhost-user client. This means this patch introduces a new
option to our `--net` parameter. This option is called 'server' in order
to ask the VMM to run as the server for the vhost-user socket.
Fixes #1745
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Create a temporary copy of the config, add the new device and validate
that. This needs to be done separately from adding it to the config to
avoid race conditions that might result in config changes being
overwritten.
Fixes: #2564
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
To handle that devices are stored in an Option<Vec<T>> and to reduce
duplicated code, use a generic function to add the devices to the
struct.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The latest kvm-sgx code has renamed sgx_virt_epc device node
to sgx_vepc. Update cloud-hypervisor code and documentation to
follow this.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
Because the http thread no longer needs to create the api socket,
remove the socket, bind and listen syscalls from the seccomp filter.
Signed-off-by: William Douglas <william.douglas@intel.com>
Instead of using the http server's method to have it create the
fd (causing the http thread to need to support the socket, bind and
listen syscalls), create the socket fd in the vmm thread and use the
http server's new method that supports passing in this fd for the api
socket.
Signed-off-by: William Douglas <william.douglas@intel.com>
To avoid race issues where the api-socket may not be created by the
time a cloud-hypervisor caller is ready to look for it, enable the
caller to pass the api-socket fd directly.
Avoid breaking current callers by allowing the --api-socket path to be
passed as it is now in addition to through the path argument.
Signed-off-by: William Douglas <william.r.douglas@gmail.com>
In order to support using Versionize for state structures it is necessary
to use simpler, primitive, data types in the state definitions used for
snapshot restore.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Windows guests read this field upon PCI device ejection. Let's make sure
we don't return an error as this is valid. We simply return an empty u32
since the ejection is done right away upon write access, which means
there's no pending ejection that might be reported to the guest.
Here is the error that was shown during PCI device removal:
ERROR:vmm/src/device_manager.rs:3960 -- Accessing unknown location at
base 0x7ffffee000, offset 0x8
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Instead of tracking at a block level of 64 pages, we are now collecting
dirty pages one by one. This improves the efficiency of dirty memory
tracking during live migration.
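For illustration, per-page collection from a dirty bitmap looks roughly like
this:
```
// Instead of treating a whole 64-bit word (64 pages) as dirty when any bit
// is set, each set bit is turned into a single dirty page index.
fn dirty_pages(bitmap: &[u64]) -> Vec<u64> {
    let mut pages = Vec::new();
    for (word_index, word) in bitmap.iter().enumerate() {
        for bit in 0..64u64 {
            if word & (1u64 << bit) != 0 {
                pages.push(word_index as u64 * 64 + bit);
            }
        }
    }
    pages
}
```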
Signed-off-by: Bo Chen <chen.bo@intel.com>
By disabling this KVM feature, we prevent the guest from using APF
(Asynchronous Page Fault) mechanism. The kernel has recently switched to
using interrupts to notify about a page being ready, but for some
reasons, this is causing unexpected behavior with Cloud Hypervisor, as
it will make the vcpu thread spin at 100%.
While investigating the issue, it's better to disable the KVM feature to
prevent 100% CPU usage in some cases.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The original code had a generic type E. It was later replaced by a
concrete type. The code should have been simplified when the replacement
happened.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
The MCRS method returns a 64-bit memory range descriptor. The
calculation is supposed to be done as follows:
max = min + len - 1
However, every operand is represented not as a QWORD but as a combination
of two DWORDs for the high and low parts. Until now, the calculation was
done this way, please see also the inline comments:
max.lo = min.lo + len.lo //this may overflow, need to carry over to high
max.hi = min.hi + len.hi
max.hi = max.hi - 1 // subtraction needs to happen on the low part
This calculation has been corrected the following way:
max.lo = min.lo + len.lo
max.hi = min.hi + len.hi + (max.lo < min.lo) // check for overflow
max.lo = max.lo - 1 // subtract from low part
The relevant part from the generated ASL for the MCRS method:
```
Method (MCRS, 1, Serialized)
{
    Acquire (MLCK, 0xFFFF)
    \_SB.MHPC.MSEL = Arg0
    Name (MR64, ResourceTemplate ()
    {
        QWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, Cacheable, ReadWrite,
            0x0000000000000000, // Granularity
            0x0000000000000000, // Range Minimum
            0xFFFFFFFFFFFFFFFE, // Range Maximum
            0x0000000000000000, // Translation Offset
            0xFFFFFFFFFFFFFFFF, // Length
            ,, _Y00, AddressRangeMemory, TypeStatic)
    })
    CreateQWordField (MR64, \_SB.MHPC.MCRS._Y00._MIN, MINL) // _MIN: Minimum Base Address
    CreateDWordField (MR64, 0x12, MINH)
    CreateQWordField (MR64, \_SB.MHPC.MCRS._Y00._MAX, MAXL) // _MAX: Maximum Base Address
    CreateDWordField (MR64, 0x1A, MAXH)
    CreateQWordField (MR64, \_SB.MHPC.MCRS._Y00._LEN, LENL) // _LEN: Length
    CreateDWordField (MR64, 0x2A, LENH)
    MINL = \_SB.MHPC.MHBL
    MINH = \_SB.MHPC.MHBH
    LENL = \_SB.MHPC.MHLL
    LENH = \_SB.MHPC.MHLH
    MAXL = (MINL + LENL) /* \_SB_.MHPC.MCRS.LENL */
    MAXH = (MINH + LENH) /* \_SB_.MHPC.MCRS.LENH */
    If ((MAXL < MINL))
    {
        MAXH += One /* \_SB_.MHPC.MCRS.MAXH */
    }
    MAXL -= One
    Release (MLCK)
    Return (MR64) /* \_SB_.MHPC.MCRS.MR64 */
}
```
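For reference, here is a small Rust model of the corrected arithmetic (a
sketch, not code from the tree), showing the carry from the low DWORD
into the high one:
```rust
// Compute max = min + len - 1 using two 32-bit halves, mirroring the AML:
// carry into the high DWORD when the low addition wraps, then subtract
// one from the low DWORD.
fn range_max(min: (u32, u32), len: (u32, u32)) -> (u32, u32) {
    let (min_lo, min_hi) = min;
    let (len_lo, len_hi) = len;
    let max_lo = min_lo.wrapping_add(len_lo);
    let carry = (max_lo < min_lo) as u32;
    let max_hi = min_hi.wrapping_add(len_hi).wrapping_add(carry);
    (max_lo.wrapping_sub(1), max_hi)
}

fn main() {
    // min = 0xFFFF_F000, len = 0x2000 -> max = 0x1_0000_0FFF
    assert_eq!(range_max((0xFFFF_F000, 0), (0x2000, 0)), (0x0FFF, 1));
}
```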
Fixes #1800.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Fixes the current codebase so that every cargo clippy invocation can be
run with the beta toolchain without any error.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
There are two parts:
- Unconditionally zero the output area. The length of the incoming
vector has been observed to range from 1 to 4 bytes, even though just
the first byte might need to be handled. This also ensures that any
possibly unhandled offset returns a zeroed result to the caller. The
former implementation used an I/O port, which seems to behave
differently from MMIO and didn't require explicit output zeroing.
- An access with zero offset still takes place and needs to be handled
(see the sketch below).
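The following is a hypothetical sketch of a read handler applying both
points (not the actual device code):
```rust
fn mmio_read(offset: u64, data: &mut [u8]) {
    // The incoming slice may be 1 to 4 bytes long; zero all of it first
    // so any unhandled offset returns a zeroed result to the caller.
    data.iter_mut().for_each(|b| *b = 0);

    // An access with zero offset is valid and must be handled rather
    // than ignored.
    if offset == 0 && !data.is_empty() {
        data[0] = 0; // nothing to report in this sketch
    }
}
```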
Fixes #2437.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
The following is from the Hyper-V specification v6.0b.
Cpuid leaf 0x40000003 EDX:
Bit 3: Support for physical CPU dynamic partitioning events is
available.
When Windows detects that it is running under a hypervisor, it requires
this cpuid bit to be set to support dynamic CPU operations.
Cpuid leaf 0x40000004 EAX:
Bit 5: Recommend using relaxed timing for this partition. If
used, the VM should disable any watchdog timeouts that
rely on the timely delivery of external interrupts.
This bit was found to be required after seeing guest BSODs when the CPU
hotplug bit is enabled. Race conditions seem to arise after a hotplug
operation, when a system watchdog has expired.
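A minimal sketch of setting these two bits, assuming kvm_bindings'
CPUID entry type (not the exact cpu.rs code):
```rust
// Advertise the two Hyper-V bits described above:
// 0x40000003 EDX bit 3 (CPU dynamic partitioning events) and
// 0x40000004 EAX bit 5 (recommend relaxed timing).
fn patch_hyperv_cpuid(entries: &mut [kvm_bindings::kvm_cpuid_entry2]) {
    for entry in entries.iter_mut() {
        match entry.function {
            0x4000_0003 => entry.edx |= 1 << 3,
            0x4000_0004 => entry.eax |= 1 << 5,
            _ => {}
        }
    }
}
```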
Closes #1799.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
error: name `GPIOInterruptDisabled` contains a capitalized acronym
Error: --> devices/src/legacy/gpio_pl061.rs:46:5
|
46 | GPIOInterruptDisabled,
| ^^^^^^^^^^^^^^^^^^^^^ help: consider making the acronym lowercase, except the initial letter: `GpioInterruptDisabled`
|
= note: `-D clippy::upper-case-acronyms` implied by `-D warnings`
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#upper_case_acronyms
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
error: name `FinalizeTDX` contains a capitalized acronym
--> vmm/src/vm.rs:274:5
|
274 | FinalizeTDX(hypervisor::HypervisorVmError),
| ^^^^^^^^^^^ help: consider making the acronym lowercase, except the initial letter: `FinalizeTdx`
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#upper_case_acronyms
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
warning: name `AcpiPMTimerDevice` contains a capitalized acronym
--> devices/src/acpi.rs:175:12
|
175 | pub struct AcpiPMTimerDevice {
| ^^^^^^^^^^^^^^^^^ help: consider making the acronym lowercase, except the initial letter: `AcpiPmTimerDevice`
|
= note: `#[warn(clippy::upper_case_acronyms)]` on by default
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#upper_case_acronyms
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
warning: name `IORegion` contains a capitalized acronym
--> pci/src/configuration.rs:320:5
|
320 | IORegion = 0x01,
| ^^^^^^^^ help: consider making the acronym lowercase, except the initial letter (notice the capitalization): `IoRegion`
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#upper_case_acronyms
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
warning: name `LocalAPIC` contains a capitalized acronym
--> vmm/src/cpu.rs:197:8
|
197 | struct LocalAPIC {
| ^^^^^^^^^ help: consider making the acronym lowercase, except the initial letter: `LocalApic`
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#upper_case_acronyms
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
error: name `TYPE_UNKNOWN` contains a capitalized acronym
--> vm-virtio/src/lib.rs:48:5
|
48 | TYPE_UNKNOWN = 0xFF,
| ^^^^^^^^^^^^ help: consider making the acronym lowercase, except the initial letter: `Type_Unknown`
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#upper_case_acronyms
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
error: name `SDT` contains a capitalized acronym
--> acpi_tables/src/sdt.rs:27:12
|
27 | pub struct SDT {
| ^^^ help: consider making the acronym lowercase, except the initial letter: `Sdt`
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#upper_case_acronyms
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Drop the generic type E and use IrqRoutingEntry directly. This allows
dropping a number of trait bounds from the code.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
Their make_entry functions look the same now. Extract the logic to a
common function.
No functional change.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
There's no need for the code creating the passthrough_device to be
duplicated, since we can factorize it into a function used in both cases
(cold plugged and hot plugged VFIO devices).
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Extend and use the existing DeviceTree to retrieve useful information
related to PCI devices. This removes the duplication with the
pci_devices field, which was internal to the DeviceManager.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Make the code a bit clearer by changing the naming of the structure
holding the list of IRQs reserved for PCI devices. It is also changed
into an array of 32 entries since we know this is the number of PCI
slots supported.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
We define a new enum in order to classify each PCI device as either
virtio or VFIO. This is a cleaner approach than using the Any trait and
downcasting to recover the underlying object.
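For illustration, the enum might look roughly like this (placeholder
device types, not the actual vmm definitions):
```rust
use std::sync::{Arc, Mutex};

// Placeholder types standing in for the real virtio and VFIO PCI devices.
struct VirtioPciDevice;
struct VfioPciDevice;

// Each PCI device is classified explicitly instead of being stored as a
// trait object that later needs Any-based downcasting.
enum PciDeviceHandle {
    Virtio(Arc<Mutex<VirtioPciDevice>>),
    Vfio(Arc<Mutex<VfioPciDevice>>),
}

fn describe(handle: &PciDeviceHandle) -> &'static str {
    match handle {
        PciDeviceHandle::Virtio(_) => "virtio-pci device",
        PciDeviceHandle::Vfio(_) => "VFIO device",
    }
}
```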
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Introduces a tuple holding both pieces of information needed by
pci_id_list and pci_devices.
Changes pci_devices to be a BTreeMap of this new tuple.
Now that pci_devices holds the information previously provided by
pci_id_list, pci_id_list is no longer needed.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In anticipation of further factorization, the pci_id_list is now a
hashmap mapping each PCI b/d/f to its device name.
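Illustrative only, assuming the b/d/f is packed into a u32 key (the
device name is made up):
```rust
use std::collections::HashMap;

fn main() {
    // Map a PCI bus/device/function identifier to the device name.
    let mut pci_id_list: HashMap<u32, String> = HashMap::new();

    // Device 1, function 0 on bus 0, packed as bus << 8 | device << 3 | function.
    pci_id_list.insert(0x0008, String::from("_net0"));

    assert!(pci_id_list.contains_key(&0x0008));
}
```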
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Instead of relying on a PCI specific device list, we use the DeviceTree
as a reference to determine if a device name is already in use or not.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Only create the API server if we have a valid API server path. For now
this has no functional change, since there is a default API server path
in the clap handling, but it prepares for making the API server
optional.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>