Remove the hardcoded addresses.
Also remove PM_TMR_BLK as spec compliant implementation will use
X_PM_TMR_BLK over this field.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The original code uses kvm_device_attr directly outside of the
hyeprvisor crate. That leaks hypervisor details.
No functional change intended.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
This requires making get/set_lapic_reg part of the type.
For the moment we cannot provide a default variant for the new type,
because picking one will be wrong for the other hypervisor, so I just
drop the test cases that requires LapicState::default().
Signed-off-by: Wei Liu <liuwe@microsoft.com>
CpuId is an alias type for the flexible array structure type over
CpuIdEntry. The type itself and the type of the element in the array
portion are tied to the underlying hypervisor.
Switch to using CpuIdEntry slice or vector directly. The construction of
CpuId type is left to hypervisors.
This allows us to decouple CpuIdEntry from hypervisors more easily.
No functional change intended.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
We only need to do this for x86 since MSHV does not have aarch64 support
yet. This reduces unnecessary code churn.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
VmState was introduced to hold hypervisor specific VM state. KVM does
not need it and MSHV does not really use it yet.
Just drop the code. It can be easily revived once there is a need.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
Previously, we were assuming that every time an eventfd notified us,
there was only a single event waiting for us. This meant that if,
while one API request was being processed, two more arrived, the
second one would not be processed (until the next one arrived, when it
would be processed instead of that event, and so on). To fix this,
make sure we're processing the number of API and debug requests we've
been told have arrived, rather than just one. This is easy to
demonstrate by sending lots of API events and adding some sleeps to
make sure multiple events can arrive while each is being processed.
For other uses of eventfd, like the exit event, this doesn't matter —
even if we've received multiple exit events in quick succession, we
only need to exit once. So I've only made this change where receiving
an event is non-idempotent, i.e. where it matters that we process the
event the right number of times.
Technically, reset requests are also non-idempotent — there's an
observable difference between a VM resetting once, and a VM resetting
once and then immediately resetting again. But I've left that alone
for now because two resets in immediate succession doesn't sound like
something anyone would ever want to me.
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Function `system_registers` took mutable vector reference and modified
the vector content. Now change the definition to `get/set` style.
And rename to `get/set_sys_regs` to align with other functions.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
On AArch64, the function `core_registers` and `set_core_registers` are
the same thing of `get/set_regs` on x86_64. Now the names are aligned.
This will benefit supporting `gdb`.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
The VM specific signal (currently only SIGWINCH) should only be handled
when the VM is running.
The generic VMM signals (SIGINT and SIGTERM) need handling at all times.
Split the signal handling into two separate threads which have differing
lifetimes.
Tested by:
1.) Boot full VM and check resize handling (SIGWINCH) works & sending
SIGTERM leads to cleanup (tested that API socket is removed.)
2.) Start without a VM and send SIGTERM/SIGINT and observe cleanup (API
socket removed)
3.) Boot full VM, delete VM and observe 2.) holds.
4.) Boot full VM, delete VM, recreate VM and observe 1.) holds.
Fixes: #4269
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
And along with virtio-queue, we must also bump vhost-user-backend from
0.3.0 to 0.5.0 (since it relies on virtio-queue 0.4.0).
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The Linux kernel now checks for this before marking CPUs as
hotpluggable:
commit aa06e20f1be628186f0c2dcec09ea0009eb69778
Author: Mario Limonciello <mario.limonciello@amd.com>
Date: Wed Sep 8 16:41:46 2021 -0500
x86/ACPI: Don't add CPUs that are not online capable
A number of systems are showing "hotplug capable" CPUs when they
are not really hotpluggable. This is because the MADT has extra
CPU entries to support different CPUs that may be inserted into
the socket with different numbers of cores.
Starting with ACPI 6.3 the spec has an Online Capable bit in the
MADT used to determine whether or not a CPU is hotplug capable
when the enabled bit is not set.
Link: https://uefi.org/htmlspecs/ACPI_Spec_6_4_html/05_ACPI_Software_Programming_Model/ACPI_Software_Programming_Model.html?#local-apic-flags
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This check is new in the beta version of clippy and exists to avoid
potential deadlocks by highlighting when the test in an if or for loop
is something that holds a lock. In many cases we would need to make
significant refactorings to be able to pass this check so disable in the
affected crates.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
warning: you are deriving `PartialEq` and can implement `Eq`
--> vmm/src/serial_manager.rs:59:30
|
59 | #[derive(Debug, Clone, Copy, PartialEq)]
| ^^^^^^^^^ help: consider deriving `Eq` as well: `PartialEq, Eq`
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#derive_partial_eq_without_eq
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Tested:
1. SIGTERM based
2. VM shutdown/poweroff
3. Injected VM boot failure after calling Vm::setup_tty()
Fixes: #4248
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The snapshots are stored in a BTree which is ordered however as the ids
are strings lexical ordering places "11" ahead of "2". So encode the
vCPU id with zero padding so it is lexically sorted.
This fixes issues with CPU restore on aarch64.
See: #4239
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
When restoring a VM, the restore codepath will take care of mapping the
MMIO regions based on the information from the snapshot, rather than
having the mapping being performed during device creation.
When the device is created, information such as which BARs contain the
MSI-X tables are missing, preventing to perform the mapping of the MMIO
regions.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Based on recent KVM host patches (merged in Linux 5.16), it's forbidden
to call into KVM_SET_CPUID2 after the first successful KVM_RUN returned.
That means saving CPU states during the pause sequence, and restoring
these states during the resume sequence will not work with the current
design starting with kernel version 5.16.
In order to solve this problem, let's simply move the save/restore logic
to the snapshot/restore sequences rather than the pause/resume ones.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The vCPU is created and set after all the devices on a VM's boot.
There's no reason to follow a different order on the restore codepath as
this could cause some unexpected behaviors.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Combined the `GicDevice` struct in `arch` crate and the `Gic` struct in
`devices` crate.
After moving the KVM specific code for GIC in `arch`, a very thin wapper
layer `GicDevice` was left in `arch` crate. It is easy to combine it
with the `Gic` in `devices` crate.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
In order to ensure that the virtio device thread is spawned from the vmm
thread we use an asynchronous activation mechanism for the virtio
devices. This change optimises that code so that we do not need to
iterate through all virtio devices on the platform in order to find the
one that requires activation. We solve this by creating a separate short
lived VirtioPciDeviceActivator that holds the required state for the
activation (e.g. the clones of the queues) this can then be stored onto
the device manager ready for asynchronous activation.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Based on the newly added guest_debug feature, this patch adds http
endpoint support.
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Co-authored-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The crash tool use a special note segment which named 'QEMU' to
analyze kaslr info and so on. If we don't add the 'QEMU' note
segment, crash tool can't find linux version to move on.
For now, the most convenient way is to add 'QEMU' note segment to
make crash tool happy.
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Guest memory is needed for analysis in crash tool, so save it
for coredump.
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Co-authored-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
It's useful to dump the guest, which named coredump so that crash
tool can be used to analysize it when guest hung up.
Let's add GuestDebuggable trait and Coredumpxxx error to support
coredump firstly.
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Co-authored-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The error message incorrectly said that the user was trying to combine
cache_size without dax whereas it is only usuable with dax.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Remove the code from the DeviceManager that prepares the DAX cache since
the functionality has now been removed.
Fixes: #3889
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
`GicDevice` trait was defined for the common part of GicV3 and ITS.
Now that the standalone GicV3 do not exist, `GicDevice` is not needed.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
This reverts commit f160572f9d.
There has been increased flakiness around the live migration tests since
this was merged. Speculatively reverting to see if there is increased
stability.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
In order to ensure that the virtio device thread is spawned from the vmm
thread we use an asynchronous activation mechanism for the virtio
devices. This change optimises that code so that we do not need to
iterate through all virtio devices on the platform in order to find the
one that requires activation. We solve this by creating a separate short
lived VirtioPciDeviceActivator that holds the required state for the
activation (e.g. the clones of the queues) this can then be stored onto
the device manager ready for asynchronous activation.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Latest cargo beta version raises warnings about unused macro rules.
Simply remove them to fix the beta build.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
There is no need to include serde_derive separately,
as it can be specified as serde feature instead.
Signed-off-by: Maksym Pavlenko <pavlenko.maksym@gmail.com>
Explicitly re-export types from the hypervisor specific modules. This
makes it much clearer what the common functionality that is exposed is.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
And thus only export what is necessary through a `pub use`. This is
consistent with some of the other modules and makes it easier to
understand what the external interface of the hypervisor crate is.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
By taking advantage of the fact that IrqRoutingEntry is exported by the
hypervisor crate (that is typedef'ed to the hypervisor specific version)
then the code for handling the MsiInterruptManager can be simplified.
This is particularly useful if in this future it is not a typedef but
rather a wrapper type.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This removes the requirement to leak as many datastructures from the
hypervisor crate into the vmm crate.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The trait and functionality is about operations on the VM rather than
the VMM so should be named appropriately. This clashed with with
existing struct for the concrete implementation that was renamed
appropriately.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Whenever going through the codepath of loading a RAW firmware, we always
add an extra RAM region to the guest memory through the memory manager.
But we must be careful to use the updated guest memory rather than a
previous reference that wasn't containing the new region, as this can
lead to the following error:
VmBoot(FirmwareLoad(InvalidGuestAddress(GuestAddress(4290772992))))
This is fixed by the current patch, getting the latest reference onto
the guest memory from the memory manager right after the new region has
been added.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This is required when hot-removing a vfio-user device. Details code path
below:
Thread 6 "vcpu0" received signal SIGSYS, Bad system call.
[Switching to Thread 0x7f8196889700 (LWP 2358305)]
0x00007f8196dae7ab in shutdown () at ../sysdeps/unix/syscall-template.S:78
78 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
(gdb) bt
0x00007f8196dae7ab in shutdown () at ../sysdeps/unix/syscall-template.S:78
0x000056189240737d in std::sys::unix::net::Socket::shutdown ()
at library/std/src/sys/unix/net.rs:383
std::os::unix::net::stream::UnixStream::shutdown () at library/std/src/os/unix/net/stream.rs:479
0x000056189210e23d in vfio_user::Client::shutdown (self=0x7f8190014300)
at vfio_user/src/lib.rs:787
0x00005618920b9d02 in <pci::vfio_user::VfioUserPciDevice as core::ops::drop::Drop>::drop (
self=0x7f819002d7c0) at pci/src/vfio_user.rs:551
0x00005618920b8787 in core::ptr::drop_in_place<pci::vfio_user::VfioUserPciDevice> ()
at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/ptr/mod.rs:188
0x00005618920b92e3 in core::ptr::drop_in_place<core::cell::UnsafeCell<dyn pci::device::PciDevice>>
() at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/ptr/mod.rs:188
0x00005618920b9362 in core::ptr::drop_in_place<std::sync::mutex::Mutex<dyn pci::device::PciDevice>> () at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/ptr/mod.rs:188
0x00005618920d8a3e in alloc::sync::Arc<T>::drop_slow (self=0x7f81968852b8)
at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/alloc/src/sync.rs:1092
0x00005618920ba273 in <alloc::sync::Arc<T> as core::ops::drop::Drop>::drop (self=0x7f81968852b8)
at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/alloc/src/sync.rs:1688
0x00005618920b76fb in core::ptr::drop_in_place<alloc::sync::Arc<std::sync::mutex::Mutex<dyn pci::device::PciDevice>>> ()
at /rustc/7737e0b5c4103216d6fd8cf941b7ab9bdbaace7c/library/core/src/ptr/mod.rs:188
0x0000561891b5e47d in vmm::device_manager::DeviceManager::eject_device (self=0x7f8190009600,
pci_segment_id=0, device_id=3) at vmm/src/device_manager.rs:4000
0x0000561891b674bc in <vmm::device_manager::DeviceManager as vm_device:🚌:BusDevice>::write (
self=0x7f8190009600, base=70368744108032, offset=8, data=&[u8](size=4) = {...})
at vmm/src/device_manager.rs:4625
0x00005618921927d5 in vm_device:🚌:Bus::write (self=0x7f8190006e00, addr=70368744108040,
data=&[u8](size=4) = {...}) at vm-device/src/bus.rs:235
0x0000561891b72e10 in <vmm::vm::VmOps as hypervisor::vm::VmmOps>::mmio_write (
self=0x7f81900097b0, gpa=70368744108040, data=&[u8](size=4) = {...}) at vmm/src/vm.rs:378
0x0000561892133ae2 in <hypervisor::kvm::KvmVcpu as hypervisor::cpu::Vcpu>::run (
self=0x7f8190013c90) at hypervisor/src/kvm/mod.rs:1114
0x0000561891914e85 in vmm::cpu::Vcpu::run (self=0x7f819001b230) at vmm/src/cpu.rs:348
0x000056189189f2cb in vmm::cpu::CpuManager::start_vcpu::{{closure}}::{{closure}} ()
at vmm/src/cpu.rs:953
Signed-off-by: Bo Chen <chen.bo@intel.com>
Since both Net and vhost_user::Net implement the Migratable trait, we
can factorize the common part to simplify the code related to the net
creation.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Since both Block and vhost_user::Blk implement the Migratable trait, we
can factorize the common part to simplify the code related to the disk
creation.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Extend the validate() function for both DiskConfig and NetConfig so that
we return an error if a vhost-user-block or vhost-user-net device is
expected to be placed behind the virtual IOMMU. Since these devices
don't support this feature, we can't allow iommu to be set to true in
these cases.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This is a cleaner approach to handling the I/O port write to 0x80.
Whilst doing this also use generate the timestamp at the start of the VM
creation. For consistency use the same timestamp for the ARM equivalent.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
We don't use the VmmOps trait directly for manipulating memory in the
core of the VMM as it's really designed for the MSHV crate to handle
instruction decoding. As I plan to make this trait MSHV specific to
allow reduced locking for MMIO and PIO handling when running on KVM this
use should be removed.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
To correctly map MMIO regions to the guest, we will need to wait for valid
MMIO region information which is generated from 'PciDevice::allocate_bars()'
(as a part of 'DeviceManager::add_pci_device()').
Signed-off-by: Bo Chen <chen.bo@intel.com>
For devices that cannot be named by the user use the "__" prefix to
identify them as internal devices. Check that any identifiers provided
in the config do not clash with those internal names. This prevents the
user from creating a disk such as "__serial" which would then cause a
failure in unpredictable manner.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Whenever a device (virtio, vfio, vfio-user or vdpa) is hotplugged, we
must verify the provided identifier is unique, otherwise we must return
an error.
Particularly, this will prevent issues with identifiers for serial,
console, IOAPIC, balloon, rng, watchdog, iommu and gpio since all of
these are hardcoded by the VMM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
All hotpluggable devices were properly removed from the VmConfig when a
remove-device command was issued, except for the "fs" type. Fix this
lack of support as it is causing the integration tests to fail with the
recent addition of verifying that identifiers are unique.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The device identifiers generated from the DeviceManager were not
guaranteed to be unique since they were not taking the list of
identifiers provided through the configuration.
By returning the list of unique identifiers from the configuration, and
by providing it to the DeviceManager, the generation of new identifiers
can rely both on the DeviceTree and the list of IDs from the
configuration.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The socket will safely deleted on shutdown and so it is not necessary to
delete the API socket when starting the HTTP server.
Fixes: #4026
Signed-off-by: LiHui <andrewli@kubesphere.io>
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Start loading the kernel as possible in the VM in a separate thread.
Whilst it is loading other work can be carried out such as initialising
the devices.
The biggest performance improvement is seen with a more complex set of
devices. If using e.g. four virtio-net devices then the time to start the
kernel improves by 20-30ms. With the simplest configuration the
improvement was of the order of 2-3ms.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This allows the same code for generating the kernel command line to be
used on both aarch64 and x86_64 when the latter starts loading the
kernel in asynchronously.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This is not required for x86_64 and maintains a tight coupling between
kernel loading and the DeviceManager.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This reverts commit 87eed369cd.
The reason we're reverting this is that OpenAPI Specification[0] doesn't
know how to deal with unsigned types. :-/
Right now the best to do is keep it as it's, as an int64, and try to fix
OpenAPI, or even switch to swagger, as the latter knows how to properly
deal with those. However, switching to swagger is far from being an 1:1
transition and will require time to experiment, thus reverting this for
now seems the best approach.
[0]: https://github.com/OAI/OpenAPI-Specification/blob/main/versions/3.1.0.md#data-types
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
The Token Bucket fields are, on the Cloud Hypervisor side, u64.
However, we expose those as int64 in the OpenAPI YAML file.
With that in mind, let's adjust the yaml file to expose those as uint64.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This means that the automatic enabling of the virtio-iommu will also be
applied to VMs creates via the API as well as the CLI.
Fixes: #4016
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
If using the ACPI based hotplug only memory can be added so if the
hotplug RAM size is the same as the boot RAM size then do not include
the memory manager DSDT entries.
Also: this change simplifies the code marginally by making the
HotplugMethod enum Copyable.
This was identified from the following perf output:
1.78% 0.00% vmm cloud-hypervisor [.] <vmm::memory_manager::MemorySlots as acpi_tables::aml::Aml>::append_aml_bytes
|
---<vmm::memory_manager::MemorySlots as acpi_tables::aml::Aml>::append_aml_bytes
<vmm::memory_manager::MemorySlot as acpi_tables::aml::Aml>::append_aml_bytes
acpi_tables::aml::Name::new
<acpi_tables::aml::Path as acpi_tables::aml::Aml>::append_aml_bytes
__libc_malloc
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
No further changes are necessary that adding a #[derive(Error)] as there
is a manual implementation of Display.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Extend VfioCommon structure to own the MSI interrupt manager. This will
be useful for implementing the restore code path.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This carries a string that is exposed via DMI/SMBIOS and is particularly
useful for cloud-init initialisation.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Use a single enum member for representing errors from the internal API.
This avoids the ugly duplication of the API call name in the error
message:
e.g.
$ target/debug/ch-remote --api-socket /tmp/api resize --cpus 2
Error running command: Server responded with an error: InternalServerError: VmResize(VmResize(CpuManager(DesiredVCpuCountExceedsMax)))
Becomes:
$ target/debug/ch-remote --api-socket /tmp/api resize --cpus 2
Error running command: Server responded with an error: InternalServerError: ApiError(VmResize(CpuManager(DesiredVCpuCountExceedsMax)))
Signed-off-by: Rob Bradford <robert.bradford@intel.com>