Assigning KVM_IRQFD (when unmasking a guest IRQ) after
KVM_SET_GSI_ROUTING can avoid kernel panic on the guest that is not
patched with commit a80ced6ea514 (KVM: SVM: fix panic on out-of-bounds
guest IRQ) on AMD systems.
Meanwhile, it is required to deassign KVM_IRQFD (when masking a guest
IRQ) before KVM_SET_GSI_ROUTING (see #3827).
Fixes: #6353
Signed-off-by: Yi Wang <foxywang@tencent.com>
Signed-off-by: Bo Chen <chen.bo@intel.com>
The 'NetConfig' may contain FDs which can't be serialized correctly, as
FDs can only be donated from another process via a Unix domain socket
with `SCM_RIGHTS`. To avoid false use of the serialized FDs, this patch
explicitly set 'NetConfig' FDs as invalid for (de)serialization.
See: #6286
Signed-off-by: Bo Chen <chen.bo@intel.com>
warning: assigning the result of `Clone::clone()` may be inefficient
--> vmm/src/device_manager.rs:4188:17
|
4188 | id = child_id.clone();
| ^^^^^^^^^^^^^^^^^^^^^ help: use `clone_from()`: `id.clone_from(child_id)`
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#assigning_clones
= note: `#[warn(clippy::assigning_clones)]` on by default
Signed-off-by: Rob Bradford <rbradford@rivosinc.com>
A deadlock can happen from the destination VM of live upgrade or
migration due to waiting on paused device worker threads. For example,
when a serialization error happens after the `DeviceManager` struct is
restored (where all virtio device worker threads are spawned but in
paused/parked state), a deadlock will happen from
`DeviceManager::drop()`, as it blocks for waiting worker threads to
join.
This patch ensures that we wake up all device (mostly virtio) worker
threads before we block for them to join.
Signed-off-by: Bo Chen <chen.bo@intel.com>
Currently, we only support injecting NMI for KVM guests but we can do
the same for MSHV guests as well to have feature parity.
Signed-off-by: Jinank Jain <jinankjain@microsoft.com>
This commit ensures that the HttpApi thread flushes all the responses
before the application shuts down. Without this step, in case of a
VmmShutdown request the application might terminate before the
thread sends a response.
Fixes: #6247
Signed-off-by: Alexandru Matei <alexandru.matei@uipath.com>
On platforms where PCIe P2P is supported, inject a PCI capability into
NVIDIA GPU to indicate support.
Signed-off-by: Thomas Barrett <tbarrett@crusoeenergy.com>
Host data that is passed to the hypervisor. Then
the firmware includes the data in the attestation report.
The data might include any key or secret that the SevSnp guest
might need later.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
The host data provided at launch. Data is passed
to the hypervisor during the completion of the
isolated import.
Host Data provided by the hypervisor during guest launch.
The firmware includes this value in all attestation
reports for the guest.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
With the nightly toolchain (2024-02-18) cargo check will flag up
redundant imports either because they are pulled in by the prelude on
earlier match.
Remove those redundant imports.
Signed-off-by: Rob Bradford <rbradford@rivosinc.com>
When a guest running on a terminal reboots, the sigwinch_listener
subprocess exits and a new one restarts. The parent never wait()s
for children, so the old subprocess remains as a zombie. With further
reboots, more and more zombies build up.
As there are no other children for which we want the exit status,
the easiest fix is to take advantage of the implicit reaping specified
by POSIX when we set the disposition of SIGCHLD to SIG_IGN.
For this to work, we also need to set the correct default exit signal
of SIGCHLD when using clone3() CLONE_CLEAR_SIGHAND. Unlike the fallback
fork() path, clone_args::default() initialises the exit signal to zero,
which results in a child with non-standard reaping behaviour.
Signed-off-by: Chris Webb <chris@arachsys.com>
Allow cloud-hypervisor to direct boot the bzImage kernel format using
the regular 32 bit entry point. This can share the memory and vcpu
setup with the regular PVH boot code, but requires the setup of the
'zero page'.
Signed-off-by: Stefan Nuernberger <stefan.nuernberger@cyberus-technology.de>
In case of SEV-SNP guest devices use sw-iotlb to gain access guest
memory for DMA. For that F_IOMMU/F_ACCESS_PLATFORM must be exposed in
the feature set of virtio devices.
Signed-off-by: Jinank Jain <jinankjain@microsoft.com>
Signed-off-by: Muminul Islam <muislam@microsoft.com>
The 'generate_ram_ranges' function currently hardcodes the assumption
that there are only 2 E820 RAM entries. This is not flexible enough to
handle vendor specific memory holes. Returning a Vec is also more
convenient for users of this function.
Signed-off-by: Thomas Barrett <tbarrett@crusoeenergy.com>
Since the ACPI tables are generated inside the IGVM file in case of
SEV-SNP guest. So, we don't need to generate it inside the cloud
hypervisor.
Signed-off-by: Jinank Jain <jinankjain@microsoft.com>
Signed-off-by: Muminul Islam <muislam@microsoft.com>
We occasionally saw cloud-hypervisor crashed due to seccomp violations. The
coredumps showed the HTTP API thread crashing after it attempted to call
sched_yield(). The call came from rust stdlib's mpmc module, which calls
sched_yield() if several attempts to busy-wait for a condition to fulfil fall
short.
Since the system call is harmless and it comes from the stdlib, I opted to allow
all threads to call it.
Signed-off-by: Peteris Rudzusiks <rye@stripe.com>
Currently the only way to set the affinity for virtio block threads is
to boot the VM, search for the tid of each of the virtio block threads,
then set the affinity manually. This commit adds an option to pin virtio
block queues to specific host cpus (similar to pinning vcpus to host
cpus). A queue_affinity option has been added to the disk flag in
the cli to specify a mapping of queue indices to host cpus.
Signed-off-by: acarp <acarp@crusoeenergy.com>
Traditional way to configure vcpu don't work for sev-snp guests. All the
vCPU configuration for SEV-SNP guest is provided via VMSA.
Signed-off-by: Jinank Jain <jinankjain@microsoft.com>
Signed-off-by: Muminul Islam <muislam@microsoft.com>
On guests with large amounts of memory, using the `prefault` option can
lead to a very long boot time. This commit implements the strategy
taken by QEMU to prefault memory in parallel using multiple threads,
decreasing the time to allocate memory for large guests by
an order of magnitude or more.
For example, this commit reduces the time to allocate memory for a
guest configured with 704 GiB of memory on 1 NUMA node using 1 GiB
hugepages from 81.44134669s to just 6.865287881s.
Signed-off-by: Sean Banko <sbanko@crusoeenergy.com>
Beta clippy fix:
warning: initializer for `thread_local` value can be made `const`
--> vmm/src/sigwinch_listener.rs:27:40
|
27 | static TX: RefCell<Option<File>> = RefCell::new(None);
| ^^^^^^^^^^^^^^^^^^ help: replace with: `const { RefCell::new(None) }`
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#thread_local_initializer_can_be_made_const
= note: `#[warn(clippy::thread_local_initializer_can_be_made_const)]` on by default
Signed-off-by: Rob Bradford <rbradford@rivosinc.com>
Beta clippy fix:
warning: this call to `as_ref.map(...)` does nothing
--> vmm/src/device_manager.rs🔢9
|
1234 | self.console_resize_pipe.as_ref().map(Arc::clone)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ help: try: `self.console_resize_pipe.clone()`
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#useless_asref
= note: `#[warn(clippy::useless_asref)]` on by default
Signed-off-by: Rob Bradford <rbradford@rivosinc.com>
When we import a page, we have a page with
some data or empty, empty does not mean there is no data,
it rather means it's full of zeros. We can skip writing the
data as guest memory of the page is already zeroed.
A page could be partially filled and the rest of the content is zero.
Our IGVM generation tool only fills data here if there is some data
without zeros. Rest of them are padded. We only write data
without padding and compare whether we complete writing
the buffer content. Still it's a full page and update the variable
with length of the full page.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
This commit adds the debug-console (or debugcon) device to CHV. It is a
very simple device on I/O port 0xe9 supported by QEMU and BOCHS. It is
meant for printing information as easy as possible, without any
necessary configuration from the guest at all.
It is primarily interesting to OS/kernel and firmware developers as they
can produce output as soon as the guest starts without any configuration
of a serial device or similar. Furthermore, a kernel hacker might use
this device for information of type B whereas information of type A are
printed to the serial device.
This device is not used by default by Linux, Windows, or any other
"real" OS, but only by toy kernels and during firmware development.
In the CLI, it can be configured similar to --console or --serial with
the --debug-console parameter.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
This patch bumps the following crates, including `kvm-bindings@0.7.0`*,
`kvm-ioctls@0.16.0`**, `linux-loader@0.11.0`, `versionize@0.2.0`,
`versionize_derive@0.1.6`***, `vhost@0.10.0`,
`vhost-user-backend@0.13.1`, `virtio-queue@0.11.0`, `vm-memory@0.14.0`,
`vmm-sys-util@0.12.1`, and the latest of `vfio-bindings`, `vfio-ioctls`,
`mshv-bindings`,`mshv-ioctls`, and `vfio-user`.
* A fork of the `kvm-bindings` crate is being used to support
serialization of various structs for migration [1]. Also, code changes
are made to accommodate the updated `struct xsave` from the Linux
kernel. Note: these changes related to `struct xsave` break
live-upgrade.
** The new `kvm-ioctls` crate introduced breaking changes for
the `get/set_one_reg` API on `aarch64` [2], so code changes are made to
the new APIs.
*** A fork of the `versionize_derive` crate is being used to support
versionize on packed structs [3].
[1] https://github.com/cloud-hypervisor/kvm-bindings/tree/ch-v0.7.0
[2] https://github.com/rust-vmm/kvm-ioctls/pull/223
[3] https://github.com/cloud-hypervisor/versionize_derive/tree/ch-0.1.6Fixes: #6072
Signed-off-by: Bo Chen <chen.bo@intel.com>
Set the SEV control register so we know where to
start running. This register configures the
SEV feature control state on a virtual processor.
Signed-off-by: Jinank Jain <jinankjain@microsoft.com>
Signed-off-by: Muminul Islam <muislam@microsoft.com>
These Default implementations either don't produce valid configs, are
no longer used outside of tests, or both.
For the tests, we can define our own local "default" values that make
the most sense for the tests, without worrying about what's
a (somewhat) sensible "global" default value.
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Bumping anyhow crate from 1.0.75 to 1.0.79 will cause seccomp
failures through integration tests. Newly added backtrace support
relies on readlink and many other syscalls.
Issue noticed with test_api_http_pause_resume test, where second time
of VM PAUSE or VM RESUME prints error and causes panic.
Noticed that panic message in a thread which is not allowed to write
output triggered the issue.
So implementing Display trait for HttpError and ApiError enums to avoid
adding many syscalls to seccomp filter section.
Signed-off-by: Ravi kumar Veeramally <ravikumar.veeramally@intel.com>
On hosts with >256 cpus, setting the cpu affinity to a host cpu index
>255 will return an error because type of `host_cpu` is `u8`.
This commit changes the type of `host_cpu` to `usize` to remove this
limitation.
Signed-off-by: Sean Banko <sbanko@crusoeenergy.com>