With the nightly toolchain (2024-02-18) cargo check will flag up
redundant imports either because they are pulled in by the prelude on
earlier match.
Remove those redundant imports.
Signed-off-by: Rob Bradford <rbradford@rivosinc.com>
When a guest running on a terminal reboots, the sigwinch_listener
subprocess exits and a new one restarts. The parent never wait()s
for children, so the old subprocess remains as a zombie. With further
reboots, more and more zombies build up.
As there are no other children for which we want the exit status,
the easiest fix is to take advantage of the implicit reaping specified
by POSIX when we set the disposition of SIGCHLD to SIG_IGN.
For this to work, we also need to set the correct default exit signal
of SIGCHLD when using clone3() CLONE_CLEAR_SIGHAND. Unlike the fallback
fork() path, clone_args::default() initialises the exit signal to zero,
which results in a child with non-standard reaping behaviour.
Signed-off-by: Chris Webb <chris@arachsys.com>
Allow cloud-hypervisor to direct boot the bzImage kernel format using
the regular 32 bit entry point. This can share the memory and vcpu
setup with the regular PVH boot code, but requires the setup of the
'zero page'.
Signed-off-by: Stefan Nuernberger <stefan.nuernberger@cyberus-technology.de>
In case of SEV-SNP guest devices use sw-iotlb to gain access guest
memory for DMA. For that F_IOMMU/F_ACCESS_PLATFORM must be exposed in
the feature set of virtio devices.
Signed-off-by: Jinank Jain <jinankjain@microsoft.com>
Signed-off-by: Muminul Islam <muislam@microsoft.com>
The 'generate_ram_ranges' function currently hardcodes the assumption
that there are only 2 E820 RAM entries. This is not flexible enough to
handle vendor specific memory holes. Returning a Vec is also more
convenient for users of this function.
Signed-off-by: Thomas Barrett <tbarrett@crusoeenergy.com>
Since the ACPI tables are generated inside the IGVM file in case of
SEV-SNP guest. So, we don't need to generate it inside the cloud
hypervisor.
Signed-off-by: Jinank Jain <jinankjain@microsoft.com>
Signed-off-by: Muminul Islam <muislam@microsoft.com>
We occasionally saw cloud-hypervisor crashed due to seccomp violations. The
coredumps showed the HTTP API thread crashing after it attempted to call
sched_yield(). The call came from rust stdlib's mpmc module, which calls
sched_yield() if several attempts to busy-wait for a condition to fulfil fall
short.
Since the system call is harmless and it comes from the stdlib, I opted to allow
all threads to call it.
Signed-off-by: Peteris Rudzusiks <rye@stripe.com>
Currently the only way to set the affinity for virtio block threads is
to boot the VM, search for the tid of each of the virtio block threads,
then set the affinity manually. This commit adds an option to pin virtio
block queues to specific host cpus (similar to pinning vcpus to host
cpus). A queue_affinity option has been added to the disk flag in
the cli to specify a mapping of queue indices to host cpus.
Signed-off-by: acarp <acarp@crusoeenergy.com>
Traditional way to configure vcpu don't work for sev-snp guests. All the
vCPU configuration for SEV-SNP guest is provided via VMSA.
Signed-off-by: Jinank Jain <jinankjain@microsoft.com>
Signed-off-by: Muminul Islam <muislam@microsoft.com>
On guests with large amounts of memory, using the `prefault` option can
lead to a very long boot time. This commit implements the strategy
taken by QEMU to prefault memory in parallel using multiple threads,
decreasing the time to allocate memory for large guests by
an order of magnitude or more.
For example, this commit reduces the time to allocate memory for a
guest configured with 704 GiB of memory on 1 NUMA node using 1 GiB
hugepages from 81.44134669s to just 6.865287881s.
Signed-off-by: Sean Banko <sbanko@crusoeenergy.com>
Beta clippy fix:
warning: initializer for `thread_local` value can be made `const`
--> vmm/src/sigwinch_listener.rs:27:40
|
27 | static TX: RefCell<Option<File>> = RefCell::new(None);
| ^^^^^^^^^^^^^^^^^^ help: replace with: `const { RefCell::new(None) }`
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#thread_local_initializer_can_be_made_const
= note: `#[warn(clippy::thread_local_initializer_can_be_made_const)]` on by default
Signed-off-by: Rob Bradford <rbradford@rivosinc.com>
Beta clippy fix:
warning: this call to `as_ref.map(...)` does nothing
--> vmm/src/device_manager.rs🔢9
|
1234 | self.console_resize_pipe.as_ref().map(Arc::clone)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ help: try: `self.console_resize_pipe.clone()`
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#useless_asref
= note: `#[warn(clippy::useless_asref)]` on by default
Signed-off-by: Rob Bradford <rbradford@rivosinc.com>
When we import a page, we have a page with
some data or empty, empty does not mean there is no data,
it rather means it's full of zeros. We can skip writing the
data as guest memory of the page is already zeroed.
A page could be partially filled and the rest of the content is zero.
Our IGVM generation tool only fills data here if there is some data
without zeros. Rest of them are padded. We only write data
without padding and compare whether we complete writing
the buffer content. Still it's a full page and update the variable
with length of the full page.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
This commit adds the debug-console (or debugcon) device to CHV. It is a
very simple device on I/O port 0xe9 supported by QEMU and BOCHS. It is
meant for printing information as easy as possible, without any
necessary configuration from the guest at all.
It is primarily interesting to OS/kernel and firmware developers as they
can produce output as soon as the guest starts without any configuration
of a serial device or similar. Furthermore, a kernel hacker might use
this device for information of type B whereas information of type A are
printed to the serial device.
This device is not used by default by Linux, Windows, or any other
"real" OS, but only by toy kernels and during firmware development.
In the CLI, it can be configured similar to --console or --serial with
the --debug-console parameter.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
This patch bumps the following crates, including `kvm-bindings@0.7.0`*,
`kvm-ioctls@0.16.0`**, `linux-loader@0.11.0`, `versionize@0.2.0`,
`versionize_derive@0.1.6`***, `vhost@0.10.0`,
`vhost-user-backend@0.13.1`, `virtio-queue@0.11.0`, `vm-memory@0.14.0`,
`vmm-sys-util@0.12.1`, and the latest of `vfio-bindings`, `vfio-ioctls`,
`mshv-bindings`,`mshv-ioctls`, and `vfio-user`.
* A fork of the `kvm-bindings` crate is being used to support
serialization of various structs for migration [1]. Also, code changes
are made to accommodate the updated `struct xsave` from the Linux
kernel. Note: these changes related to `struct xsave` break
live-upgrade.
** The new `kvm-ioctls` crate introduced breaking changes for
the `get/set_one_reg` API on `aarch64` [2], so code changes are made to
the new APIs.
*** A fork of the `versionize_derive` crate is being used to support
versionize on packed structs [3].
[1] https://github.com/cloud-hypervisor/kvm-bindings/tree/ch-v0.7.0
[2] https://github.com/rust-vmm/kvm-ioctls/pull/223
[3] https://github.com/cloud-hypervisor/versionize_derive/tree/ch-0.1.6Fixes: #6072
Signed-off-by: Bo Chen <chen.bo@intel.com>
Set the SEV control register so we know where to
start running. This register configures the
SEV feature control state on a virtual processor.
Signed-off-by: Jinank Jain <jinankjain@microsoft.com>
Signed-off-by: Muminul Islam <muislam@microsoft.com>
These Default implementations either don't produce valid configs, are
no longer used outside of tests, or both.
For the tests, we can define our own local "default" values that make
the most sense for the tests, without worrying about what's
a (somewhat) sensible "global" default value.
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Bumping anyhow crate from 1.0.75 to 1.0.79 will cause seccomp
failures through integration tests. Newly added backtrace support
relies on readlink and many other syscalls.
Issue noticed with test_api_http_pause_resume test, where second time
of VM PAUSE or VM RESUME prints error and causes panic.
Noticed that panic message in a thread which is not allowed to write
output triggered the issue.
So implementing Display trait for HttpError and ApiError enums to avoid
adding many syscalls to seccomp filter section.
Signed-off-by: Ravi kumar Veeramally <ravikumar.veeramally@intel.com>
On hosts with >256 cpus, setting the cpu affinity to a host cpu index
>255 will return an error because type of `host_cpu` is `u8`.
This commit changes the type of `host_cpu` to `usize` to remove this
limitation.
Signed-off-by: Sean Banko <sbanko@crusoeenergy.com>
Uses of the old ApiRequest enum conflated two different concerns:
identifying an API request endpoint, and storing data for an API
request. This led to ApiRequest values being passed around with junk
data just to communicate a request type, which forced all API request
body types to implement Default, which in some cases doesn't make any
sense — what's the "default" path for a vhost-user socket? The
nonsensical Default values have led to tests relying on being able to
use nonsensical data, which is an impediment to adding better
validation for these types.
Rather than having API request types be represented by an enum, which
has to carry associated body data everywhere it's used, it makes more
sense to represent API request types as trait objects. These can have
an associated type for the type of the request body, and this makes it
possible to pass API request types and data around as siblings in a
type-safe way without forcing them into a single value even where it
doesn't make sense. Trait objects also give us dynamic dispatch,
which lets us get rid of several large match blocks.
To keep it possible to fuzz the HTTP API, all the Vmm methods called
by the HTTP API are pulled out into a trait, so the fuzzer can provide
its own stub implementation of the VMM.
Signed-off-by: Alyssa Ross <hi@alyssa.is>
The VIRTIO specification[1] says:
> The upper 32 bits of the CID are reserved and zeroed.
We should therefore not allow the user to supply a VSOCK CID with
those bits set. To accomplish this, limit the public API of the
virtio-vsock device to only accept 32-bit CIDs, while still using
64-bit CIDs internally since that's how virtio-vsock works.
[1]: https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html#x1-4400004
Signed-off-by: Alyssa Ross <hi@alyssa.is>
I accidentally ran a VM with CID 2 (VMADDR_CID_HOST), and very strange
and difficult to debug behavior ensued. I don't think a virtio-vsock
device should be allowed to have any of the special CIDs
(VMADDR_CID_ANY, VMADDR_CID_HYPERVISOR, VMADDR_CID_LOCAL, VMADDR_CID_HOST).
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Complete the isolated import, telling the
Microsoft hypervisor that import is done so that
MSHV can issue SNP_LAUNCH_FINISH command.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
Import all the isolated pages after parsing is
done on the iGVM file. Hypervisor adds those
pages for PSP measurement(part of the hashing).
Signed-off-by: Muminul Islam <muislam@microsoft.com>
Add a 'rate_limit_groups' field to VmConfig that defines a set of
named RateLimiterGroups.
When the 'rate_limit_group' field of DiskConfig is defined, all
virtio-blk queues will be rate-limited by a shared RateLimiterGroup.
The lifecycle of all RateLimiterGroups is tied to the Vm.
A RateLimiterGroup may exist even if no Disks are configured to use
the RateLimiterGroup. Disks may be hot-added or hot-removed from the
RateLimiterGroup.
When the 'rate_limiter' field of DiskConfig is defined, we construct
an anonymous RateLimiterGroup whose lifecycle is tied to the Disk.
This is primarily done for api backwards compatability. Importantly,
the behavior is not the same! This implementation rate_limits the
aggregate bandwidth / iops of an individual disk rather than the
bandwidth / iops of an individual queue of a disk.
When neither the 'rate_limit_group' or the 'rate_limiter' fields of
DiskConfig is defined, the Disk is not rate-limited.
Signed-off-by: Thomas Barrett <tbarrett@crusoeenergy.com>
This PR addresses a bug in which the cpu topology of a guest
with non power-of-two number of cores is incorrect. For example,
in some contexts, a virtual machine with 2-sockets and 12-cores
will incorrectly believe that 16 cores are on socket 1 and 8
cores are on socket 2. In other cases, common topology enumeration
software such as hwloc will crash.
The root of the problem was the way that cloud-hypervisor generates
apic_id. On x86_64, the (x2) apic_id embeds information about cpu
topology. The cpuid instruction is primarily used to discover the
number of sockets, dies, cores, threads, etc. Using this information,
the (x2) apic_id is masked to determine which {core, die, socket} the
cpu is on. When the cpu topology is not a power of two
(e.g. a 12-core machine), this requires non-contiguous (x2) apic_id.
Signed-off-by: Thomas Barrett <tbarrett@crusoeenergy.com>
For SEV-SNP guests we need to provide the extended memory. It follows a
very simple layout and very similar to other x86 guests.
First segment: [HIGH_RAM_START - MEM_32BIT_RESERVED_START]
PCI hole: [MEM_32BIT_RESERVED_START - RAM_64BIT_START]
Second segment: [RAM_64BIT_START - RAM_END]
Fixes#5993
Signed-off-by: Jinank Jain <jinankjain@microsoft.com>
There is no requirement to call copy_from_slice, since all the member
variables are identical and we can directly assign them value.
Signed-off-by: Jinank Jain <jinankjain@microsoft.com>
Currently there are some inconsistencies in Cargo.toml which is causing
the following warnings during the build process:
Error parsing Cargo.toml manifest, fallback to caching entire file:
Invalid TOML document: expected key-value, found comma
Signed-off-by: Jinank Jain <jinankjain@microsoft.com>
vmm: Add igvm module and loader module
Add a separate module named igvm to the vmm crate
with definitions to parse and load igvm to the guest memory.
Signed-off-by: Jinank Jain <jinankjain@microsoft.com>
Signed-off-by: Muminul Islam <muislam@microsoft.com>
This patch adds igvm to the Vm config and params as well as
the command line argument to pass igvm file to load into
guest memory. The file must maintain the IGVM format.
The CLI option is featured guarded by igvm feature gate.
The IGVM(Independent Guest Virtual Machine) file format
is designed to encapsulate all information required to
launch a virtual machine on any given virtualization stack,
with support for different isolation technologies such as
AMD SEV-SNP and Intel TDX.
At a conceptual level, this file format is a set of commands created
by the tool that generated the file, used by the loader to construct
the initial guest state. The file format also contains measurement
information that the underlying platform will use to confirm that
the file was loaded correctly and signed by the appropriate authorities.
The IGVM file is generated by the tool:
https://github.com/microsoft/igvm-tooling
The IGVM file is parsed by the following crates:
https://github.com/microsoft/igvm
Signed-off-by: Muminul Islam <muislam@microsoft.com>
This commit changes existing behavior of named TAP interfaces.
When booting a VM with configuration for a named TAP interface,
cloud-hypervisor will create the interface and apply a given
IP configuration to that interface. If the named interface
already exists on the system, the configuration is NOT overwritten.
Setting the ip and netmask fields in a tap interface configuration
for a named tap interface now works by handing this configuration
to the virtio_devices::Net object when it is created with a name.
This commit also touches net_util to make sure that the ip configuration
of existing TAP interfaces is not modified with ip or netmask handed to
open_tap.
Signed-off-by: Markus Sütter <markus.suetter@secunet.com>
We found that it's slow to load JSON when reading snap files. As
described in [1], using from_slice instead of from_reader can fix
this.
Also, fix the error type being returned.
1. https://github.com/serde-rs/json/issues/160
Signed-off-by: Yi Wang <foxywang@tencent.com>
Cloud Hypovrisor supports legacy serial device and virito console device
for VMs. Using legacy serial device, CH can capture full VM console logs,
but its implementation is based on KVM PIO emulation and has poor
performance. Using the virtio console device, the VM console logs will
be sent to CH through the virtio ring, the performance is better, but CH
will only capture the VM console logs after the virtio console device is
initialized, the VM early startup logs will be discarded.
This patch provides a way to enable both the legacy serial device and the
virtio console device as a TTY mode by setting the leagcy serial port as
the VM's early printk device and setting the virtio console as the VM's
main console device.
Then CH can capture early boot logs from the legacy serial device and
capture later logs from the virito console device with better performance.
Signed-off-by: Yong He <alexyonghe@tencent.com>
The seccompiler v0.4.0 started to use `seccomp` syscall instead of the
`prctl` syscall. Also, threads for virtio-deivces should not need any of
these syscalls anyway.
Signed-off-by: Bo Chen <chen.bo@intel.com>
This patch fixes following warnings:
error: boolean to int conversion using if
--> vmm/src/vm.rs:866:42
|
| .create_vm_with_type(if sev_snp_enabled.into() {
| __________________________________________^
| | 1 // SEV_SNP_ENABLED
| | } else {
| | 0 // SEV_SNP_DISABLED
| | })
| |_____________________^ help: replace with from: `u64::from(sev_snp_enabled.into())`
|
= note: `-D clippy::bool-to-int-with-if` implied by `-D warnings`
= note: `sev_snp_enabled.into() as u64` or `sev_snp_enabled.into().into()` can also be valid options
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#bool_to_int_with_if
error: useless conversion to the same type: `bool`
--> vmm/src/vm.rs:866:45
|
| .create_vm_with_type(if sev_snp_enabled.into() {
| ^^^^^^^^^^^^^^^^^^^^^^ help: consider removing `.into()`: `sev_snp_enabled`
|
= note: `-D clippy::useless-conversion` implied by `-D warnings`
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#useless_conversion
error: could not compile `vmm` due to 2 previous errors
Signed-off-by: Muminul Islam <muislam@microsoft.com>
Partially revert 111225a2a5
and add the new dbus and pvpanic arguments.
As we are switching back to clap observe the following changes.
A few examples:
1. `-v -v -v` needs to be written as`-vvv`
2. `--disk D1 --disk D2` and others need to be written as `--disk D1 D2`.
3. `--option value` needs to be written as `--option=value.`
Change integration tests to adapt to the breaking changes.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
Signed-off-by: Ravi kumar Veeramally <ravikumar.veeramally@intel.com>
This struct contains all configuration fields that controls the way how
we generate CPUID for the guest on x86_64. This allows cleaner extension
when adding new configuration fields.
Signed-off-by: Bo Chen <chen.bo@intel.com>
The lock to `vm_config` is held for accessing `cpus.kvm_hyperv` passing
as a reference to `arch::generate_common_cpuid()`, so acquiring the same
lock again while calling to the same function is a deadlock.
Fixes: 3793ffe888
Reported-by: Yi Wang <foxywang@tencent.com>
Signed-off-by: Bo Chen <chen.bo@intel.com>
As part of this initialization for a SEV-SNP VM on MSHV, it is required
that we transition the guest state to secure state using partition
hypercall. This implies all the created VPs will transition to secure
state and could access the guest encrypted memory.
Signed-off-by: Jinank Jain <jinankjain@microsoft.com>
Cloud-Hypervisor takes a path for Unix socket, where it will listen
on. Users can connect to the other end of the socket and access serial
port on the guest.
"--serial socket=/path/to/socket" is the cmdline option to pass to
cloud-hypervisor.
Users can use socat like below to access guest's serial port once the
guest starts to boot:
socat -,crnl UNIX-CONNECT:/path/to/socket
Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com>
The 'derive' feature of `zerocopy` crate now is optional and requires to
be enabled explicitly [1]. Also, a version bump on `acpi_tables` is
needed to reply on a single version of `zerocopy` to avoid compilation
errors.
[1] https://github.com/google/zerocopy/pull/176
Signed-off-by: Bo Chen <chen.bo@intel.com>
Include the TSC frequency as part of the KVM state so that it will be
restored at the destination.
This ensures migration works correctly between hosts that have a
different TSC frequency if the guest is running with TSC as the source
of timekeeping.
Fixes: #5786
Signed-off-by: Rob Bradford <rbradford@rivosinc.com>
EntryPoint had an optional entry_addr, but there is no usage of this
struct that makes it necessary that the address is optional.
Remove the Option to avoid being able to express things that are not
useful.
Signed-off-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>
This fixes all typos found by the typos utility with respect to the config file.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
Update to the latest vm-memory and all the crates that also depend upon
it.
Fix some deprecation warnings.
Signed-off-by: Rob Bradford <rbradford@rivosinc.com>
This feature flag gates the development for SEV-SNP enabled guest.
Also add a helper function to identify if SNP should be enabled for the
guest.
Signed-off-by: Jinank Jain <jinankjain@microsoft.com>
This commit builds on top of the `Monitor::subscribe` function and
makes it possible to broadcast events published from `event-monitor`
over D-Bus.
The broadcasting functionality is enabled if the D-Bus API is enabled
and users who wish to also enable the file based `event-monitor` can do
so with the CLI arg `--event-monitor`.
Signed-off-by: Omer Faruk Bayram <omer.faruk@sartura.hr>
warning: this argument is a mutable reference, but not used mutably
--> vmm/src/sigwinch_listener.rs:121:38
|
121 | fn set_foreground_process_group(tty: &mut File) -> io::Result<()> {
| ^^^^^^^^^ help: consider changing to: `&File`
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#needless_pass_by_ref_mut
Signed-off-by: Rob Bradford <rbradford@rivosinc.com>
warning: this argument is a mutable reference, but not used mutably
--> vmm/src/device_manager.rs:1908:35
|
1908 | fn set_raw_mode(&mut self, f: &mut dyn AsRawFd) -> vmm_sys_util::errno::Result<()> {
| ^^^^^^^^^^^^^^^^ help: consider changing to: `&dyn AsRawFd`
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#needless_pass_by_ref_mut
= note: `#[warn(clippy::needless_pass_by_ref_mut)]` on by default
Signed-off-by: Rob Bradford <rbradford@rivosinc.com>
Add pending removed vcpu check according to VcpuState.removing, which
can avoid cloud hypervisor hangup during continual vcpu resize.
Fix#5419
Signed-off-by: Yi Wang <foxywang@tencent.com>
This patch modifies `event_monitor` to ensure that concurrent access to
`event_log` from multiple threads is safe. Previously, the `event_log`
function would acquire a reference to the event log file and write
to it without doing any synchronization, which made it prone to
data races. This issue likely went under the radar because the
relevant `SAFETY` comment on the unsafe block was incomplete.
The new implementation spawns a dedicated thread named `event-monitor`
solely for writing to the file. It uses the MPMC channel exposed by
`flume` to pass messages to the `event-monitor` thread. Since
`flume::Sender<T>` implements `Sync`, it is safe for multiple threads
to share it and send messages to the `event-monitor` thread.
This is not possible with `std::sync::mpsc::Sender<T>` since it's
`!Sync`, meaning it is not safe for it to be shared between different
threads.
The `event_monitor::set_monitor` function now only initializes
the required global state and returns an instance of the
`Monitor` struct. This decouples the actual logging logic from the
`event_monitor` crate. The `event-monitor` thread is then spawned by
the `vmm` crate.
Signed-off-by: Omer Faruk Bayram <omer.faruk@sartura.hr>
With the addition of the spinning waiting for the exit event to be
received in the CMOS device a regression was introduced into the CMOS
fuzzer. Since there is nothing to receive the event in the fuzzer and
there is nothing to update the bit the that the device is looping on;
introducing an infinite loop.
Use an Option<> type so that when running the device in the fuzzer no
Arc<AtomicBool> is provided effectively disabling the spinning logic.
Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=61165
Signed-off-by: Rob Bradford <rbradford@rivosinc.com>
The reset system is asynchronous with an I/O event (PIO or MMIO) for
ACPI/i8042/CMOS triggering a write to the reset_evt event handler. The
VMM thread will pick up this event on the VMM main loop and then trigger
a shutdown in the CpuManager. However since there is some delay between
the CPU threads being marked to be killed (through the
CpuManager::cpus_kill_signalled bool) it is possible for the guest vCPU
that triggered the exit to be re-entered when the vCPU KVM_RUN is called
after the I/O exit is completed.
This is undesirable and in particular the Linux kernel will attempt to
jump to real mode after a CMOS based exit - this is unsupported in
nested KVM on AMD on Azure and will trigger an error in KVM_RUN.
Solve this problem by spinning in the device that has triggered the
reset until the vcpus_kill_signalled boolean has been updated
indicating that the VMM thread has received the event and called
CpuManager::shutdown(). In particular if this bool is set then the vCPU
threads will not re-enter the guest.
Signed-off-by: Rob Bradford <rbradford@rivosinc.com>
Split interrupt source group restore into two steps, first restore
the irqfd for each interrupt source entry, and second restore the
GSI routing of the entire interrupt source group.
This patch will reduce restore latency of interrupt source group,
and in a 200-concurrent restore test, the patch reduced the
average IOAPIC restore time from 15ms to 1ms.
Signed-off-by: Yong He <alexyonghe@tencent.com>
If the VMM is not already paused then pause the VM prior to executing
the coredump and then resume it after. If the VM is already paused then
the original state is maintained.
Signed-off-by: Yi Wang <foxywang@tencent.com>
Add MSHV_CREATE_DEVICE, MSHV_SET_DEVICE_ATTR ioctls to filters. These
ioctls are required to passthrough PCI devices on mshv.
Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com>
The pause of vcpu is async now, which makes the vm pause is not
synchronised. As virtio device calls paused_sync wait() to make
sure device_manager pause synchronously, if we make vcpu pause
synchronously, the vm pause can be synchronously then. After
vm.pause() returns the vm is really paused now.
This patch adds a AtomicBool variable to mark vcpu paused state,
to make sure the pause of CpuManager is synchronised.
Signed-off-by: Yi Wang <foxywang@tencent.com>
This commit merges crates `qcow`, `vhdx` and `block_util` into the
crate `block`, which can allow `qcow` to use functions from `block_util`
without introducing a circular crate dependency.
This commit is based on crosvm implementation:
f2eecc4152
Signed-off-by: Yu Li <liyu.yukiteru@bytedance.com>
Using this over the LocalApic supports APIC IDs (and hence number of
vCPUs) above 256 (size increases from u8 to u32.)
Signed-off-by: Rob Bradford <rbradford@rivosinc.com>
If the VM has been configured but not yet booted, all we need to do to
support removing a device is to remove it from the config, so it will
never be created.
Signed-off-by: Alyssa Ross <hi@alyssa.is>
This fixes the valid VM config unit tests, which would otherwise fail
to deserialize their expected JSON config due to the missing "gdb" field.
Signed-off-by: Alyssa Ross <hi@alyssa.is>
warning: usage of `Arc<T>` where `T` is not `Send` or `Sync`
--> virtio-devices/src/vsock/device.rs:376:22
|
376 | backend: Arc::new(RwLock::new(backend)),
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
= help: consider using `Rc<T>` instead or wrapping `T` in a std::sync type like `Mutex<T>`
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#arc_with_non_send_sync
= note: `#[warn(clippy::arc_with_non_send_sync)]` on by default
The vsock backend may be shared between threads, so the type `B` in
`Vsock` should be `VsockBackend` and `Sync`.
Considering that `api_receiver` and `gdb_receiver` are only used in vmm
threads, the `Arc` can be replaced by `Rc`.
Signed-off-by: Yu Li <liyu.yukiteru@bytedance.com>
warning: useless use of `vec!`
--> test_infra/src/lib.rs:111:30
|
111 | let mut events = vec![epoll::Event::new(epoll::Events::empty(), 0); 1];
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ help: you can use an array directly: `[epoll::Event::new(epoll::Events::empty(), 0); 1]`
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#useless_vec
= note: `#[warn(clippy::useless_vec)]` on by default
Signed-off-by: Yu Li <liyu.yukiteru@bytedance.com>
This gives users the chance to reduce the number of dependencies
included, which is generally good practice and also reduces code size.
Furthermore, `io_uring` specifically is a strong contender for something
one may wish to disable due to the syscall API's many security issues[1]
[1]: https://security.googleblog.com/2023/06/learnings-from-kctf-vrps-42-linux.html
Signed-off-by: Manish Goregaokar <manishsmail@gmail.com>
This does the same thing as df2a7c17 ("vmm: Ignore and warn TAP FDs
sent via the HTTP request body"), but for the vm.create endpoint,
which also previously would accept file descriptors in the body, and
try to use whatever fd occupied that number as a TAP device.
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Port of df2a7c17 ("vmm: Ignore and warn TAP FDs sent via the HTTP
request body"), but for the vm.create endpoint, which would previously
accept file descriptors in the body, and try to use whatever fd
occupied that number as a TAP device.
Since I had to move the wrapping of the net config in an Arc until
after it was modified, I made the same change to all other endpoints,
so the style stays consistent.
Signed-off-by: Alyssa Ross <hi@alyssa.is>
This commit renames `ram_region_sub_size` to `ram_region_available_size`
and make its value align down to the default page size or hugepage
size of the current memory zone, which can prevent the memory zone from
being split into misaligned parts. And if the available size of ram
region is zero, this region will be marked as consumed even it has
unused space.
Note that there is two methods to use hugepages.
1. Specify `hugepages` for `memory` or `memory-zone`, if the
`hugepage_size` is not specified, the value can be got by `statfs`
for `/dev/hugepages`.
2. Specify a `file` in hugetlbfs for `memory-zone`, the hugepage size
can also be got by `statfs` for the file.
The value for alignment will be the hugepage size if this memory zone
is using hugepages, otherwise the value will be default page size of
system.
Fixes: #5463
Signed-off-by: Yu Li <liyu.yukiteru@bytedance.com>
The previous `arch_memory_regions` function will provide some memory
regions with the specified memory size and fill all the previous
regions before using the next one, but sometimes there may be no need
to fill up the previous one, e.g., the previous one should be aligned
with hugepage size.
This commit make `arch_memory_regions` function not take any
parameters and return the max available regions, the memory manager
can use them on demand.
Fixes: #5463
Signed-off-by: Yu Li <liyu.yukiteru@bytedance.com>
SerialBuffer uses VecDeque::extend, which calls realloc, which a
maximum buffer size of 1 MiB. Starting at allocation sizes of
128 KiB, musl's mallocng allocator will use mremap for the allocation.
Since this was not permitted by the seccomp rules, heavy write load
could crash cloud-hypervisor with a seccomp failure. (Encountered
using virtio-console, but I don't see any reason it wouldn't happen
for the legacy serial device too.)
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Bump to the latest rust-vmm crates, including vm-memory, vfio,
vfio-bindings, vfio-user, virtio-bindings, virtio-queue, linux-loader,
vhost, and vhost-user-backend,
Signed-off-by: Bo Chen <chen.bo@intel.com>
Introduces three new CLI options, `dbus-service-name`,
`dbus-object-path` and `dbus-system-bus` to configure the DBus API.
Signed-off-by: Omer Faruk Bayram <omer.faruk@sartura.hr>
This commit applies the previously created seccomp filter
to the `DbusApi` thread.
Also encloses the main loop of the `DBusApi` thread using
`std::panic::catch_unwind` and `AssertUnwindSafe` in order to mirror
the behavior of the HTTP API.
Signed-off-by: Omer Faruk Bayram <omer.faruk@sartura.hr>
This commit adds support for graceful shutdown of the DBusApi thread
using `futures::channel::oneshot` channels. By using oneshot channels,
we ensure that the thread has enough time to send a response to the
`VmmShutdown` method call before it is terminated. Without this step,
the thread may be terminated before it can send a response, resulting
in an error message on the client side stating that the message
recipient disconnected from the message bus without providing a reply.
Also changes the default values for DBus service name, object path
and interface name.
Signed-off-by: Omer Faruk Bayram <omer.faruk@sartura.hr>
This commit introduces three new dependencies: `zbus`, `futures`
and `blocking`. `blocking` is used to call the Internal API in zbus'
async context which is driven by `futures::executor`. They are all
behind the `dbus_api` feature flag.
The D-Bus API implementation is behind the same `dbus_api` feature
flag as well.
Signed-off-by: Omer Faruk Bayram <omer.faruk@sartura.hr>
The refactoring on deferring address space allocation (#5169) broke TDX,
as TDX initialization needs to access guest memory for encryption and
measurement of guest pages.
Signed-off-by: Bo Chen <chen.bo@intel.com>
The current implementation of breadth first traversal for device tree
uses a temporary vector, therefore causes unnecessary memory copy.
Remove it and do it within vector nodes.
Signed-off-by: Hao Xu <howeyxu@tencent.com>