HV APIC(i.e., synthetic APIC controller exposed by Microsoft Hypervisor)
does not support one-shot operation using a TSC deadline value. Due to
which we see the following backtrace inside the guest when running with
hypervisor-fw/OVMF:
[ 0.560765] unchecked MSR access error: WRMSR to 0x832 (tried to
write 0x00000000000400ec) at rIP: 0xffffffff8f473594
(native_write_msr+0x4/0x30)
[ 0.560765] Call Trace:
[ 0.560765] ? native_apic_msr_write+0x2b/0x30
[ 0.560765] __setup_APIC_LVTT+0xbc/0xe0
[ 0.560765] lapic_timer_set_oneshot+0x27/0x30
[ 0.560765] clockevents_switch_state+0xaf/0xf0
[ 0.560765] tick_setup_periodic+0x47/0x90
[ 0.560765] tick_setup_device.isra.0+0x7c/0x110
[ 0.560765] tick_check_new_device+0xce/0xf0
[ 0.560765] clockevents_register_device+0x82/0x170
[ 0.560765] clockevents_config_and_register+0x2f/0x40
[ 0.560765] setup_APIC_timer+0xe1/0xf0
[ 0.560765] setup_boot_APIC_clock+0x5f/0x66
[ 0.560765] native_smp_prepare_cpus+0x1d6/0x286
[ 0.560765] kernel_init_freeable+0xcf/0x255
[ 0.560765] ? rest_init+0xb0/0xb0
[ 0.560765] kernel_init+0xe/0x110
[ 0.560765] ret_from_fork+0x22/0x40
Also, if this feature is exposed guest would not finish booting and get
stuck right before unpacking the root filesystem.
Fixes: 06e8d1c40 ("hypervisor: mshv: fix topology for Intel HW on MSHV")
Signed-off-by: Jinank Jain <jinankjain@microsoft.com>
This requires stashing the config values in `struct Vmm`. The configs
should be validated before before creating the VMM thread. Refactor the
code and update documentation where necessary.
The place where the rules are applied remain the same.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
Add file/dir paths from landlock-rules arguments to ruleset. Invoke
apply_landlock on VmConfig to apply config specific rules to ruleset.
Once done, any threads spawned by vmm thread will be automatically
sandboxed with the ruleset in vmm thread.
Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com>
Introduce ApplyLandlock trait and add implementations to VmConfig
elements with PathBufs. This trait adds config specific rules to
landlock ruleset.
Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com>
Users can use this parameter to pass extra paths that 'vmm' and its
child threads can use at runtime. Hotplug is the primary usecase for
this parameter.
In order to hotplug devices that use local files: disks, memory zones,
pmem devices etc, users can use this option to pass the path/s that will
be used during hotplug while starting cloud-hypervisor. Doing this will
allow landlock to add required rules to grant access to these paths when
cloud-hypervisor process starts.
Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com>
Signed-off-by: Wei Liu <liuwe@microsoft.com>
Users can use this cmdline option to enable/disable Landlock based
sandboxing while running cloud-hypervisor.
Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com>
landlock syscalls are required by event_monitor, signal_handler,
http-server and vmm threads. Rest of the threads are spawned by the vmm
thread and they automatically inherit the ruleset from the vmm thread.
Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com>
In 42e9632c53 a fix was made to address a
typo in the taplo configuration file. Fixing this typo indicated that
many Cargo.toml files were no longer adhering to the formatting rules.
Fix the formatting by running `taplo fmt`.
Signed-off-by: Rob Bradford <rbradford@rivosinc.com>
While checking if the console device is a tty use the cloned fd instead
of libc::STDOUT_FILENO.
Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com>
Console devices are created after vm_config is received and the created
devices are passed Vm during vm_receive_state.
Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com>
During vm_shutdown or vm_snapshot, all the console devices will be
closed. When this happens stdout (FD #2) will also be closed as the
console device using these FD is closed. If the VM were to be started
later, FD#2 can be assigned to a different file. But
pre_create_console_devices looks for FD#2 while opening tty device,
which could point to any file.
To avoid this problem, the STDOUT FD is duplicated when being
assigned to a console device. Even if the console devices were to be
closed, the duplicated FD will be closed and FD#2 will continue to
point to STDOUT.
Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com>
While adding console devices, DeviceManager will now use the FDs in
console_info instead of creating them.
To reduce the size of this commit, I marked some variables are unused
with '_' prefix. All those variables are cleaned up in next commit.
Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com>
Use pre_create_console_devices method to create and populate console
device FDs into console_info in Vmm Object.
Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com>
With this change all the information to manage console devices is now
available within Vmm Object.
Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com>
Introduce ConsoleInfo struct. This struct will be used to store FDs of
console devices created in pre_create_console_devices and passed to
vm_boot.
Move set_raw_mode, create_pty methods to console_devices.rs to
consolidate console management methods into a single module.
Lastly, copy the logic to create and configure console devices into
pre_create_console_devices method.
Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com>
Misspellings were identified by:
https://github.com/marketplace/actions/check-spelling
* Initial corrections based on forbidden patterns from the action
* Additional corrections by Google Chrome auto-suggest
* Some manual corrections
* Adding markdown bullets to readme credits section
Signed-off-by: Josh Soref <2119212+jsoref@users.noreply.github.com>
Currently by default each core is allocated it's own socket. Basically
it is n socket 1 core 1 thread/core kind of a structure as witnessed
from within the guest.
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 8
NUMA node(s): 1
This is not a good default topology because resources are distributed
across multiple sockets. For example, a Linux guest with multi socket
configuration will have to calibrate TSC per socket due to which it
might observe a higher amount of boot time than usual.
A better idea for default topology would be 1 socket n core 1
thread/core which ensure better resource locality.
After this change topology would change to:
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1
Fixes: #6497
Signed-off-by: Jinank Jain <jinankjain@microsoft.com>
The original code gave an owned fd to UnixListener. That made the same
fd wrapped into two owned files.
When the files were dropped, the same fd would be closed more than once.
A newly introduced check in Rust's stdlib caught that error.
A newly cloned fd should be given to UnixListener.
Fixes: #6485
Signed-off-by: Wei Liu <liuwe@microsoft.com>
For MSHV we always create frozen partition, so we
resume the VM during boot. Also during pause and resume
VM events we call hypervisor specific API.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
Consume FDs passed via SCM_RIGHTs to VmRestore API and assign them
appropriately to RestoredNetConfig's fds field.
Signed-off-by: Purna Pavan Chandra <paekkaladevi@linux.microsoft.com>
'NetConfig' FDs, when explicitly passed via SCM_RIGHTS during VM
creation, are marked as invalid during snapshot. See: #6332.
So, Restore should support input for the new net FDs. This patch adds
new field 'net_fds' to 'RestoreConfig'. The FDs passed using this new
field are replaced into the 'fds' field of NetConfig appropriately.
The 'validate()' function ensures all net devices from 'VmConfig' backed
by FDs have a corresponding 'RestoreNetConfig' with a matched 'id' and
expected number of FDs.
The unit tests provide different inputs to parse and validate functions
to make sure parsing and error handling is as per expectation.
Fixes#6286
Signed-off-by: Purna Pavan Chandra <paekkaladevi@linux.microsoft.com>
Co-authored-by: Bo Chen <chen.bo@intel.com>
The compiler is now able to warn if an invalid attribute (e.g like a
feature) is not available.
See https://blog.rust-lang.org/2024/05/06/check-cfg.html for more
details.
Add build.rs files in the crates that use #cfg(fuzzing) to add fuzzing
to the list of valid cfg attributes.
Signed-off-by: Rob Bradford <rbradford@rivosinc.com>
Code in this crate is conditional on this feature so it necessary to
expose as a new feature and use that feature as a dependency when the
feature is enabled on the vmm crate.
Signed-off-by: Rob Bradford <rbradford@rivosinc.com>
The "dhat-heap" feature needs to be enabled inside the vmm crate as a
depenency from the top-level as there is build time check for that
feature inside the vmm crate.
Signed-off-by: Rob Bradford <rbradford@rivosinc.com>
Update the vhost-user-backend crate version used along with related
crates (vhost and virtio-queue.) This requires minor changes to the
types used for the memory in the backends with the use of the
BitmapMmapRegion type for the Bitmap implementation.
Signed-off-by: Rob Bradford <rbradford@rivosinc.com>
This is to resolve the inconsistencies from our openapi specification,
as default values do not make sense for required fields.
Reported-by: James O. D. Hunt <james.o.hunt@intel.com>
Signed-off-by: Bo Chen <chen.bo@intel.com>
As MSHV also implements set/get_clock data, this patch
removes the KVM feature guard and make it x86_64 only and
both for KVM and MSHV.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
The contents of this crate may change and cause conflicts - re-exporting
the contents is unnecessary.
Signed-off-by: Rob Bradford <rbradford@rivosinc.com>
The members for {Io, Mmio}{Read, Write} are unused as instead exits of
those types are handled through the VmOps interface. Removing these is
also a prerequisite due to changes in the mutability of the
VcpuFd::run() method.
Signed-off-by: Rob Bradford <rbradford@rivosinc.com>
When using multiple PCI segments, the 32-bit and 64-bit mmio
aperture is split equally between each segment. Add an option
to configure the 'weight'. For example, a PCI segment with a
`mmio32_aperture_weight` of 2 will be allocated twice as much
32-bit mmio space as a normal PCI segment.
Signed-off-by: Thomas Barrett <tbarrett@crusoeenergy.com>
Assigning KVM_IRQFD (when unmasking a guest IRQ) after
KVM_SET_GSI_ROUTING can avoid kernel panic on the guest that is not
patched with commit a80ced6ea514 (KVM: SVM: fix panic on out-of-bounds
guest IRQ) on AMD systems.
Meanwhile, it is required to deassign KVM_IRQFD (when masking a guest
IRQ) before KVM_SET_GSI_ROUTING (see #3827).
Fixes: #6353
Signed-off-by: Yi Wang <foxywang@tencent.com>
Signed-off-by: Bo Chen <chen.bo@intel.com>
Add infrastructure to lookup the host address for mmio regions on
external dma mapping requests. This specifically resolves vfio
passthrough for virtio-iommu, allowing for nested virtualization to pass
external devices through.
Fixes#6110
Signed-off-by: Andrew Carp <acarp@crusoeenergy.com>
VfioUserDmaMapping is already in the pci crate, this moves
VfioDmaMapping to match the behavior. This is a necessary change to
allow the VfioDmaMapping trait to have access to MmioRegion memory
without creating a circular dependency. The VfioDmaMapping trait
needs to have access to mmio regions to map external devices over
mmio (a follow-up commit).
Signed-off-by: Andrew Carp <acarp@crusoeenergy.com>
The memory region that is associated with the hotpluggable part of
a virtio-mem zone isn't backed by the file specified in the
MemoryZoneConfig. The file is used only for the fixed part of the
zone. When you try to restore a snapshot with virtio-mem, the
backing file is used for all its regions. This results in the
following error:
VmRestore(MemoryManager(GuestMemoryRegion(MappingPastEof)))
This patch sets backing_file only for the fixed part of a virtio-mem
zone.
Fixes: #6337
Signed-off-by: Alexandru Matei <alexandru.matei@uipath.com>
Add IOCTL number for generic hypercall ioctl (MSHV_ROOT_HVCALL).
Update IOCTL numbers for set/get vp state.
Signed-off-by: Nuno Das Neves <nudasnev@microsoft.com>
The 'NetConfig' may contain FDs which can't be serialized correctly, as
FDs can only be donated from another process via a Unix domain socket
with `SCM_RIGHTS`. To avoid false use of the serialized FDs, this patch
explicitly set 'NetConfig' FDs as invalid for (de)serialization.
See: #6286
Signed-off-by: Bo Chen <chen.bo@intel.com>
warning: assigning the result of `Clone::clone()` may be inefficient
--> vmm/src/device_manager.rs:4188:17
|
4188 | id = child_id.clone();
| ^^^^^^^^^^^^^^^^^^^^^ help: use `clone_from()`: `id.clone_from(child_id)`
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#assigning_clones
= note: `#[warn(clippy::assigning_clones)]` on by default
Signed-off-by: Rob Bradford <rbradford@rivosinc.com>
A deadlock can happen from the destination VM of live upgrade or
migration due to waiting on paused device worker threads. For example,
when a serialization error happens after the `DeviceManager` struct is
restored (where all virtio device worker threads are spawned but in
paused/parked state), a deadlock will happen from
`DeviceManager::drop()`, as it blocks for waiting worker threads to
join.
This patch ensures that we wake up all device (mostly virtio) worker
threads before we block for them to join.
Signed-off-by: Bo Chen <chen.bo@intel.com>
Currently, we only support injecting NMI for KVM guests but we can do
the same for MSHV guests as well to have feature parity.
Signed-off-by: Jinank Jain <jinankjain@microsoft.com>
This commit ensures that the HttpApi thread flushes all the responses
before the application shuts down. Without this step, in case of a
VmmShutdown request the application might terminate before the
thread sends a response.
Fixes: #6247
Signed-off-by: Alexandru Matei <alexandru.matei@uipath.com>
On platforms where PCIe P2P is supported, inject a PCI capability into
NVIDIA GPU to indicate support.
Signed-off-by: Thomas Barrett <tbarrett@crusoeenergy.com>
Host data that is passed to the hypervisor. Then
the firmware includes the data in the attestation report.
The data might include any key or secret that the SevSnp guest
might need later.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
The host data provided at launch. Data is passed
to the hypervisor during the completion of the
isolated import.
Host Data provided by the hypervisor during guest launch.
The firmware includes this value in all attestation
reports for the guest.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
With the nightly toolchain (2024-02-18) cargo check will flag up
redundant imports either because they are pulled in by the prelude on
earlier match.
Remove those redundant imports.
Signed-off-by: Rob Bradford <rbradford@rivosinc.com>
When a guest running on a terminal reboots, the sigwinch_listener
subprocess exits and a new one restarts. The parent never wait()s
for children, so the old subprocess remains as a zombie. With further
reboots, more and more zombies build up.
As there are no other children for which we want the exit status,
the easiest fix is to take advantage of the implicit reaping specified
by POSIX when we set the disposition of SIGCHLD to SIG_IGN.
For this to work, we also need to set the correct default exit signal
of SIGCHLD when using clone3() CLONE_CLEAR_SIGHAND. Unlike the fallback
fork() path, clone_args::default() initialises the exit signal to zero,
which results in a child with non-standard reaping behaviour.
Signed-off-by: Chris Webb <chris@arachsys.com>
Allow cloud-hypervisor to direct boot the bzImage kernel format using
the regular 32 bit entry point. This can share the memory and vcpu
setup with the regular PVH boot code, but requires the setup of the
'zero page'.
Signed-off-by: Stefan Nuernberger <stefan.nuernberger@cyberus-technology.de>
In case of SEV-SNP guest devices use sw-iotlb to gain access guest
memory for DMA. For that F_IOMMU/F_ACCESS_PLATFORM must be exposed in
the feature set of virtio devices.
Signed-off-by: Jinank Jain <jinankjain@microsoft.com>
Signed-off-by: Muminul Islam <muislam@microsoft.com>
The 'generate_ram_ranges' function currently hardcodes the assumption
that there are only 2 E820 RAM entries. This is not flexible enough to
handle vendor specific memory holes. Returning a Vec is also more
convenient for users of this function.
Signed-off-by: Thomas Barrett <tbarrett@crusoeenergy.com>
Since the ACPI tables are generated inside the IGVM file in case of
SEV-SNP guest. So, we don't need to generate it inside the cloud
hypervisor.
Signed-off-by: Jinank Jain <jinankjain@microsoft.com>
Signed-off-by: Muminul Islam <muislam@microsoft.com>
We occasionally saw cloud-hypervisor crashed due to seccomp violations. The
coredumps showed the HTTP API thread crashing after it attempted to call
sched_yield(). The call came from rust stdlib's mpmc module, which calls
sched_yield() if several attempts to busy-wait for a condition to fulfil fall
short.
Since the system call is harmless and it comes from the stdlib, I opted to allow
all threads to call it.
Signed-off-by: Peteris Rudzusiks <rye@stripe.com>