When a pty is resized (using the TIOCSWINSZ ioctl -- see ioctl_tty(2)),
the kernel will send a SIGWINCH signal to the pty's foreground process
group to notify it of the resize. This is the only way to be notified
by the kernel of a pty resize.
We can't just make the cloud-hypervisor process's process group the
foreground process group though, because a process can only set the
foreground process group of its controlling terminal, and
cloud-hypervisor's controlling terminal will often be the terminal the
user is running it in. To work around this, we fork a subprocess in a
new process group, and set its process group to be the foreground
process group of the pty. The subprocess additionally must be running
in a new session so that it can have a different controlling
terminal. This subprocess writes a byte to a pipe every time the pty
is resized, and the virtio-console device can listen for this in its
epoll loop.
Alternatives I considered were to have the subprocess just send
SIGWINCH to its parent, and to use an eventfd instead of a pipe.
I decided against the signal approach because re-purposing a signal
that has a very specific meaning (even if this use was only slightly
different to its normal meaning) felt unclean, and because it would
have required using pidfds to avoid race conditions if
cloud-hypervisor had terminated, which added complexity. I decided
against using an eventfd because using a pipe instead allows the child
to be notified (via poll(2)) when nothing is reading from the pipe any
more, meaning it can be reliably notified of parent death and
terminate itself immediately.
I used clone3(2) instead of fork(2) because without
CLONE_CLEAR_SIGHAND the subprocess would inherit signal-hook's signal
handlers, and there's no other straightforward way to restore all signal
handlers to their defaults in the child process. The only way to do
it would be to iterate through all possible signals, or maintain a
global list of monitored signals ourselves (vmm:vm::HANDLED_SIGNALS is
insufficient because it doesn't take into account e.g. the SIGSYS
signal handler that catches seccomp violations).
Signed-off-by: Alyssa Ross <hi@alyssa.is>
This prepares us to be able to handle console resizes in the console
device's epoll loop, which we'll have to do if the output is a pty,
since we won't get SIGWINCH from it.
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Musl often uses mmap to allocate memory where Glibc would use brk.
This has caused seccomp violations for me on the API and signal
handling threads.
Signed-off-by: Alyssa Ross <hi@alyssa.is>
error: all if blocks contain the same code at the end
--> vmm/src/memory_manager.rs:884:9
|
884 | / Ok(mm)
885 | | }
| |_________^
Signed-off-by: Bo Chen <chen.bo@intel.com>
This concept ends up being broken with multiple types on input connected
e.g. console on TTY and serial on PTY. Already the code for checking for
injecting into the serial device checks that the serial is configured.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Introduce a common solution for spawning the virtio threads which will
make it easier to add the panic handling.
During this effort I discovered that there were no seccomp filters
registered for the vhost-user-net thread nor the vhost-user-block
thread. This change also incorporates basic seccomp filters for those as
part of the refactoring.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Current AArch64 power button is only for device tree using a PL061
GPIO controller device. Since AArch64 now supports ACPI, this
commit extend the power button on AArch64 to:
- Using GED for ACPI+UEFI boot.
- Using PL061 for device tree boot.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
These statements are useful for understanding the cause of reset or
shutdown of the VM and are not spammy so should be included at info!()
level.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Despite setting up a dedicated thread for signal handling, we weren't
making sure that the signals we were listening for there were actually
dispatched to the right thread. While the signal-hook provides an
iterator API, so we can know that we're only processing the signals
coming out of the iterator on our signal handling thread, the actual
signal handling code from signal-hook, which pushes the signals onto
the iterator, can run on any thread. This can lead to seccomp
violations when the signal-hook signal handler does something that
isn't allowed on that thread by our seccomp policy.
To reproduce, resize a terminal running cloud-hypervisor continuously
for a few minutes. Eventually, the kernel will deliver a SIGWINCH to
a thread with a restrictive seccomp policy, and a seccomp violation
will trigger.
As part of this change, it's also necessary to allow rt_sigreturn(2)
on the signal handling thread, so signal handlers are actually allowed
to run on it. The fact that this didn't seem to be needed before
makes me think that signal handlers were almost _never_ actually
running on the signal handling thread.
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Move the processing of the input from stdin, PTY or file from the VMM
thread to the existing virtio-console thread. The handling of the resize
of a virtio-console has not changed but the name of the struct used to
support that has been renamed to reflect its usage.
Fixes: #3060
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Downcasting of GicDevice trait might fail. Therefore we try to
downcast the trait first and only if the downcasting succeeded we
can then use the object to call methods. Otherwise, do nothing and
log the failure.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
This commit implements the GIC (including both GICv3 and GICv3ITS)
Pausable trait. The pause of device manager will trigger a "pause"
of GIC, where we flush GIC pending tables and ITS tables to the
guest RAM.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
This prevents the boot of the guest kernel from being blocked by
blocking I/O on the serial output since the data will be buffered into
the SerialBuffer.
Fixes: #3004
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Introduce a dynamic buffer for storing output from the serial port. The
SerialBuffer implements std::io::Write and can be used in place of the
direct output for the serial device.
The internals of the buffer is a vector that grows dynamically based on
demand up to a fixed size at which point old data will be overwritten.
Currently the buffer is only flushed upon writes.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The rust-vmm crates we're pulling from git have renamed their main
branches. We need to update the branch names we're giving to Cargo,
or people who don't have these dependencies cached will get errors
like this when trying to build:
error: failed to get `vm-fdt` as a dependency of package `arch v0.1.0 (/home/src/cloud-hypervisor/arch)`
Caused by:
failed to load source for dependency `vm-fdt`
Caused by:
Unable to update https://github.com/rust-vmm/vm-fdt?branch=master#031572a6
Caused by:
object not found - no match for id (031572a6edc2f566a7278f1e17088fc5308d27ab); class=Odb (9); code=NotFound (-3)
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Remove the indirection of a dispatch table and simply use the enum as
the event data for the events.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Use two separate events for the console and serial PTY and then drive
the handling of the inputs on the PTY separately. This results in the
correct behaviour when both console and serial are attached to the PTY
as they are triggered separately on the epoll so events are not lost.
Fixes: #3012
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Check the config to find out which device is attached to the tty and
then send the input from the user into that device (serial or
virtio-console.)
Fixes: #3005
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
vhdx_sync.rs in block_util implements traits to represent the vhdx
crate as a supported block device in the cloud hypervisor. The vhdx
is added to the block device list in device_manager.rs at the vmm
crate so that it can automatically detect a vhdx disk and invoke the
corresponding crate.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Signed-off-by: Fazla Mehrab <akm.fazla.mehrab@intel.com>
We are relying on applying empty 'seccomp' filters to support the
'--seccomp false' option, which will be treated as an error with the
updated 'seccompiler' crate. This patch fixes this issue by explicitly
checking whether the 'seccomp' filter is empty before applying the
filter.
Signed-off-by: Bo Chen <chen.bo@intel.com>
It is forbidden that the same memory zone belongs to more than one
NUMA node. This commit adds related validation to the `--numa`
parameter to prevent the user from specifying such configuration.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
The optional device tree node distance-map describes the relative
distance (memory latency) between all NUMA nodes.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
This is to make sure the NUMA node data structures can be accessed
both from the `vmm` crate and `arch` crate.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
The AArch64 platform provides a NUMA binding for the device tree,
which means on AArch64 platform, the NUMA setup can be extended to
more than the ACPI feature.
Based on above, this commit extends the NUMA setup and data
structures to following scenarios:
- All AArch64 platform
- x86_64 platform with ACPI feature enabled
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
Signed-off-by: Michael Zhao <Michael.Zhao@arm.com>
Instead of panicking with an expect() function, the QcowDiskSync::new
function now propagates the error properly. This ensures the VMM will
not panic, which might be the source of weird errors if only one thread
exits while the VMM continues to run.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
We cannot let vhost-user devices connect to the backend when the Block,
Fs or Net object is being created during a restore/migration. The reason
is we can't have two VMs (source and destination) connected to the same
backend at the same time. That's why we must delay the connection with
the vhost-user backend until the restoration is performed.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The code wasn't doing what it was expected to. The '?' was simply
returning the error to the top level function, meaning the Err() case in
the match was never hit. Moving the whole logic to a dedicated function
allows to identify when something got wrong without propagating to the
calling function, so that we can still stop the dirty logging and
unpause the VM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In case the migration succeeds, the destination VM will be correctly
running, with potential vhost-user backends attached to it. We can't let
the source VM trying to reconnect to the same backends, which is why
it's safer to shutdown the source VM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In anticipation for creating vhost-user devices in a different way when
being restored compared to a fresh start, this commit introduces a new
boolean created by the Vm depending on the use case, and passed down to
the DeviceManager. In the future, the DeviceManager will use this flag
to assess how vhost-user devices should be created.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Correct operation of user devices (vfio-user) requires shared memory so
flag this to prevent it from failing in strange ways.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Create the vfio-user / user devices from the config. Currently hotplug
of the devices is not supported nor can they be placed behind the
(virt-)iommu.
Removal of the coldplugged device is however supported.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This allows the user to specify devices that are running in a different
userspace process and communicated with vfio-user.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Allow vsocks to connect to Unix sockets on the host running
cloud-hypervisor with enabled seccomp.
Reported-by: Philippe Schaaf <philippe.schaaf@secunet.com>
Tested-by: Franz Girlich <franz.girlich@tu-ilmenau.de>
Signed-off-by: Markus Theil <markus.theil@tu-ilmenau.de>
This doesn't really affect the build as we ship a Cargo.lock with fixed
versions in. However for clarity it makes sense to use fixed versions
throughout and let dependabot update them.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The optional Processor Properties Topology Table (PPTT) table is
used to describe the topological structure of processors controlled
by the OSPM, and their shared resources, such as caches. The table
can also describe additional information such as which nodes in the
processor topology constitute a physical package.
The ACPI PPTT table supports topology descriptions for ACPI guests.
Therefore, this commit adds the PPTT table for AArch64 to enable
CPU topology feature for ACPI.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
In an Arm system, the hierarchy of CPUs is defined through three
entities that are used to describe the layout of physical CPUs in
the system:
- cluster
- core
- thread
All these three entities have their own FDT node field. Therefore,
This commit adds an AArch64-specific helper to pass the config from
the Cloud Hypervisor command line to the `configure_system`, where
eventually the `create_fdt` is called.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
Make sure the DeviceManager is triggered for all migration operations.
The dirty pages are merged from MemoryManager and DeviceManager before
to be sent up to the Vmm in lib.rs.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Now that Migratable provides the methods for starting, stopping and
retrieving the dirty pages, we move the existing code to these new
functions.
No functional change intended.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This patch connects the dots between the vm.rs code and each Migratable
device, in order to make sure Migratable methods are correctly invoked
when migration happens.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In anticipation for supporting the merge of multiple dirty pages coming
from multiple devices, this patch factorizes the creation of a
MemoryRangeTable from a bitmap, as well as providing a simple method for
merging the dirty pages regions under a single MemoryRangeTable.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This patch adds all the seccomp rules missing for MSHV.
With this patch MSFT internal CI runs with seccomp enabled.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
This patch adds a fallback path for sending live migration, where it
ensures the following behavior of source VM post live-migration:
1. The source VM will be paused only when the migration is completed
successfully, or otherwise it will keep running;
2. The source VM will always stop dirty pages logging.
Fixes: #2895
Signed-off-by: Bo Chen <chen.bo@intel.com>
This rule is needed to boot windows guest.
This bug was introduced while we tried to boot
windows guest on MSHV.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
This patch modify the existing live migration code
to support MSHV. Adds couple of new functions to enable
and disable dirty page tracking. Add missing IOCTL
to the seccomp rules for live migration.
Adds necessary flags for MSHV.
This changes don't affect KVM functionality at all.
In order to get better performance it is good to
enable dirty page tracking when we start live migration
and disable it when the migration is done.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
Right now, get_dirty_log API has two parameters,
slot and memory_size.
MSHV needs gpa to retrieve the page states. GPA is
needed as MSHV returns the state base on PFN.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
It's totally acceptable to snapshot and restore a virtio-fs device that
has no cache region, since this is a valid mode of functioning for
virtio-fs itself.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
With the new beta version, clippy complains about redundant allocation
when using Arc<Box<dyn T>>, and suggests replacing it simply with
Arc<dyn T>.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
As we are now using an global control to start/stop dirty pages log from
the `hypervisor` crate, we need to explicitly tell the hypervisor (KVM)
whether a region needs dirty page tracking when it is created.
This reverts commit f063346de3.
Signed-off-by: Bo Chen <chen.bo@intel.com>
Following KVM interfaces, the `hypervisor` crate now provides interfaces
to start/stop the dirty pages logging on a per region basis, and asks
its users (e.g. the `vmm` crate) to iterate over the regions that needs
dirty pages log. MSHV only has a global control to start/stop dirty
pages log on all regions at once.
This patch refactors related APIs from the `hypervisor` crate to provide
a global control to start/stop dirty pages log (following MSHV's
behaviors), and keeps tracking the regions need dirty pages log for
KVM. It avoids leaking hypervisor-specific behaviors out of the
`hypervisor` crate.
Signed-off-by: Bo Chen <chen.bo@intel.com>
This patch adds a common function "Vmm::vm_check_cpuid_compatibility()"
to be shared by both live-migration and snapshot/restore.
Signed-off-by: Bo Chen <chen.bo@intel.com>
We now send not only the 'VmConfig' at the 'Command::Config' step of
live migration, but also send the 'common CPUID'. In this way, we can
check the compatibility of CPUID features between the source and
destination VMs, and abort live migration early if needed.
Signed-off-by: Bo Chen <chen.bo@intel.com>
With the support of dynamically turning on/off dirty-pages-log during
live-migration (only for guest RAM regions), we now can create guest
memory regions without dirty-pages-log by default both for guest RAM
regions and other regions backed by file/device.
Signed-off-by: Bo Chen <chen.bo@intel.com>
This patch extends slightly the current live-migration code path with
the ability to dynamically start and stop logging dirty-pages, which
relies on two new methods added to the `hypervisor::vm::Vm` Trait. This
patch also contains a complete implementation of the two new methods
based on `kvm` and placeholders for `mshv` in the `hypervisor` crate.
Fixes: #2858
Signed-off-by: Bo Chen <chen.bo@intel.com>
Whenever a file descriptor is sent through the control message, it
requires fcntl() syscall to handle it, meaning we must allow it through
the list of syscalls authorized for the HTTP thread.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
When running TDX guest, the Guest Physical Address space is limited by
a shared bit that is located on bit 47 for 4 level paging, and on bit 51
for 5 level paging (when GPAW bit is 1). In order to keep things simple,
and since a 47 bits address space is 128TiB large, we ensure to limit
the physical addressable space to 47 bits when runnning TDX.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
When running a TDX guest, we need the virtio drivers to use the DMA API
to share specific memory pages with the VMM on the host. The point is to
let the VMM get access to the pages related to the buffers pointed by
the virtqueues.
The way to force the virtio drivers to use the DMA API is by exposing
the virtio devices with the feature VIRTIO_F_IOMMU_PLATFORM. This is a
feature indicating the device will require some address translation, as
it will not deal directly with physical addresses.
Cloud Hypervisor takes care of this requirement by adding a generic
parameter called "force_iommu". This parameter value is decided based on
the "tdx" feature gate, and then passed to the DeviceManager. It's up to
the DeviceManager to use this parameter on every virtio device creation,
which will imply setting the VIRTIO_F_IOMMU_PLATFORM feature.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This refactoring ensures all CPUID related operations are centralized in
`arch::x86_64` module, and exposes only two related public functions to
the vmm crate, e.g. `generate_common_cpuid` and `configure_vcpu`.
Signed-off-by: Bo Chen <chen.bo@intel.com>
In order to let a separate process open a TAP device and pass the file
descriptor through the control message mechanism, this patch adds the
support for sending a file descriptor over to the Cloud Hypervisor
process along with the add-net HTTP API command.
The implementation uses the NetConfig structure mutably to update the
list of fds with the one passed through control message. The list should
always be empty prior to this, as it makes no sense to provide a list of
fds once the Cloud Hypervisor process has already been started.
It is important to note that reboot is supported since the file
descriptor is duplicated upon receival, letting the VM only use the
duplicated one. The original file descriptor is kept open in order to
support a potential reboot.
Fixes#2525
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
There are some seccomp rules needed for MSHV
in virtio-devices but not for KVM. We only want to
add those rules based on MSHV feature guard.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
The micro-http crate now uses recvmsg() syscall in order to receive file
descriptors through control messages. This means the syscall must be
part of the authorized list in the seccomp filters.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>