Commit Graph

427 Commits

Author SHA1 Message Date
Sebastien Boeuf
58d8206e2b migration: Use MemoryManager restore code path
Instead of creating a MemoryManager from scratch, let's reuse the same
code path used by snapshot/restore, so that memory regions are created
identically to what they were on the source VM.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2021-10-06 18:35:49 -07:00
Sebastien Boeuf
6a55768d94 vmm: Create MemoryManager from restore data
Extending the MemoryManager::new() function to be able to create a
MemoryManager from data that have been previously stored instead of
always creating everything from scratch.

This change brings real added value as it allows a VM to be restored
respecting the proper memory layout instead of hoping the regions will
be created the way they were before.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2021-10-06 18:35:49 -07:00
Rob Bradford
83066cf58e vmm: Set a default maximum physical address size
When using PVH for booting (which we use for all firmwares and direct
kernel boot) the Linux kernel does not configure LA57 correctly. As such
we need to limit the address space to the maximum 4-level paging address
space.

If the user knows that their guest image can take advantage of the
5-level addressing and they need it for their workload then they can
increase the physical address space appropriately.

This PR removes the TDX specific handling as the new address space limit
is below the one that that code specified.

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2021-10-01 08:59:15 -07:00
Sebastien Boeuf
495e444ca6 vmm: Add ACPI tables to TdVmmData when running TDX
Whenever running TDX, we must pass the ACPI tables to the TDVF firmware
running in the guest. The proper way to do this is by adding the tables
to the TdHob as a TdVmmData type, so that TDVF will know how to access
these tables and expose them to the guest OS.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2021-09-30 06:35:55 -07:00
Sebastien Boeuf
b99a3a7dc9 vmm: Factorize ACPI tables creation inside boot() function
Instead of having the ACPI tables being created both in x86_64 and
aarch64 implementations of configure_system(), we can remove the
duplicated code by moving the ACPI tables creation in vm.rs inside the
boot() function.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2021-09-30 06:35:55 -07:00
Yu Li
08021087ec vmm: add prefault option in memory and memory-zone
The argument `prefault` is provided in MemoryManager, but it can
only be used by SGX and restore.
With prefault (MAP_POPULATE) been set, subsequent page faults will
decrease during running, although it will make boot slower.

This commit adds `prefault` in MemoryConfig and MemoryZoneConfig.
To resolve conflict between memory and restore, argument
`prefault` has been changed from `bool` to `Option<bool>`, when
its value is None, config from memory will be used, otherwise
argument in Option will be used.

Signed-off-by: Yu Li <liyu.yukiteru@bytedance.com>
2021-09-29 14:17:35 +02:00
Sebastien Boeuf
59031531b6 vmm: Simplify the way memory is snapshot and restored
By using a single file for storing the memory ranges, we simplify the
way snapshot/restore works by avoiding multiples files, but the main and
more important point is that we have now a way to save only the ranges
that matter. In particular, the ranges related to virtio-mem regions are
not always fully hotplugged, meaning we don't want to save the entire
region. That's where the usage of memory ranges is interesting as it
lets us optimize the snapshot/restore process when one or multiple
virtio-mem regions are involved.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2021-09-28 10:15:22 -07:00
Sebastien Boeuf
1ea63f50a1 vmm: Move MemoryRangeTable creation to the MemoryManager
The function memory_range_table() will be reused by the MemoryManager in
a following patch to describe all the ranges that we should snapshot.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2021-09-28 10:15:22 -07:00
Sebastien Boeuf
86f86c5348 vmm: Optimize migration for virtio-mem
Copy only the memory ranges that have been plugged through virtio-mem,
allowing for an interesting optimization regarding the time it takes to
migrate a large virtio-mem device. Even if the hotpluggable space is
very large (say 64GiB), if only 1GiB has been previously added to the
VM, only 1GiB will be sent to the destination VM, avoiding the transfer
of the remaining 63GiB which are unused.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2021-09-28 10:15:22 -07:00
Sebastien Boeuf
b910a7922d vmm: Fix migration when writing/reading big chunks of data
Both read_exact_from() and write_all_to() functions from the GuestMemory
trait implementation in vm-memory are buggy. They should retry until
they wrote or read the amount of data that was expected, but instead
they simply return an error when this happens. This causes the migration
to fail when trying to send important amount of data through the
migration socket, due to large memory regions.

This should be eventually fixed in vm-memory, and here is the link to
follow up on the issue: https://github.com/rust-vmm/vm-memory/issues/174

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2021-09-27 11:13:56 +02:00
Rob Bradford
1a2d0e6dd8 build: bump linux-loader from 0.3.0 to 0.4.0
Requires manual change to command line loading.

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2021-09-24 09:11:57 +00:00
Rob Bradford
0faa7afac2 vmm: Add fast path for PCI config IO port
Looking up devices on the port I/O bus is time consuming during the
boot at there is an O(lg n) tree lookup and the overhead from taking a
lock on the bus contents.

Avoid this by adding a fast path uses the hardcoded port address and
size and directs PCI config requests directly to the device.

Command line:
target/release/cloud-hypervisor --kernel ~/src/linux/vmlinux --cmdline "root=/dev/vda1 console=ttyS0" --serial tty --console off --disk path=~/workloads/focal-server-cloudimg-amd64-custom-20210609-0.raw --api-socket /tmp/api

PIO exit: 17913
PCI fast path: 17871
Percentage on fast path: 99.8%

perf before:

marvin:~/src/cloud-hypervisor (main *)$ perf report -g | grep resolve
     6.20%     6.20%  vcpu0            cloud-hypervisor    [.] vm_device:🚌:Bus::resolve

perf after:

marvin:~/src/cloud-hypervisor (2021-09-17-ioapic-fast-path *)$ perf report -g | grep resolve
     0.08%     0.08%  vcpu0            cloud-hypervisor    [.] vm_device:🚌:Bus::resolve

The compromise required to implement this fast path is bringing the
creation of the PciConfigIo device into the DeviceManager::new() so that
it can be used in the VmmOps struct which is created before
DeviceManager::create_devices() is called.

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2021-09-17 17:09:45 +01:00
Michael Zhao
253c06d3ba arch/aarch64: Add virtio-iommu device in FDT
Add a virtio-iommu node into FDT if iommu option is turned on. Now we
support only one virtio-iommu device.

Signed-off-by: Michael Zhao <michael.zhao@arm.com>
2021-09-17 12:19:46 +02:00
William Douglas
46f6d9597d vmm: Switch to using the serial_manager for serial input
This change switches from handling serial input in the VMM thread to
its own thread controlled by the SerialManager.

The motivation for this change is to avoid the VMM thread being unable
to process events while serial input is happening and vice versa.

The change also makes future work flushing the serial buffer on PTY
connections easier.

Signed-off-by: William Douglas <william.douglas@intel.com>
2021-09-17 11:15:35 +01:00
Alyssa Ross
330b5ea3be vmm: notify virtio-console of pty resizes
When a pty is resized (using the TIOCSWINSZ ioctl -- see ioctl_tty(2)),
the kernel will send a SIGWINCH signal to the pty's foreground process
group to notify it of the resize.  This is the only way to be notified
by the kernel of a pty resize.

We can't just make the cloud-hypervisor process's process group the
foreground process group though, because a process can only set the
foreground process group of its controlling terminal, and
cloud-hypervisor's controlling terminal will often be the terminal the
user is running it in.  To work around this, we fork a subprocess in a
new process group, and set its process group to be the foreground
process group of the pty.  The subprocess additionally must be running
in a new session so that it can have a different controlling
terminal.  This subprocess writes a byte to a pipe every time the pty
is resized, and the virtio-console device can listen for this in its
epoll loop.

Alternatives I considered were to have the subprocess just send
SIGWINCH to its parent, and to use an eventfd instead of a pipe.
I decided against the signal approach because re-purposing a signal
that has a very specific meaning (even if this use was only slightly
different to its normal meaning) felt unclean, and because it would
have required using pidfds to avoid race conditions if
cloud-hypervisor had terminated, which added complexity.  I decided
against using an eventfd because using a pipe instead allows the child
to be notified (via poll(2)) when nothing is reading from the pipe any
more, meaning it can be reliably notified of parent death and
terminate itself immediately.

I used clone3(2) instead of fork(2) because without
CLONE_CLEAR_SIGHAND the subprocess would inherit signal-hook's signal
handlers, and there's no other straightforward way to restore all signal
handlers to their defaults in the child process.  The only way to do
it would be to iterate through all possible signals, or maintain a
global list of monitored signals ourselves (vmm:vm::HANDLED_SIGNALS is
insufficient because it doesn't take into account e.g. the SIGSYS
signal handler that catches seccomp violations).

Signed-off-by: Alyssa Ross <hi@alyssa.is>
2021-09-14 15:43:25 +01:00
Alyssa Ross
28382a1491 virtio-devices: determine tty size in console
This prepares us to be able to handle console resizes in the console
device's epoll loop, which we'll have to do if the output is a pty,
since we won't get SIGWINCH from it.

Signed-off-by: Alyssa Ross <hi@alyssa.is>
2021-09-14 15:43:25 +01:00
Rob Bradford
d64a77a5c6 vmm: Shutdown VMM if signal thread panics
See: #3031

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2021-09-08 11:26:48 -07:00
Rob Bradford
e0d05683ab vmm: Split up functions for creating signal handler and tty setup
These are quite separate and should be in their own functions.

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2021-09-08 11:26:48 -07:00
Rob Bradford
387753ae1d vmm: Remove concept of "input_enabled"
This concept ends up being broken with multiple types on input connected
e.g. console on TTY and serial on PTY. Already the code for checking for
injecting into the serial device checks that the serial is configured.

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2021-09-08 11:26:48 -07:00
Rob Bradford
0dbb2683e3 vmm: Consolidate duplicated code for setting up signal handler
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2021-09-08 11:26:48 -07:00
Alyssa Ross
7549149bb5 vmm: ensure signal handlers run on the right thread
Despite setting up a dedicated thread for signal handling, we weren't
making sure that the signals we were listening for there were actually
dispatched to the right thread.  While the signal-hook provides an
iterator API, so we can know that we're only processing the signals
coming out of the iterator on our signal handling thread, the actual
signal handling code from signal-hook, which pushes the signals onto
the iterator, can run on any thread.  This can lead to seccomp
violations when the signal-hook signal handler does something that
isn't allowed on that thread by our seccomp policy.

To reproduce, resize a terminal running cloud-hypervisor continuously
for a few minutes.  Eventually, the kernel will deliver a SIGWINCH to
a thread with a restrictive seccomp policy, and a seccomp violation
will trigger.

As part of this change, it's also necessary to allow rt_sigreturn(2)
on the signal handling thread, so signal handlers are actually allowed
to run on it.  The fact that this didn't seem to be needed before
makes me think that signal handlers were almost _never_ actually
running on the signal handling thread.

Signed-off-by: Alyssa Ross <hi@alyssa.is>
2021-09-02 21:33:31 +01:00
Rob Bradford
c2144b5690 vmm, virtio-console: Move input reading into virtio-console thread
Move the processing of the input from stdin, PTY or file from the VMM
thread to the existing virtio-console thread. The handling of the resize
of a virtio-console has not changed but the name of the struct used to
support that has been renamed to reflect its usage.

Fixes: #3060

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2021-09-02 21:17:33 +01:00
Henry Wang
0d01eac1d4 vmm: Do the downcast of GicDevice in a safer way for AArch64
Downcasting of GicDevice trait might fail. Therefore we try to
downcast the trait first and only if the downcasting succeeded we
can then use the object to call methods. Otherwise, do nothing and
log the failure.

Signed-off-by: Henry Wang <Henry.Wang@arm.com>
2021-09-02 15:18:41 +01:00
Rob Bradford
4d2a4e2805 vmm: Handle epoll events for PTYs separately
Use two separate events for the console and serial PTY and then drive
the handling of the inputs on the PTY separately. This results in the
correct behaviour when both console and serial are attached to the PTY
as they are triggered separately on the epoll so events are not lost.

Fixes: #3012

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2021-08-25 13:33:32 +01:00
Rob Bradford
6233f6f68e vmm: Send tty input to correct destination
Check the config to find out which device is attached to the tty and
then send the input from the user into that device (serial or
virtio-console.)

Fixes: #3005

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2021-08-25 10:08:25 +01:00
Bo Chen
7d38a1848b virtio-devices, vmm: Fix the '--seccomp false' option
We are relying on applying empty 'seccomp' filters to support the
'--seccomp false' option, which will be treated as an error with the
updated 'seccompiler' crate. This patch fixes this issue by explicitly
checking whether the 'seccomp' filter is empty before applying the
filter.

Signed-off-by: Bo Chen <chen.bo@intel.com>
2021-08-18 10:42:19 +02:00
Bo Chen
08ac3405f5 virtio-devices, vmm: Move to the seccompiler crate
Fixes: #2929

Signed-off-by: Bo Chen <chen.bo@intel.com>
2021-08-18 10:42:19 +02:00
Rob Bradford
53b2e19934 vmm: Add support for hotplugging user devices
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2021-08-12 13:19:04 +01:00
Henry Wang
5a0a4bc505 arch: Add optional distance-map node to FDT
The optional device tree node distance-map describes the relative
distance (memory latency) between all NUMA nodes.

Signed-off-by: Henry Wang <Henry.Wang@arm.com>
2021-08-12 10:49:02 +02:00
Henry Wang
165364e08b vmm: Move NUMA node data structures to arch
This is to make sure the NUMA node data structures can be accessed
both from the `vmm` crate and `arch` crate.

Signed-off-by: Henry Wang <Henry.Wang@arm.com>
2021-08-12 10:49:02 +02:00
Henry Wang
20aa811de7 vmm: Extend NUMA setup to more than ACPI
The AArch64 platform provides a NUMA binding for the device tree,
which means on AArch64 platform, the NUMA setup can be extended to
more than the ACPI feature.

Based on above, this commit extends the NUMA setup and data
structures to following scenarios:

- All AArch64 platform
- x86_64 platform with ACPI feature enabled

Signed-off-by: Henry Wang <Henry.Wang@arm.com>
Signed-off-by: Michael Zhao <Michael.Zhao@arm.com>
2021-08-12 10:49:02 +02:00
Sebastien Boeuf
5a83ebce64 vmm: Notify Migratable objects about migration being complete
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2021-08-10 12:36:58 -07:00
Sebastien Boeuf
06729bb3ba vmm: Provide a restoring state to the DeviceManager
In anticipation for creating vhost-user devices in a different way when
being restored compared to a fresh start, this commit introduces a new
boolean created by the Vm depending on the use case, and passed down to
the DeviceManager. In the future, the DeviceManager will use this flag
to assess how vhost-user devices should be created.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2021-08-10 12:36:58 -07:00
Henry Wang
7fb980f17b arch, vmm: Pass cpu topology configuation to FDT
In an Arm system, the hierarchy of CPUs is defined through three
entities that are used to describe the layout of physical CPUs in
the system:

- cluster
- core
- thread

All these three entities have their own FDT node field. Therefore,
This commit adds an AArch64-specific helper to pass the config from
the Cloud Hypervisor command line to the `configure_system`, where
eventually the `create_fdt` is called.

Signed-off-by: Henry Wang <Henry.Wang@arm.com>
2021-08-05 21:19:16 +08:00
Sebastien Boeuf
5c6139bbff vmm: Finalize migration support for all devices
Make sure the DeviceManager is triggered for all migration operations.
The dirty pages are merged from MemoryManager and DeviceManager before
to be sent up to the Vmm in lib.rs.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2021-08-05 06:07:00 -07:00
Sebastien Boeuf
0411064271 vmm: Refactor migration through Migratable trait
Now that Migratable provides the methods for starting, stopping and
retrieving the dirty pages, we move the existing code to these new
functions.

No functional change intended.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2021-08-05 06:07:00 -07:00
Sebastien Boeuf
dcc646f5b1 clippy: Fix redundant allocations
With the new beta version, clippy complains about redundant allocation
when using Arc<Box<dyn T>>, and suggests replacing it simply with
Arc<dyn T>.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2021-07-29 13:28:57 +02:00
Bo Chen
b00a6a8519 vmm: Create guest memory regions with explicit dirty-pages-log flags
As we are now using an global control to start/stop dirty pages log from
the `hypervisor` crate, we need to explicitly tell the hypervisor (KVM)
whether a region needs dirty page tracking when it is created.

This reverts commit f063346de3.

Signed-off-by: Bo Chen <chen.bo@intel.com>
2021-07-28 09:08:32 -07:00
Bo Chen
ca09638491 vmm: Add CPUID compatibility check for snapshot/restore
Signed-off-by: Bo Chen <chen.bo@intel.com>
2021-07-28 09:26:02 +02:00
Bo Chen
f063346de3 vmm: Create guest memory regions without dirty-pages-log by default
With the support of dynamically turning on/off dirty-pages-log during
live-migration (only for guest RAM regions), we now can create guest
memory regions without dirty-pages-log by default both for guest RAM
regions and other regions backed by file/device.

Signed-off-by: Bo Chen <chen.bo@intel.com>
2021-07-26 09:19:35 -07:00
Bo Chen
5e0d498582 hypervisor, vmm: Add dynamic control of logging dirty pages
This patch extends slightly the current live-migration code path with
the ability to dynamically start and stop logging dirty-pages, which
relies on two new methods added to the `hypervisor::vm::Vm` Trait. This
patch also contains a complete implementation of the two new methods
based on `kvm` and placeholders for `mshv` in the `hypervisor` crate.

Fixes: #2858

Signed-off-by: Bo Chen <chen.bo@intel.com>
2021-07-26 09:19:35 -07:00
Sebastien Boeuf
3e482c9c74 vmm: Limit physical address space for TDX
When running TDX guest, the Guest Physical Address space is limited by
a shared bit that is located on bit 47 for 4 level paging, and on bit 51
for 5 level paging (when GPAW bit is 1). In order to keep things simple,
and since a 47 bits address space is 128TiB large, we ensure to limit
the physical addressable space to 47 bits when runnning TDX.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2021-07-20 15:00:04 +02:00
Sebastien Boeuf
05f7651cf5 vmm: Force VIRTIO_F_IOMMU_PLATFORM when running TDX
When running a TDX guest, we need the virtio drivers to use the DMA API
to share specific memory pages with the VMM on the host. The point is to
let the VMM get access to the pages related to the buffers pointed by
the virtqueues.

The way to force the virtio drivers to use the DMA API is by exposing
the virtio devices with the feature VIRTIO_F_IOMMU_PLATFORM. This is a
feature indicating the device will require some address translation, as
it will not deal directly with physical addresses.

Cloud Hypervisor takes care of this requirement by adding a generic
parameter called "force_iommu". This parameter value is decided based on
the "tdx" feature gate, and then passed to the DeviceManager. It's up to
the DeviceManager to use this parameter on every virtio device creation,
which will imply setting the VIRTIO_F_IOMMU_PLATFORM feature.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2021-07-20 14:47:01 +02:00
Sebastien Boeuf
6b710209b1 numa: Add optional sgx_epc_sections field to NumaConfig
This new option allows the user to define a list of SGX EPC sections
attached to a specific NUMA node.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2021-07-09 14:45:30 +02:00
Sebastien Boeuf
17c99ae00a vmm: Enable provisioning for SGX guest
The guest can see that SGX supports provisioning as it is exposed
through the CPUID. This patch enables the proper backing of this
feature by having the host open the provisioning device and enable
this capability through the hypervisor.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2021-07-07 14:56:38 +02:00
Sebastien Boeuf
5b6d424a77 arch, vmm: Fix TDVF section handling
This patch fixes a few things to support TDVF correctly.

The HOB memory resources must contain EFI_RESOURCE_ATTRIBUTE_ENCRYPTED
attribute.

Any section with a base address within the already allocated guest RAM
must not be allocated.

The list of TD_HOB memory resources should contain both TempMem and
TdHob sections as well.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2021-07-06 11:47:43 +02:00
Henry Wang
4da3bdcd6e vmm: Split restore device_manager and devices
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
2021-07-05 22:51:56 +02:00
Henry Wang
95ca4fb15e vmm: vm: Enable snapshot/restore of GICv3ITS
This commit enables the snapshot/restore of GICv3ITS in the process
of VM snapshot/restore.

Signed-off-by: Henry Wang <Henry.Wang@arm.com>
2021-07-05 22:51:56 +02:00
Wei Liu
1f2915bff0 vmm: hypervisor: split set_user_memory_region to two functions
Previously the same function was used to both create and remove regions.
This worked on KVM because it uses size 0 to indicate removal.

MSHV has two calls -- one for creation and one for removal. It also
requires having the size field available because it is not slot based.

Split set_user_memory_region to {create/remove}_user_memory_region. For
KVM they still use set_user_memory_region underneath, but for MSHV they
map to different functions.

This fixes user memory region removal on MSHV.

Signed-off-by: Wei Liu <liuwe@microsoft.com>
2021-07-05 09:45:45 +02:00
Michael Zhao
239e39ddbc vmm: Fix clippy warnings on AArch64
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
2021-06-24 08:59:53 -07:00