2024 Commits

Author SHA1 Message Date
Rob Bradford
b68add2d0d vmm: Enable THP when using anonymous memory
If the memory is not backed by a file then it is possible to enable
Transparent Huge Pages on the memory and take advantage of the benefits
of huge pages without requiring the specific allocation of an appropriate
number of huge pages.

TEST=Boot and see that in /proc/`pidof cloud-hypervisor`/smaps that the
region is now THPeligible (and that also pages are being used.)

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2022-11-09 16:51:21 +00:00
Rob Bradford
6e0bd73c90 build: Bump linux-loader from 0.6.0 to 0.7.0
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2022-11-02 11:02:00 +00:00
Bo Chen
a9ec0f33c0 misc: Fix clippy issues
Signed-off-by: Bo Chen <chen.bo@intel.com>
2022-11-02 09:41:43 +01:00
Rob Bradford
f4495de143 vmm: Improve handling of shared memory backing
As huge pages are always MAP_SHARED then where the shared memory would
be checked (for vhost-user and local migration) we can also check
instead for huge pages.

The checking is also extended to cover the memory zones based
configuration as well.

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2022-10-31 22:28:29 +00:00
Rob Bradford
99d9a3d299 vmm: memory_manager: Avoid MAP_PRIVATE CoW with VFIO for hugepages too
We can't use MAP_ANONYMOUS and still have huge pages so MAP_SHARED is
effectively required when using huge pages.

Unfortunately it is not as simple as always forcing MAP_SHARED if
hugepages is on as this might be inappropriate in the backing file case
hence why there is additional complexity of assigning to mmap_flags on
each case and the MAP_SHARED is only turned on for the anonymous file
huge page case as well as anonymous shared file case.

See: #4805

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2022-10-31 22:28:29 +00:00
Rob Bradford
df7c728399 vmm: memory_manager: Only file back memory when required
If we do not need an anonymous file backing the memory then do not
create one.

As a side effect this addresses an issue with CoW (mmap with MAP_PRIVATE
but no MAP_ANONYMOUS) when the memory is pinned for VFIO.

Fixes: #4805

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2022-10-31 22:28:29 +00:00
Rob Bradford
1e5a4e8d77 vmm: memory_manager: Split filesystem backed and anonymous RAM creation
This simplifies the code somewhat making the code paths more readable.

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2022-10-31 22:28:29 +00:00
Rob Bradford
ff3fb91ba6 vmm: Refactor creation of the FileOffset for GuestRegionMmap::new()
Create this earlier so that it is possible to pass a None in for
anonymous mappings.

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2022-10-31 22:28:29 +00:00
Jinrong Liang
cb171d4a23 device_manager: Avoid checking io_uring support when it's not needed
After testing, io_uring_is_supported() causes about 38ms of
overhead when creating virtio-blk. By modifying the position
of io_uring_is_supported(), the overhead of creating virtio-blk
is reduced to less than 1ms when we close io_uring.

Signed-off-by: Jinrong Liang <cloudliang@tencent.com>
2022-10-27 22:21:51 -07:00
Wei Liu
b99b2bc990 memory_manager: use MFD_CLOEXEC flag when creating memory fd
Until there is a need for sharing the memory fd with a child process, we
should err on the safe side to close it on exec.

Signed-off-by: Wei Liu <liuwe@microsoft.com>
2022-10-27 09:20:08 +02:00
Sebastien Boeuf
1f0e5eb66a vmm: virtio-devices: Restore every VirtioDevice upon creation
Following the new design proposal to improve the restore codepath when
migrating a VM, all virtio devices are supplied with an optional state
they can use to restore from. The restore() implementation every device
was providing has been removed in order to prevent from going through
the restoration twice.

Here is the list of devices now following the new restore design:

- Block (virtio-block)
- Net (virtio-net)
- Rng (virtio-rng)
- Fs (vhost-user-fs)
- Blk (vhost-user-block)
- Net (vhost-user-net)
- Pmem (virtio-pmem)
- Vsock (virtio-vsock)
- Mem (virtio-mem)
- Balloon (virtio-balloon)
- Watchdog (virtio-watchdog)
- Vdpa (vDPA)
- Console (virtio-console)
- Iommu (virtio-iommu)

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2022-10-24 14:17:08 +02:00
Sebastien Boeuf
157db33d65 vmm: Refactor hypervisor::Vm creation on restore
This prevents from leaking implementation details to lib.rs, and rather
keep them in vm.rs.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2022-10-24 14:17:08 +02:00
Fabiano Fidêncio
b4e3942708 api: Fix vm.add-device argument type
The add_device() function, from the device manager code, takes a
DeviceConfig as a parameter, instead of a VmAddDevice.

The change was originally done as part of 34412c9b41126 and it didn't
break Kata Containers because the VmAddDevice and DeviceConfig structs
share most of their fields, besides the optional for serialization
`pci_segment`, which is not used by the client yet.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2022-10-21 11:09:55 -07:00
Sebastien Boeuf
c52ccf3992 vmm: migration: Create destination VM right before to restore it
This is preliminary work to ensure a migrated VM is created right before
it is restored. This will be useful when moving to a design where the VM
is both created and restored simultaneously from the Snapshot.

In details, that means the MemoryManager is the object that must be
created upon receiving the config from the source VM, so that memory
content can be later received and filled into the GuestMemory.
Only after these steps happened, the snapshot is received from the
source VM, and the actual Vm object can be created from both the
snapshot and the MemoryManager previously created.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2022-10-18 17:14:29 +02:00
Rob Bradford
a75d71f2c8 vmm: Reduce logging severity for unknown MMIO/PIO device accesses
These look alarming if you are booting with the a distro kernel which is
now a recommended approach.

See: #4786

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2022-10-17 10:08:36 -07:00
Bo Chen
96209e7a16 vmm: Remove the explicit call to 'Snapshottable:restore()'
The restore path of MemoryManager is handled specially without
implementing a `Snapshottable:restore()`. Removing the explicit call to
it along the migration code path to avoid confusions.

See: #4783

Signed-off-by: Bo Chen <chen.bo@intel.com>
2022-10-17 10:07:44 -07:00
Sebastien Boeuf
099cdd2af8 virtio-devices, vmm: vdpa: Implement live migration support
Vdpa now implements the Migratable trait, which allows the device to be
added to the DeviceTree and therefore allows live migrating any vDPA
device that supports being suspended.

Given a vDPA device can't be resumed from a suspended state without
having to reset everything, we don't support pause/resume for a vDPA
device, as well as snapshot/restore (which requires resume to be
supported).

In order for the migration to work locally, reusing the same device on
the same host machine, the vhost-vdpa handler is dropped after the
snapshot has been performed, which allows the destination VM to open the
device without any conflict about the device being busy.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2022-10-13 10:03:23 +02:00
Sebastien Boeuf
22be5f9d0f vmm: Extend list of authorized ioctls for vDPA
Adding VHOST_VDPA_GET_CONFIG_SIZE and VHOST_VDPA_SUSPEND to the list of
authorized ioctls for the vmm thread.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2022-10-13 10:03:23 +02:00
Bo Chen
37c3b0429a vmm: Make MemoryManager::create_ram_region() public
So that it can be reused externally, such as for fuzzing.

Signed-off-by: Bo Chen <chen.bo@intel.com>
2022-10-12 16:09:27 +01:00
Anatol Belski
a18b08c682 seccomp: mshv: Allow create partition ioctl
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
2022-10-11 09:05:24 +01:00
Bo Chen
29cf637f3f vmm: Move 'default_serial/console()' to vm_config.rs
In this way, we have all functions related to generate default values of
vm-config structs in the same location.

Signed-off-by: Bo Chen <chen.bo@intel.com>
2022-10-07 09:13:15 -07:00
Rob Bradford
83cc554f90 vmm: Remove deprecated VmConfig::{kernel, initramfs, cmdline} members
These have been replaced by members of PayloadConfig and should be
removed in v28.0 (mentioned in v26.0 release notes.)

Fixes: #4737

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2022-10-06 14:25:29 +01:00
Rob Bradford
7d8d27c1b4 vmm: Rename queue size / number of queues constants
These constants still referenced the long removed (separate vhost-user
structs.)

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2022-10-06 14:25:29 +01:00
Rob Bradford
d692dfb8e3 vmm: Move impl Default for ... to vm_config.rs
This is consistent when considering that some structs have a
`#[derive(Default)`] so it makes sense for the default implementations
to be in the same location.

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2022-10-06 14:25:29 +01:00
Rob Bradford
7ad58457b0 vmm: Split structs from logic that make up VmConfig
Place the data structures that are required for constructing a VmConfig
into it's own module from the logic that exists to suppot them.

This is useful as a consumer of the API can now clearly see what data
structures make up the API for creating VMs.

This has no functional change and I made no attempt to clean up the
ordering (it's as in the original file) nor any other clean up.

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2022-10-06 14:25:29 +01:00
Sebastien Boeuf
89677c3181 build: Bump clap from 3.2.22 to 4.0.9
Bumps [clap](https://github.com/clap-rs/clap) from 3.2.22 to 4.0.9.
- [Release notes](https://github.com/clap-rs/clap/releases)
- [Changelog](https://github.com/clap-rs/clap/blob/master/CHANGELOG.md)
- [Commits](clap-rs/clap@v3.2.22...v4.0.9)

---
updated-dependencies:
- dependency-name: clap
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Moving to the major version 4 introduced some breaking changes which had
to be handled manually.

Fixes #4709

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2022-10-05 12:59:14 +01:00
Rob Bradford
2daab89987 vmm: Remove legacy firmware loading
This functionality was deprecated and is for removal in the upcoming
release.

Fixes: #4511

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2022-10-03 17:09:02 +01:00
Rob Bradford
1bc63e7848 vmm: Remove legacy I/O ports for ACPI
These addresses have been superseded and replaced with other I/O ports.

Fixes: #4483

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2022-10-03 17:08:57 +01:00
Bo Chen
2115a41568 openapi: Add 'firmware' to 'PayloadConfig'
This option is needed for the openapi consumer (e.g. Kata Containers) to
load firmware (e.g. td-shim) for booting.

Signed-off-by: Bo Chen <chen.bo@intel.com>
2022-10-01 08:45:21 +01:00
Rob Bradford
06eb82d239 build: Consolidate "gdb" build feature into "guest_debug"
This simplifies the CI process but also logical with the existing
functionality under "guest_debug" (dumping guest memory).

Fixes: #4679

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2022-09-27 14:30:57 +01:00
Sebastien Boeuf
3bf3cca70a vhost_user_net: Allow user to set MTU
Adding the support for the user to set the MTU for the vhost-user-net
backend, which allows the integration test to be extended with the test
of the MTU parameter.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2022-09-27 10:37:35 +01:00
Sebastien Boeuf
903c08f8a1 net: Don't override default TAP interface MTU
Adjust MTU logic such that:
1. Apply an MTU to the TAP interface if the user supplies it
2. Always query the TAP interface for the MTU and expose that.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2022-09-27 10:37:35 +01:00
Rob Bradford
b2d1dd65f3 build: Remove "fwdebug" and "common" feature flags
This simplifes the buld and checks with very little overhead and the
fwdebug device is I/O port device on 0x402 that can be used by edk2 as a
very simple character device.

See: #4679

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2022-09-26 10:16:33 -07:00
Rob Bradford
66c092e69b build: Bump linux-loader from 0.5.0 to 0.6.0
Bumps [linux-loader](https://github.com/rust-vmm/linux-loader) from 0.5.0 to 0.6.0.
- [Release notes](https://github.com/rust-vmm/linux-loader/releases)
- [Changelog](https://github.com/rust-vmm/linux-loader/blob/main/CHANGELOG.md)
- [Commits](https://github.com/rust-vmm/linux-loader/compare/v0.5.0...v0.6.0)

---
updated-dependencies:
- dependency-name: linux-loader
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2022-09-24 09:54:18 +00:00
Rob Bradford
1202b9a07a vmm: Add some tracing of boot sequence
Add tracing of the VM boot sequence from the point at which the request
to create a VM is received to the hand-off to the vCPU threads running.

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2022-09-22 18:09:31 +01:00
Sebastien Boeuf
76dbf85b79 net: Give the user the ability to set MTU
Add a new "mtu" parameter to the NetConfig structure and therefore to
the --net option. This allows Cloud Hypervisor's users to define the
Maximum Transmission Unit (MTU) they want to use for the network
interface that they create.

In details, there are two main aspects. On the one hand, the TAP
interface is created with the proper MTU if it is provided. And on the
other hand the guest is made aware of the MTU through the VIRTIO
configuration. That means the MTU is properly set on both the TAP on the
host and the network interface in the guest.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2022-09-21 16:20:57 +02:00
Sebastien Boeuf
f38056fc9e virtio-devices, vmm: Simplify virtio-mem resize operation
There's no need to delegate the resize operation to the virtio-mem
thread. This can come directly from the vmm thread which will use the
Mem object to update the VIRTIO configuration and trigger the interrupt
for the guest to be notified.

In order to achieve what's described above, the VirtioMemZone structure
now has a handle onto the Mem object directly. This avoids the need for
intermediate Resize and ResizeSender structures.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2022-09-20 13:43:40 +02:00
Rob Bradford
f32487f8e8 misc: Automatic beta clippy fixes
e.g. cargo clippy --all --tests --all-targets --fix --features=..

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2022-09-20 10:59:48 +01:00
Sebastien Boeuf
1849ffff31 vmm: Remove "amx" feature gate
Given the AMX x86 feature has been made available since kernel v5.17,
and given we don't have any test validating this feature, there's no
need to keep it behing a Rust feature gate.

Fixes #3996

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2022-09-16 15:03:31 +01:00
Rob Bradford
0e52be0909 vmm: Ensure default deserialisation for "amx" feature bit
This allows a migration from a binary not compiled with struct member to
be completed.

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2022-09-16 15:03:31 +01:00
Sebastien Boeuf
3793ffe888 vmm: config: Move TDX to rely on PayloadConfig
Removing the option --tdx to specify that we want to run a TD VM. Rely
on --platform option by adding the "tdx" boolean parameter. This is the
new way for enabling TDX with Cloud Hypervisor.

Along with this change, the way to retrieve the firmware path has been
updated to rely on the recently introduced PayloadConfig structure.

Fixes #4556

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2022-09-05 12:14:59 +01:00
Sebastien Boeuf
b3bef3adda vmm: acpi: Don't declare MMIO config space through PCI buses
The PCI buses should not declare the address space related to the MMIO
config space given it's already declared in the MCFG table and through
the motherboard device PNP0C02 in the DSDT table.

The PCI MMIO config region for the segment was being wrongly exposed as
part of the _CRS for the ACPI bus device (using Memory32Fixed). Exposing
it via this object was ineffectual as the equivalent entry in the
PNP0C02 (_SB_.MBRD) marked those ranges as not usable via the kernel.
Either way, with both devices used by the kernel, the kernel will not
try and use those memory ranges for the device BARs. However under
td-shim on TDX the PNP0C02 device is not on the permitted list of
devices so the the memory ranges were not marked as unusable resulting
in the kernel attempting to allocate BARs that collided with the PCI
MMIO configuration space.

This is based on the kernel documentation PCI/acpi-info.rst which relies
on ACPI and PCI Firmware specifications. And here are the interesting
quotes from this document:

"""
Prior to the addition of Extended Address Space descriptors, the failure
of Consumer/Producer meant there was no way to describe bridge registers
in the PNP0A03/PNP0A08 device itself. The workaround was to describe the
bridge registers (including ECAM space) in PNP0C02 catch-all devices.
With the exception of ECAM, the bridge register space is device-specific
anyway, so the generic PNP0A03/PNP0A08 driver (pci_root.c) has no need
to know about it.

PNP0C02 “motherboard” devices are basically a catch-all. There’s no
programming model for them other than “don’t use these resources for
anything else.” So a PNP0C02 _CRS should claim any address space that is
(1) not claimed by _CRS under any other device object in the ACPI
namespace and (2) should not be assigned by the OS to something else.

The address range reported in the MCFG table or by _CBA method (see
Section 4.1.3) must be reserved by declaring a motherboard resource. For
most systems, the motherboard resource would appear at the root of the
ACPI namespace (under _SB) in a node with a _HID of EISAID (PNP0C02),
and the resources in this case should not be claimed in the root PCI
bus’s _CRS. The resources can optionally be returned in Int15 E820 or
EFIGetMemoryMap as reserved memory but must always be reported through
ACPI as a motherboard resource.
"""

This change has been manually tested by running a VM with multiple
segments (4 segments), and by hotplugging an additional disk to the
segment number 2 (third segment).

From one shell:
"""
cloud-hypervisor \
    --cpus boot=1 \
    --memory size=1G \
    --kernel vmlinux \
    --cmdline "root=/dev/vda1 rw console=hvc0" \
    --disk path=jammy-server-cloudimg.raw \
    --api-socket /tmp/ch.sock \
    --platform num_pci_segments=4
"""

From another shell (after the VM is booted):
"""
ch-remote \
    --api-socket=/tmp/ch.sock \
    add-disk \
    path=test-disk.raw,id=disk2,pci_segment=2
"""

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2022-09-02 14:14:23 +02:00
Nuno Das Neves
784a3aaf3c devices: gic: use VgicConfig everywhere
Use VgicConfig to initialize Vgic.
Use Gic::create_default_config everywhere so we don't always recompute
redist/msi registers.
Add a helper create_test_vgic_config for tests in hypervisor crate.

Signed-off-by: Nuno Das Neves <nudasnev@microsoft.com>
2022-08-31 08:33:05 +01:00
Jianyong Wu
3a19573c69 vmm: unify payload load chain on AArch64 with x86_64
AArch64 can share the same way of loading payload with x86_64. It makes
the payload loading more consistent between different arches.

Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
2022-08-31 08:32:08 +01:00
Michael Zhao
b65639fad3 vmm:AArch64: move uefi_flash to memory manager
uefi_flash is used when load firmware, that is load payload depends on
device manager. move uefi_flash to memory manager can eliminate the
dependency.

Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
2022-08-31 08:32:08 +01:00
Jianyong Wu
9b1452f258 vmm:AArch64: load standalone firmware
A new firmware item has been added into payload config, we need
extend ability to load standalone firmware on AArch64.
"load_kernel" method will be the entry of image loading work including
kernel and firmware.

This change is back compatible. So, we can either load firmware through
"-kernel" like before or "-firmware".

Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
2022-08-31 08:32:08 +01:00
Jianyong Wu
2054c8699a vmm:AArch64: add load_firmware method
Later, we will load standalone firmware. So, refactor load_kernel
by abstracting load_firmware method.

Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
2022-08-31 08:32:08 +01:00
Sebastien Boeuf
8c02648ac9 vmm: device_manager: Update virtio-console for proper PTY support
Given the virtio-console is now able to buffer its output when no PTY is
connected on the other end, the device manager code is updated to enable
this. Moving the endpoint type from FilePair to PtyPair enables the
proper codepath in the virtio-console implementation, as well as
updating the PTY resize code, and forcing the PTY to always be
non-blocking.

The non-blocking behavior is required to avoid blocking the guest that
would be waiting on the virtio-console driver. When receiving an
EWOULDBLOCK error, the output will simply be redirected to the temporary
buffer so that it can be later flushed.

The PTY resize logic has been slightly modified to ensure the PTY file
descriptors are closed. It avoids the child process to keep a hold onto
the PTY device, which would have caused the PTY to believe something is
connected on the other end, which would have prevented the detection of
any new connection on the PTY.

Fixes #4521

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2022-08-30 13:47:51 +02:00
Sebastien Boeuf
a940f525a8 vmm: Move SerialBuffer to its own crate
We want to be able to reuse the SerialBuffer from the virtio-devices
crate, particularly from the virtio-console implementation. That's why
we move the SerialBuffer definition to its own crate so that it can be
accessed from both vmm and virtio-devices crates, without creating any
cyclic dependency.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2022-08-30 13:47:51 +02:00
Sebastien Boeuf
63462fd8ab vmm: serial_manager: Iterate again on EINTR
If the epoll_wait() call returns EINTR, this only means a signal has
been delivered before any of the file descriptors registered triggered
an event or before the end of the timeout (if timeout isn't -1). For
that reason, we should simply try to listen on the epoll loop again.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2022-08-30 13:47:51 +02:00