In this way, we have all functions related to generate default values of
vm-config structs in the same location.
Signed-off-by: Bo Chen <chen.bo@intel.com>
These have been replaced by members of PayloadConfig and should be
removed in v28.0 (mentioned in v26.0 release notes.)
Fixes: #4737
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This is consistent when considering that some structs have a
`#[derive(Default)`] so it makes sense for the default implementations
to be in the same location.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Place the data structures that are required for constructing a VmConfig
into it's own module from the logic that exists to suppot them.
This is useful as a consumer of the API can now clearly see what data
structures make up the API for creating VMs.
This has no functional change and I made no attempt to clean up the
ordering (it's as in the original file) nor any other clean up.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Bumps [clap](https://github.com/clap-rs/clap) from 3.2.22 to 4.0.9.
- [Release notes](https://github.com/clap-rs/clap/releases)
- [Changelog](https://github.com/clap-rs/clap/blob/master/CHANGELOG.md)
- [Commits](clap-rs/clap@v3.2.22...v4.0.9)
---
updated-dependencies:
- dependency-name: clap
dependency-type: direct:production
update-type: version-update:semver-major
...
Moving to the major version 4 introduced some breaking changes which had
to be handled manually.
Fixes#4709
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This option is needed for the openapi consumer (e.g. Kata Containers) to
load firmware (e.g. td-shim) for booting.
Signed-off-by: Bo Chen <chen.bo@intel.com>
This simplifies the CI process but also logical with the existing
functionality under "guest_debug" (dumping guest memory).
Fixes: #4679
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Adding the support for the user to set the MTU for the vhost-user-net
backend, which allows the integration test to be extended with the test
of the MTU parameter.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Adjust MTU logic such that:
1. Apply an MTU to the TAP interface if the user supplies it
2. Always query the TAP interface for the MTU and expose that.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This simplifes the buld and checks with very little overhead and the
fwdebug device is I/O port device on 0x402 that can be used by edk2 as a
very simple character device.
See: #4679
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Add tracing of the VM boot sequence from the point at which the request
to create a VM is received to the hand-off to the vCPU threads running.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Add a new feature "tracing" that enables tracing functionality via the
"tracer" crate (sadly features and crates cannot share the same name.)
Setup: tracer::start()
The main functionality is a tracer::trace_scope()! macro that will add
trace points for the duration of the scope. Tracing events are per
thread.
Finish: tracer::end() this will write the trace file (pretty printed
JSON) to a file in the current directory.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Add a new "mtu" parameter to the NetConfig structure and therefore to
the --net option. This allows Cloud Hypervisor's users to define the
Maximum Transmission Unit (MTU) they want to use for the network
interface that they create.
In details, there are two main aspects. On the one hand, the TAP
interface is created with the proper MTU if it is provided. And on the
other hand the guest is made aware of the MTU through the VIRTIO
configuration. That means the MTU is properly set on both the TAP on the
host and the network interface in the guest.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
There's no need to delegate the resize operation to the virtio-mem
thread. This can come directly from the vmm thread which will use the
Mem object to update the VIRTIO configuration and trigger the interrupt
for the guest to be notified.
In order to achieve what's described above, the VirtioMemZone structure
now has a handle onto the Mem object directly. This avoids the need for
intermediate Resize and ResizeSender structures.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Given the AMX x86 feature has been made available since kernel v5.17,
and given we don't have any test validating this feature, there's no
need to keep it behing a Rust feature gate.
Fixes#3996
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Multiple rust-vmm crates must be updated at once given the vm-memory one
has been updated and they all rely on vm-memory.
- vm-memory from 0.8.0 to 0.9.0
- vhost from 0.4.0 to 0.5.0
- virtio-queue from 0.5.0 to 0.6.0
- vhost-user-backend from 0.6.0 to 0.7.0
- linux-loader from 0.4.0 to 0.5.0
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Removing the option --tdx to specify that we want to run a TD VM. Rely
on --platform option by adding the "tdx" boolean parameter. This is the
new way for enabling TDX with Cloud Hypervisor.
Along with this change, the way to retrieve the firmware path has been
updated to rely on the recently introduced PayloadConfig structure.
Fixes#4556
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The PCI buses should not declare the address space related to the MMIO
config space given it's already declared in the MCFG table and through
the motherboard device PNP0C02 in the DSDT table.
The PCI MMIO config region for the segment was being wrongly exposed as
part of the _CRS for the ACPI bus device (using Memory32Fixed). Exposing
it via this object was ineffectual as the equivalent entry in the
PNP0C02 (_SB_.MBRD) marked those ranges as not usable via the kernel.
Either way, with both devices used by the kernel, the kernel will not
try and use those memory ranges for the device BARs. However under
td-shim on TDX the PNP0C02 device is not on the permitted list of
devices so the the memory ranges were not marked as unusable resulting
in the kernel attempting to allocate BARs that collided with the PCI
MMIO configuration space.
This is based on the kernel documentation PCI/acpi-info.rst which relies
on ACPI and PCI Firmware specifications. And here are the interesting
quotes from this document:
"""
Prior to the addition of Extended Address Space descriptors, the failure
of Consumer/Producer meant there was no way to describe bridge registers
in the PNP0A03/PNP0A08 device itself. The workaround was to describe the
bridge registers (including ECAM space) in PNP0C02 catch-all devices.
With the exception of ECAM, the bridge register space is device-specific
anyway, so the generic PNP0A03/PNP0A08 driver (pci_root.c) has no need
to know about it.
PNP0C02 “motherboard” devices are basically a catch-all. There’s no
programming model for them other than “don’t use these resources for
anything else.” So a PNP0C02 _CRS should claim any address space that is
(1) not claimed by _CRS under any other device object in the ACPI
namespace and (2) should not be assigned by the OS to something else.
The address range reported in the MCFG table or by _CBA method (see
Section 4.1.3) must be reserved by declaring a motherboard resource. For
most systems, the motherboard resource would appear at the root of the
ACPI namespace (under _SB) in a node with a _HID of EISAID (PNP0C02),
and the resources in this case should not be claimed in the root PCI
bus’s _CRS. The resources can optionally be returned in Int15 E820 or
EFIGetMemoryMap as reserved memory but must always be reported through
ACPI as a motherboard resource.
"""
This change has been manually tested by running a VM with multiple
segments (4 segments), and by hotplugging an additional disk to the
segment number 2 (third segment).
From one shell:
"""
cloud-hypervisor \
--cpus boot=1 \
--memory size=1G \
--kernel vmlinux \
--cmdline "root=/dev/vda1 rw console=hvc0" \
--disk path=jammy-server-cloudimg.raw \
--api-socket /tmp/ch.sock \
--platform num_pci_segments=4
"""
From another shell (after the VM is booted):
"""
ch-remote \
--api-socket=/tmp/ch.sock \
add-disk \
path=test-disk.raw,id=disk2,pci_segment=2
"""
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Use VgicConfig to initialize Vgic.
Use Gic::create_default_config everywhere so we don't always recompute
redist/msi registers.
Add a helper create_test_vgic_config for tests in hypervisor crate.
Signed-off-by: Nuno Das Neves <nudasnev@microsoft.com>
AArch64 can share the same way of loading payload with x86_64. It makes
the payload loading more consistent between different arches.
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
uefi_flash is used when load firmware, that is load payload depends on
device manager. move uefi_flash to memory manager can eliminate the
dependency.
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
A new firmware item has been added into payload config, we need
extend ability to load standalone firmware on AArch64.
"load_kernel" method will be the entry of image loading work including
kernel and firmware.
This change is back compatible. So, we can either load firmware through
"-kernel" like before or "-firmware".
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
Later, we will load standalone firmware. So, refactor load_kernel
by abstracting load_firmware method.
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
Given the virtio-console is now able to buffer its output when no PTY is
connected on the other end, the device manager code is updated to enable
this. Moving the endpoint type from FilePair to PtyPair enables the
proper codepath in the virtio-console implementation, as well as
updating the PTY resize code, and forcing the PTY to always be
non-blocking.
The non-blocking behavior is required to avoid blocking the guest that
would be waiting on the virtio-console driver. When receiving an
EWOULDBLOCK error, the output will simply be redirected to the temporary
buffer so that it can be later flushed.
The PTY resize logic has been slightly modified to ensure the PTY file
descriptors are closed. It avoids the child process to keep a hold onto
the PTY device, which would have caused the PTY to believe something is
connected on the other end, which would have prevented the detection of
any new connection on the PTY.
Fixes#4521
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
We want to be able to reuse the SerialBuffer from the virtio-devices
crate, particularly from the virtio-console implementation. That's why
we move the SerialBuffer definition to its own crate so that it can be
accessed from both vmm and virtio-devices crates, without creating any
cyclic dependency.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
If the epoll_wait() call returns EINTR, this only means a signal has
been delivered before any of the file descriptors registered triggered
an event or before the end of the timeout (if timeout isn't -1). For
that reason, we should simply try to listen on the epoll loop again.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
We must limit how much the buffer can grow, otherwise this could lead
the process to consume all the memory on the machine. This could happen
if the output from the guest was very important and nothing would
connect to the PTY for a long time.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Set the maximum number of HW breakpoints according to the value returned
from `Hypervisor::get_guest_debug_hw_bps()`.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
On AArch64, `translate_gva` API is not provided by KVM. We implemented
it in VMM by walking through translation tables.
Address translation is big topic, here we only focus the scenario that
happens in VMM while debugging kernel. This `translate_gva`
implementation is restricted to:
- Exception Level 1
- Translate high address range only (kernel space)
This implementation supports following Arm-v8a features related to
address translation:
- FEAT_LPA
- FEAT_LVA
- FEAT_LPA2
The implementation supports page sizes of 4KiB, 16KiB and 64KiB.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
The goal of this patch is to provide a reliable way to detect when the
other end of the PTY is connected, and therefore be able to identify
when we can write to the PTY device. This is needed because writing to
the PTY device when the other end isn't connected causes the loss of
the written bytes.
The way to detect the connection on the other end of the PTY is by
knowing the other end is disconnected at first with the presence of the
EPOLLHUP event. Later on, when the connection happens, EPOLLHUP is not
triggered anymore, and that's when we can assume it's okay to write to
the PTY main device.
It's important to note we had to ensure the file descriptor for the
other end was closed, otherwise we would have never seen the EPOLLHUP
event. And we did so by removing the "sub" field from the PtyPair
structure as it was keeping the associated File opened.
Fixes#3170
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Since our firmware files are still designed to be used via PVH use the
load_kernel() function to load the firmware falling back to legacy
firmware loading if necessary.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Adding new I/O ports for both the ACPI shutdown and the ACPI PM timer
devices so they can be triggered from both addresses. The reason for
this change is that TDX expects only certain I/O ports to be enabled
based on what QEMU exposes. We follow this to avoid new ports from being
opened exclusively for Cloud Hypervisor.
We have to keep the former I/O ports available given all firmwares
haven't been updated yet. Once we reach a point where we know both Rust
Hypervisor Firmware, OVMF, TDVF and TDSHIM have been updated with the
new port values, we'll be able to remove the former ports.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The old API remains usable, and will remain usable for two releases but
we should only advertise the new API.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Introduce a new top level member of VmConfig called PayloadConfig that
(currently) encompasses the kernel, commandline and initramfs for the
guest to use.
In future this can be extended for firmware use. The existing
"--kernel", "--cmdline" and "initramfs" CLI parameters now fill the
PayloadConfig.
Any config supplied which uses the now deprecated config members have
those members mapped to the new version with a warning.
See: #4445
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
By checking in the validation logic we get checking for both devices
specified in the initial config but also hotplug too.
Fixes: #4453
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The uuid indicates the unique ID of a virtual machine.
cloud-hypervisor takes the uuid passed by libvirt
and uses it to initialize cloud-init.
Signed-off-by: lizhaoxin1 <Lxiaoyouling@163.com>
The parameter "poll_queue" was useful at the time Cloud Hypervisor was
responsible for spawning vhost-user backends, as it was carrying the
information the vhost-user-block backend should have this option enabled
or not.
It's been quite some time that we walked away from this design, as we
now expect a management layer to be responsible for running vhost-user
backends.
That's the reason why we can remove "poll_queue" from the DiskConfig
structure.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The new virtio-queue version introduced some breaking changes which need
to be addressed so that Cloud Hypervisor can still work with this
version.
The most important change is about removing a handle to the guest memory
from the Queue, meaning the caller has to provide the guest memory
handle for multiple methods from the QueueT trait.
One interesting aspect is that QueueT has been widely extended to
provide every getter and setter we need to access and update the Queue
structure without having direct access to its internal fields.
This patch ports all the virtio and vhost-user devices to this new crate
definition. It also updates both vhost-user-block and vhost-user-net
backends based on the updated vhost-user-backend crate. It also updates
the fuzz directory.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
When starting the VM such that it is already on a breakpoint (via
stop_on_boot) when attached to gdb then start the vCPUs in a paused
state rather than starting the vCPUs later (upon resume).
Further, make the resumption/break of the VM more resilient by only
attempting to resume the vCPUs if were are already in a break point and
only attempting to pause/break if we were already running.
Fixes: #4354
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Remove the hardcoded addresses.
Also remove PM_TMR_BLK as spec compliant implementation will use
X_PM_TMR_BLK over this field.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The original code uses kvm_device_attr directly outside of the
hyeprvisor crate. That leaks hypervisor details.
No functional change intended.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
This requires making get/set_lapic_reg part of the type.
For the moment we cannot provide a default variant for the new type,
because picking one will be wrong for the other hypervisor, so I just
drop the test cases that requires LapicState::default().
Signed-off-by: Wei Liu <liuwe@microsoft.com>
CpuId is an alias type for the flexible array structure type over
CpuIdEntry. The type itself and the type of the element in the array
portion are tied to the underlying hypervisor.
Switch to using CpuIdEntry slice or vector directly. The construction of
CpuId type is left to hypervisors.
This allows us to decouple CpuIdEntry from hypervisors more easily.
No functional change intended.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
We only need to do this for x86 since MSHV does not have aarch64 support
yet. This reduces unnecessary code churn.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
VmState was introduced to hold hypervisor specific VM state. KVM does
not need it and MSHV does not really use it yet.
Just drop the code. It can be easily revived once there is a need.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
Previously, we were assuming that every time an eventfd notified us,
there was only a single event waiting for us. This meant that if,
while one API request was being processed, two more arrived, the
second one would not be processed (until the next one arrived, when it
would be processed instead of that event, and so on). To fix this,
make sure we're processing the number of API and debug requests we've
been told have arrived, rather than just one. This is easy to
demonstrate by sending lots of API events and adding some sleeps to
make sure multiple events can arrive while each is being processed.
For other uses of eventfd, like the exit event, this doesn't matter —
even if we've received multiple exit events in quick succession, we
only need to exit once. So I've only made this change where receiving
an event is non-idempotent, i.e. where it matters that we process the
event the right number of times.
Technically, reset requests are also non-idempotent — there's an
observable difference between a VM resetting once, and a VM resetting
once and then immediately resetting again. But I've left that alone
for now because two resets in immediate succession doesn't sound like
something anyone would ever want to me.
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Function `system_registers` took mutable vector reference and modified
the vector content. Now change the definition to `get/set` style.
And rename to `get/set_sys_regs` to align with other functions.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
On AArch64, the function `core_registers` and `set_core_registers` are
the same thing of `get/set_regs` on x86_64. Now the names are aligned.
This will benefit supporting `gdb`.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
The VM specific signal (currently only SIGWINCH) should only be handled
when the VM is running.
The generic VMM signals (SIGINT and SIGTERM) need handling at all times.
Split the signal handling into two separate threads which have differing
lifetimes.
Tested by:
1.) Boot full VM and check resize handling (SIGWINCH) works & sending
SIGTERM leads to cleanup (tested that API socket is removed.)
2.) Start without a VM and send SIGTERM/SIGINT and observe cleanup (API
socket removed)
3.) Boot full VM, delete VM and observe 2.) holds.
4.) Boot full VM, delete VM, recreate VM and observe 1.) holds.
Fixes: #4269
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
And along with virtio-queue, we must also bump vhost-user-backend from
0.3.0 to 0.5.0 (since it relies on virtio-queue 0.4.0).
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The Linux kernel now checks for this before marking CPUs as
hotpluggable:
commit aa06e20f1be628186f0c2dcec09ea0009eb69778
Author: Mario Limonciello <mario.limonciello@amd.com>
Date: Wed Sep 8 16:41:46 2021 -0500
x86/ACPI: Don't add CPUs that are not online capable
A number of systems are showing "hotplug capable" CPUs when they
are not really hotpluggable. This is because the MADT has extra
CPU entries to support different CPUs that may be inserted into
the socket with different numbers of cores.
Starting with ACPI 6.3 the spec has an Online Capable bit in the
MADT used to determine whether or not a CPU is hotplug capable
when the enabled bit is not set.
Link: https://uefi.org/htmlspecs/ACPI_Spec_6_4_html/05_ACPI_Software_Programming_Model/ACPI_Software_Programming_Model.html?#local-apic-flags
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Rob Bradford <robert.bradford@intel.com>