The way to create ACPI tables for TDX is different as each table must be
passed through the HOB. This means the XSDT table is not required since
the firmware will take care of creating it. Same for RSDP, this is
firmware responsibility to provide it to the guest.
That's why this patch creates a TDX dedicated function, returning a list
of Sdt objects, which will let the calling code copy the content of each
table through the HOB.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In case of TDX, we don't want to create the ACPI tables the same way we
do for all the other use cases. That's because the ACPI tables don't
need to be written to guest memory at a specific address, instead they
are passed directly through the HOB.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
It's been decided the ACPI tables will be passed to the firmware in a
different way, rather than using TD_VMM_DATA. Since TD_VMM_DATA was
introduced for this purpose, there's no reason to keep it in our
codebase.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This table is listed as required in the ARM Base Boot Requirements
document. The particular need arises to make the serial debugging of
Windows guest functional.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Ensure all pending virtio activations (as triggered by MMIO write on the
vCPU threads leading to a barrier wait) are completed before pausing the
vCPUs as otherwise there will a deadlock with the VMM waiting for the
vCPU to acknowledge it's pause and the vCPU waiting for the VMM to
activate the device and release the barrier.
Fixes: #3585
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
These are the leftovers from the commit 8155be2:
arch: aarch64: vm_memory is not required when configuring vcpu
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
When in local migration mode send the FDs for the guest memory over the
socket along with the slot that the FD is associated with. This removes
the requirement for copying the guest RAM and gives significantly faster
live migration performance (of the order of 3s to 60ms).
Fixes: #3566
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Create the VM using the FDs (wrapped in Files) that have been received
during the migration process.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
If this FD (wrapped in a File) is supplied when the RAM region is being
created use that over creating a new one.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This function is used to open an FD (wrapped in a File) that points to
guest memory from memfd_create() or backed on the filesystem.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Drop the unused parameter throughout the code base.
Also take the chance to drop a needless clone.
No functional change intended.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
We've been currently using "fd" as the field name, but it should be
called "fds" since 6664e5a6e7 introduced
the name change on the structure field.
Fixes: #3560
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
`tap` has its default value set to `None`, but in the openapi yaml file
we've been setting it to `""`.
When using this code on the Kata Containers side we'd be hit by a non
expected behaviour of cloud-hypervisor, as even when using a different
method to initialise the `tuntap` device the code would be treated as if
using `--net tap` (which is a valid use-case).
Related: #3554
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
warning: unnecessary use of `to_string`
--> vmm/src/config.rs:2199:38
|
2199 | ... .get(&memory_zone.to_string())
| ^^^^^^^^^^^^^^^^^^^^^^^^ help: use: `memory_zone`
|
= note: `#[warn(clippy::unnecessary_to_owned)]` on by default
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#unnecessary_to_owned
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
warning: unneeded late initalization
--> vmm/src/acpi.rs:525:5
|
525 | let mut prev_tbl_len: u64;
| ^^^^^^^^^^^^^^^^^^^^^^^^^^
|
= note: `#[warn(clippy::needless_late_init)]` on by default
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#needless_late_init
help: declare `prev_tbl_len` here
|
552 | let mut prev_tbl_len: u64 = madt.len() as u64;
| ~~~~~~~~~~~~~~~~~~~~~~~~~
warning: unneeded late initalization
--> vmm/src/acpi.rs:526:5
|
526 | let mut prev_tbl_off: GuestAddress;
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#needless_late_init
help: declare `prev_tbl_off` here
|
553 | let mut prev_tbl_off: GuestAddress = madt_offset;
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This crate was used in the integration tests to allow the tests to
continue and clean up after a failure. This isn't necessary in the unit
tests and adds a large build dependency chain including an unmaintained
crate.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
If the underlying kernel is old PTY resize is disabled and this is
represented by the use of None in the provided Option<File> type. In the
virtio-console PTY path don't blindly unwrap() the value that will be
preserved across a reboot.
Fixes: #3496
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
It's important to maintain the ability to run in a non-TDX environment
a Cloud Hypervisor binary with the 'tdx' feature enabled.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Setting up the SIGWINCH handler requires at least Linux 5.7. However
this functionality is not required for basic PTY operation.
Fixes: #3456
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Advertise the PCI MMIO config spaces here so that the MMIO config space
is correctly recognised.
Tested by: --platform num_pci_segments=1 or 16 hotplug NVMe vfio-user device
works correctly with hypervisor-fw & OVMF and direct kernel boot.
Fixes: #3432
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
When restoring replace the internal value of the device tree rather than
replacing the Arc<Mutex<DeviceTree>> itself. This is fixes an issue
where the AddressManager has a copy of the the original
Arc<Mutex<DeviceTree>> from when the DeviceManager was created. The
original restore path only replaced the DeviceManager's version of the
Arc<Mutex<DeviceTree>>. Instead replace the contents of the
Arc<Mutex<DeviceTree>> so all users see the updated version.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Saving the KVM clock and restoring it is key for correct behaviour of
the VM when doing snapshot/restore or live migration. The clock is
restored to the KVM state as part of the Vm::resume() method prior to
that it must be extracted from the state object and stored for later use
by this method. This change simplifies the extraction and storage part
so that it is done in the same way for both snapshot/restore and live
migration.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The constant `PCI_MMIO_CONFIG_SIZE` defined in `vmm/pci_segment.rs`
describes the MMIO configuation size for each PCI segment. However,
this name conflicts with the `PCI_MMCONFIG_SIZE` defined in `layout.rs`
in the `arch` crate, which describes the memory size of the PCI MMIO
configuration region.
Therefore, this commit renames the `PCI_MMIO_CONFIG_SIZE` to
`PCI_MMIO_CONFIG_SIZE_PER_SEGMENT` and moves this constant from `vmm`
crate to `arch` crate.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
Currently, a tuple containing PCI space start address and PCI space
size is used to pass the PCI space information to the FDT creator.
In order to support the multiple PCI segment for FDT, more information
such as the PCI segment ID should be passed to the FDT creator. If we
still use a tuple to store these information, the code flexibility and
readablity will be harmed.
To address this issue, this commit replaces the tuple containing the
PCI space information to a structure `PciSpaceInfo` and uses a vector
of `PciSpaceInfo` to store PCI space information for each segment, so
that multiple PCI segment information can be passed to the FDT together.
Note that the scope of this commit will only contain the refactor of
original code, the actual multiple PCI segments support will be in
following series, and for now `--platform num_pci_segments` should only
be 1.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
In order to avoid the identity map region to conflict with a possible
firmware being placed in the last 4MiB of the 4GiB range, we must set
the address to a chosen location. And it makes the most sense to have
this region placed right after the TSS region.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Place the 3 page TSS at an explicit location in the 32-bit address space
to avoid conflicting with the loaded raw firmware.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Added fields:
- `Memory address size limit`: the missing of this field triggered
warnings in guest kernel
- `Node ID`
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
After introducing multiple PCI segments, the `devid` value in
`kvm_irq_routing_entry` exceeds the maximum supported range on AArch64.
This commit restructed the `devid` to the allowed range.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
If the provided binary isn't an ELF binary assume that it is a firmware
to be loaded in directly. In this case we shouldn't program any of the
registers as KVM starts in that state.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
That function call can return -1 when it fails. Wrapping -1 into File
causes the code to panic when the File is dropped.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
I encountered some trouble trying to use a virtio-console hooked up to
a PTY. Reading from the PTY would produce stuff like this
"\n\nsh-5.1# \n\nsh-5.1# " (where I'm just pressing enter at a shell
prompt), and a terminal would render that like this:
----------------------------------------------------------------
sh-5.1#
sh-5.1#
----------------------------------------------------------------
This was because we weren't disabling the ICRNL termios iflag, which
turns carriage returns (\r) into line feeds (\n). Other raw mode
implementations (like QEMU's) set this flag, and don't have this
problem.
Instead of fixing our raw mode implementation to just disable ICRNL,
or copy the flags from QEMU's, though, here I've changed it to use the
raw mode implementation in libc. It seems to work correctly in my
testing, and means we don't have to worry about what exactly raw mode
looks like under the hood any more.
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Fix seccomp violation when trying to add the out FD to the epoll loop
when the serial buffer needs to be flushed.
0x00007ffff7dc093e in epoll_ctl () at ../sysdeps/unix/syscall-template.S:120
0x0000555555db9b6d in epoll::ctl (epfd=56, op=epoll::ControlOptions::EPOLL_CTL_MOD, fd=55, event=...)
at /home/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/epoll-4.3.1/src/lib.rs:155
0x00005555556f5127 in vmm::serial_buffer::SerialBuffer::add_out_poll (self=0x7fffe800b5d0) at vmm/src/serial_buffer.rs:101
0x00005555556f583d in vmm::serial_buffer::{impl#1}::write (self=0x7fffe800b5d0, buf=...) at vmm/src/serial_buffer.rs:139
0x0000555555a30b10 in std::io::Write::write_all<vmm::serial_buffer::SerialBuffer> (self=0x7fffe800b5d0, buf=...)
at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/io/mod.rs:1527
0x0000555555ab82fb in devices::legacy::serial::Serial::handle_write (self=0x7fffe800b520, offset=0, v=13) at devices/src/legacy/serial.rs:217
0x0000555555ab897f in devices::legacy::serial::{impl#2}::write (self=0x7fffe800b520, _base=1016, offset=0, data=...) at devices/src/legacy/serial.rs:295
0x0000555555f30e95 in vm_device:🚌:Bus::write (self=0x7fffe8006ce0, addr=1016, data=...) at vm-device/src/bus.rs:235
0x00005555559406d4 in vmm::vm::{impl#4}::pio_write (self=0x7fffe8009640, port=1016, data=...) at vmm/src/vm.rs:459
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
When running with `--serial pty --console pty --seccomp=false` the
SIGWICH listener thread would panic as the seccomp filter was empty.
Adopt the mechanism used in the rest of the code and check for non-empty
filter before trying to apply it.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
With the introduction of a new option `affinity` to the `cpus`
parameter, Cloud Hypervisor can now let the user choose the set
of host CPUs where to run each vCPU.
This is useful when trying to achieve CPU pinning, as well as making
sure the VM runs on a specific NUMA node.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Give the option parser the ability to handle tuples with inner brackets
containing list of integers. The following example can now be handled
correctly "option=[key@[v1-v2,v3,v4]]" which means the option is
assigned a tuple with a key associated with a list of integers between
the range v1 - v2, as well as v3 and v4.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Always properly initialize vectors so that we don't run in undefined
behaviors when the vector gets dropped.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Creates a new generic type Tuple so that the same implementation of
FromStr trait can be reused for both parsing a list of two integers and
parsing a list of one integer associated with a list of integers.
This anticipates the need for retrieving sublists, which will be needed
when trying to describe the host CPU affinity for every vCPU.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The elements of a list should be using commas as the correct delimiter
now that it is supported. Deprecate use of colons as delimiter.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This allocator allocates 64-bit MMIO addresses for use with platform
devices e.g. ACPI control devices and ensures there is no overlap with
PCI address space ranges which can cause issues with PCI device
remapping.
Use this allocator the ACPI platform devices.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Rather than use the system MMIO allocator for RAM use an allocator that
covers the full RAM range.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This is because the SGX region will be placed between the end of ram and
the start of the device area.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
With the segment id now encoded in the bdf it is not necessary to have
the separate field for it.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
In particular use the accessor for getting the device id from the bdf.
As a side effect the VIOT table is now segment aware.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Since each segment must have a non-overlapping memory range associated
with it the device memory must be equally divided amongst all segments.
A new allocator is used for each segment to ensure that BARs are
allocated from the correct address ranges. This requires changes to
PciDevice::allocate/free_bars to take that allocator and when
reallocating BARs the correct allocator must be identified from the
ranges.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
For all the devices that support being hotplugged (disk, net, pmem, fs
and vsock) add "pci_segment" option and propagate that through to the
addition onto the PCI busses.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Move the decision on whether to use a 64-bit bar up to the DeviceManager
so that it can use both the device type (e.g. block) and the PCI segment
ID to decide what size bar should be used.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Generate a set of 8 IRQs and round-robin distribute those over all the
slots for a bus. This same set of IRQs is then used for all PCI
segments.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The platform config may specify a number of PCI segments to use, if this
greater than 1 then we add supplemental PCI segments as well as the
default segment.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This currently contains only the number over PCI segments to create.
This is limited to 16 at the moment which should allow 496 user specified
PCI devices.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
For the bus scanning the GED AML code now calls into a PSCN method that
scans all buses. This approach was chosen since it handles the case
correctly where one GED interrupt is services for two hotplugs on
distinct segments.
The PCIU and PCID field values are now determined by the PSEG field that
is uses to select which segment those values should be used for.
Similarly _EJ0 will notify based on the value of _SEG.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Replace the hardcoded zero PCI segment id when adding devices to the bus
and extend the DeviceTree to hold the PCI segment id.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Since each segment must have disjoint address spaces only advertise
address space in the 32-bit range and the PIO address space on the
default (zero) segment.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Split PciSegment::new_default_segment() into a separate
PciSegment::new() and those parts required only for the default segment
(PIO PCI config device.)
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
For now this still contains just one segment but is expanding in
preparation for more segments.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This commit moves the code that generates the DSDT data for the PCI bus
into PciSegment making no functional changes to the generated AML.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Variables that start with underscore are used to silence rustc.
Normally those variables are not used in code.
This patch drops the underscore from variables that are used. This is
less confusing to readers.
No functional change intended.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
This makes it much easier to use since the info!() level produces far
fewer messages and thus has less overhead.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Relying on the vm-virtio/virtio-queue crate from rust-vmm which has been
copied inside the Cloud Hypervisor tree, the entire codebase is moved to
the new definition of a Queue and other related structures.
The reason for this move is to follow the upstream until we get some
agreement for the patches that we need on top of that to make it
properly work with Cloud Hypervisor.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
When allocating PCI MMIO BARs they should always be naturally aligned
(i.e. aligned to the size of the BAR itself.)
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The allocate_bars method has a side effect which collates the BARs used
for the device and stores them internally. Ensure that any use of this
internal state is after the state is created otherwise no MMIO regions
will be seen and so none will be mapped.
Fixes: #3237
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Instead of creating a MemoryManager from scratch, let's reuse the same
code path used by snapshot/restore, so that memory regions are created
identically to what they were on the source VM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Now that all the pieces are in place, we can restore a VM with the new
codepath that restores properly all memory regions, allowing for ACPI
memory hotplug to work properly with snapshot/restore feature.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Extending the MemoryManager::new() function to be able to create a
MemoryManager from data that have been previously stored instead of
always creating everything from scratch.
This change brings real added value as it allows a VM to be restored
respecting the proper memory layout instead of hoping the regions will
be created the way they were before.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Storing multiple data coming from the MemoryManager in order to be able
to restore without creating everything from scratch.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This new function will be able to restore memory regions and memory
zones based on the GuestMemoryMapping list that will be provided through
snapshot/restore and migration phases.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This can help identifying which zone relates to which memory range.
This is going to be useful when recreating GuestMemory regions from
the previous layout instead of having to recreate everything from
scratch.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Create a dedicated function to factorize the allocation of the memory
ranges, and helping with the simplification of MemoryManager::new()
function.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
By updating the list of GuestMemory regions with the virtio-mem ones
before the creation of the MemoryManager, we know the GuestMemory is up
to date and the allocation of memory ranges is simplified afterwards.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In order to simplify MemoryManager::new() function. let's move the
memory configuration validation to its own function.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Move the PciSegment struct and the associated code to a new file. This
will allow some clearer separation between the core DeviceManager and
PCI handling.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Move the PCI related state from the DeviceManager struct to a PciSegment
struct inside the DeviceManager. This is in preparation for multiple
segment support. Currently this state is just the bus itself, the MMIO
and PIO config devices and hotplug related state.
The main change that this required is using the Arc<Mutex<PciBus>> in
the device addition logic in order to ensure that
the bus could be created earlier.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
When using PVH for booting (which we use for all firmwares and direct
kernel boot) the Linux kernel does not configure LA57 correctly. As such
we need to limit the address space to the maximum 4-level paging address
space.
If the user knows that their guest image can take advantage of the
5-level addressing and they need it for their workload then they can
increase the physical address space appropriately.
This PR removes the TDX specific handling as the new address space limit
is below the one that that code specified.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Whenever running TDX, we must pass the ACPI tables to the TDVF firmware
running in the guest. The proper way to do this is by adding the tables
to the TdHob as a TdVmmData type, so that TDVF will know how to access
these tables and expose them to the guest OS.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Instead of having the ACPI tables being created both in x86_64 and
aarch64 implementations of configure_system(), we can remove the
duplicated code by moving the ACPI tables creation in vm.rs inside the
boot() function.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The argument `prefault` is provided in MemoryManager, but it can
only be used by SGX and restore.
With prefault (MAP_POPULATE) been set, subsequent page faults will
decrease during running, although it will make boot slower.
This commit adds `prefault` in MemoryConfig and MemoryZoneConfig.
To resolve conflict between memory and restore, argument
`prefault` has been changed from `bool` to `Option<bool>`, when
its value is None, config from memory will be used, otherwise
argument in Option will be used.
Signed-off-by: Yu Li <liyu.yukiteru@bytedance.com>
By using a single file for storing the memory ranges, we simplify the
way snapshot/restore works by avoiding multiples files, but the main and
more important point is that we have now a way to save only the ranges
that matter. In particular, the ranges related to virtio-mem regions are
not always fully hotplugged, meaning we don't want to save the entire
region. That's where the usage of memory ranges is interesting as it
lets us optimize the snapshot/restore process when one or multiple
virtio-mem regions are involved.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The function memory_range_table() will be reused by the MemoryManager in
a following patch to describe all the ranges that we should snapshot.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Copy only the memory ranges that have been plugged through virtio-mem,
allowing for an interesting optimization regarding the time it takes to
migrate a large virtio-mem device. Even if the hotpluggable space is
very large (say 64GiB), if only 1GiB has been previously added to the
VM, only 1GiB will be sent to the destination VM, avoiding the transfer
of the remaining 63GiB which are unused.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
By creating the BlocksState object in the MemoryManager, we can directly
provide it to the virtio-mem device when being created. This will allow
the MemoryManager through each VirtioMemZone to have a handle onto the
blocks that are plugged at any point in time.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This will be helpful to support the creation of a MemoryRangeTable from
virtio-mem, as it uses 2M pages.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Adding the snapshot/restore support along with migration as well,
allowing a VM with virtio-mem devices attached to be properly
migrated.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The amount of memory plugged in the virtio-mem region should always be
kept up to date in the hotplugged_size field from VirtioMemZone.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
There's no need to duplicate the GuestMemory for snapshot purpose, as we
always have a handle onto the GuestMemory through the guest_memory
field.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Since we only support a single PCI bus right now advertise only a single
bus in the ACPI tables. This reduces the number of VM exits from probing
substantially.
Number of PCI config I/O port exits: 17871 -> 1551 (91% reduction) with
direct kernel boot.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Use a simpler method for extracting the affected slot on the eject
command. Also update the terminology to reflect that this a slot rather
than a bdf (which is what device id refers to elsewhere.)
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Refactor the serial buffer handling in order to write the serial
buffer's output to a PTY connected after the serial device stops being
written to by the guest.
This change moves the serial buffer initialization inside the serial
manager. That is done to allow the serial buffer to be made aware of
the PTY and epoll fds needed in order to modify the
EpollDispatch::File trigger. These are then used by the serial buffer
to trigger an epoll event when the PTY fd is writable and the buffer
has content in it. They are also used to remove the trigger when the
buffer is emptied in order to avoid unnecessary wake-ups.
Signed-off-by: William Douglas <william.douglas@intel.com>
Both read_exact_from() and write_all_to() functions from the GuestMemory
trait implementation in vm-memory are buggy. They should retry until
they wrote or read the amount of data that was expected, but instead
they simply return an error when this happens. This causes the migration
to fail when trying to send important amount of data through the
migration socket, due to large memory regions.
This should be eventually fixed in vm-memory, and here is the link to
follow up on the issue: https://github.com/rust-vmm/vm-memory/issues/174
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Implement the infrastructure that lets a virtio-mem device map the guest
memory into the device. This is necessary since with virtio-mem zones
memory can be added or removed and the vfio-user device must be
informed.
Fixes: #3025
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
By moving this from the VfioUserPciDevice to DeviceManager the client
can be reused for handling DMA mapping behind an IOMMU.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
For vfio-user the mapping handler is per device and needs to be removed
when the device in unplugged.
For VFIO the mapping handler is for the default VFIO container (used
when no vIOMMU is used - using a vIOMMU does not require mappings with
virtio-mem)
To represent these two use cases use an enum for the handlers that are
stored.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Looking up devices on the port I/O bus is time consuming during the
boot at there is an O(lg n) tree lookup and the overhead from taking a
lock on the bus contents.
Avoid this by adding a fast path uses the hardcoded port address and
size and directs PCI config requests directly to the device.
Command line:
target/release/cloud-hypervisor --kernel ~/src/linux/vmlinux --cmdline "root=/dev/vda1 console=ttyS0" --serial tty --console off --disk path=~/workloads/focal-server-cloudimg-amd64-custom-20210609-0.raw --api-socket /tmp/api
PIO exit: 17913
PCI fast path: 17871
Percentage on fast path: 99.8%
perf before:
marvin:~/src/cloud-hypervisor (main *)$ perf report -g | grep resolve
6.20% 6.20% vcpu0 cloud-hypervisor [.] vm_device:🚌:Bus::resolve
perf after:
marvin:~/src/cloud-hypervisor (2021-09-17-ioapic-fast-path *)$ perf report -g | grep resolve
0.08% 0.08% vcpu0 cloud-hypervisor [.] vm_device:🚌:Bus::resolve
The compromise required to implement this fast path is bringing the
creation of the PciConfigIo device into the DeviceManager::new() so that
it can be used in the VmmOps struct which is created before
DeviceManager::create_devices() is called.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The MSI IOVA address on X86 and AArch64 is different.
This commit refactored the code to receive the MSI IOVA address and size
from device_manager, which provides the actual IOVA space data for both
architectures.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
Add a virtio-iommu node into FDT if iommu option is turned on. Now we
support only one virtio-iommu device.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
This change switches from handling serial input in the VMM thread to
its own thread controlled by the SerialManager.
The motivation for this change is to avoid the VMM thread being unable
to process events while serial input is happening and vice versa.
The change also makes future work flushing the serial buffer on PTY
connections easier.
Signed-off-by: William Douglas <william.douglas@intel.com>
This change adds a SerialManager with its own epoll handling that
should be created and run by the DeviceManager when creating an
appropriately configured console (serial tty or pty).
Both stdin and pty input are handled by the SerialManager. The stdin
and pty specific methods used by the VMM should be removed in a future
commit.
Signed-off-by: William Douglas <william.douglas@intel.com>
The clone method for PtyPair should have been an impl of the Clone
trait but the method ended up not being used. Future work will make
use of the trait however so correct the missing trait implementation.
Signed-off-by: William Douglas <william.douglas@intel.com>
For most use cases, there is no need to create multiple VFIO containers
as it causes unwanted behaviors. Especially when passing multiple
devices from the same IOMMU group, we need to use the same container so
that it can properly list the groups that have been already opened. The
correct logic was already there in vfio-ioctls, but it was incorrectly
used from our VMM implementation.
For the special case where we put a VFIO device behind a vIOMMU, we must
create one container per device, as we need to control the DMA mappings
per device, which is performed at the container level. Because we must
keep one container per device, the vIOMMU use case prevents multiple
devices attached to the same IOMMU group to be passed through the VM.
But this is a limitation that we are fine with, especially since the
vIOMMU doesn't let us group multiple devices in the same group from a
guest perspective.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
When a pty is resized (using the TIOCSWINSZ ioctl -- see ioctl_tty(2)),
the kernel will send a SIGWINCH signal to the pty's foreground process
group to notify it of the resize. This is the only way to be notified
by the kernel of a pty resize.
We can't just make the cloud-hypervisor process's process group the
foreground process group though, because a process can only set the
foreground process group of its controlling terminal, and
cloud-hypervisor's controlling terminal will often be the terminal the
user is running it in. To work around this, we fork a subprocess in a
new process group, and set its process group to be the foreground
process group of the pty. The subprocess additionally must be running
in a new session so that it can have a different controlling
terminal. This subprocess writes a byte to a pipe every time the pty
is resized, and the virtio-console device can listen for this in its
epoll loop.
Alternatives I considered were to have the subprocess just send
SIGWINCH to its parent, and to use an eventfd instead of a pipe.
I decided against the signal approach because re-purposing a signal
that has a very specific meaning (even if this use was only slightly
different to its normal meaning) felt unclean, and because it would
have required using pidfds to avoid race conditions if
cloud-hypervisor had terminated, which added complexity. I decided
against using an eventfd because using a pipe instead allows the child
to be notified (via poll(2)) when nothing is reading from the pipe any
more, meaning it can be reliably notified of parent death and
terminate itself immediately.
I used clone3(2) instead of fork(2) because without
CLONE_CLEAR_SIGHAND the subprocess would inherit signal-hook's signal
handlers, and there's no other straightforward way to restore all signal
handlers to their defaults in the child process. The only way to do
it would be to iterate through all possible signals, or maintain a
global list of monitored signals ourselves (vmm:vm::HANDLED_SIGNALS is
insufficient because it doesn't take into account e.g. the SIGSYS
signal handler that catches seccomp violations).
Signed-off-by: Alyssa Ross <hi@alyssa.is>
This prepares us to be able to handle console resizes in the console
device's epoll loop, which we'll have to do if the output is a pty,
since we won't get SIGWINCH from it.
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Musl often uses mmap to allocate memory where Glibc would use brk.
This has caused seccomp violations for me on the API and signal
handling threads.
Signed-off-by: Alyssa Ross <hi@alyssa.is>
error: all if blocks contain the same code at the end
--> vmm/src/memory_manager.rs:884:9
|
884 | / Ok(mm)
885 | | }
| |_________^
Signed-off-by: Bo Chen <chen.bo@intel.com>
This concept ends up being broken with multiple types on input connected
e.g. console on TTY and serial on PTY. Already the code for checking for
injecting into the serial device checks that the serial is configured.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Introduce a common solution for spawning the virtio threads which will
make it easier to add the panic handling.
During this effort I discovered that there were no seccomp filters
registered for the vhost-user-net thread nor the vhost-user-block
thread. This change also incorporates basic seccomp filters for those as
part of the refactoring.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Current AArch64 power button is only for device tree using a PL061
GPIO controller device. Since AArch64 now supports ACPI, this
commit extend the power button on AArch64 to:
- Using GED for ACPI+UEFI boot.
- Using PL061 for device tree boot.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
These statements are useful for understanding the cause of reset or
shutdown of the VM and are not spammy so should be included at info!()
level.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Despite setting up a dedicated thread for signal handling, we weren't
making sure that the signals we were listening for there were actually
dispatched to the right thread. While the signal-hook provides an
iterator API, so we can know that we're only processing the signals
coming out of the iterator on our signal handling thread, the actual
signal handling code from signal-hook, which pushes the signals onto
the iterator, can run on any thread. This can lead to seccomp
violations when the signal-hook signal handler does something that
isn't allowed on that thread by our seccomp policy.
To reproduce, resize a terminal running cloud-hypervisor continuously
for a few minutes. Eventually, the kernel will deliver a SIGWINCH to
a thread with a restrictive seccomp policy, and a seccomp violation
will trigger.
As part of this change, it's also necessary to allow rt_sigreturn(2)
on the signal handling thread, so signal handlers are actually allowed
to run on it. The fact that this didn't seem to be needed before
makes me think that signal handlers were almost _never_ actually
running on the signal handling thread.
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Move the processing of the input from stdin, PTY or file from the VMM
thread to the existing virtio-console thread. The handling of the resize
of a virtio-console has not changed but the name of the struct used to
support that has been renamed to reflect its usage.
Fixes: #3060
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Downcasting of GicDevice trait might fail. Therefore we try to
downcast the trait first and only if the downcasting succeeded we
can then use the object to call methods. Otherwise, do nothing and
log the failure.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
This commit implements the GIC (including both GICv3 and GICv3ITS)
Pausable trait. The pause of device manager will trigger a "pause"
of GIC, where we flush GIC pending tables and ITS tables to the
guest RAM.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
This prevents the boot of the guest kernel from being blocked by
blocking I/O on the serial output since the data will be buffered into
the SerialBuffer.
Fixes: #3004
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Introduce a dynamic buffer for storing output from the serial port. The
SerialBuffer implements std::io::Write and can be used in place of the
direct output for the serial device.
The internals of the buffer is a vector that grows dynamically based on
demand up to a fixed size at which point old data will be overwritten.
Currently the buffer is only flushed upon writes.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Remove the indirection of a dispatch table and simply use the enum as
the event data for the events.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Use two separate events for the console and serial PTY and then drive
the handling of the inputs on the PTY separately. This results in the
correct behaviour when both console and serial are attached to the PTY
as they are triggered separately on the epoll so events are not lost.
Fixes: #3012
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Check the config to find out which device is attached to the tty and
then send the input from the user into that device (serial or
virtio-console.)
Fixes: #3005
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
vhdx_sync.rs in block_util implements traits to represent the vhdx
crate as a supported block device in the cloud hypervisor. The vhdx
is added to the block device list in device_manager.rs at the vmm
crate so that it can automatically detect a vhdx disk and invoke the
corresponding crate.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Signed-off-by: Fazla Mehrab <akm.fazla.mehrab@intel.com>
We are relying on applying empty 'seccomp' filters to support the
'--seccomp false' option, which will be treated as an error with the
updated 'seccompiler' crate. This patch fixes this issue by explicitly
checking whether the 'seccomp' filter is empty before applying the
filter.
Signed-off-by: Bo Chen <chen.bo@intel.com>
It is forbidden that the same memory zone belongs to more than one
NUMA node. This commit adds related validation to the `--numa`
parameter to prevent the user from specifying such configuration.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
The optional device tree node distance-map describes the relative
distance (memory latency) between all NUMA nodes.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
This is to make sure the NUMA node data structures can be accessed
both from the `vmm` crate and `arch` crate.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
The AArch64 platform provides a NUMA binding for the device tree,
which means on AArch64 platform, the NUMA setup can be extended to
more than the ACPI feature.
Based on above, this commit extends the NUMA setup and data
structures to following scenarios:
- All AArch64 platform
- x86_64 platform with ACPI feature enabled
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
Signed-off-by: Michael Zhao <Michael.Zhao@arm.com>
Instead of panicking with an expect() function, the QcowDiskSync::new
function now propagates the error properly. This ensures the VMM will
not panic, which might be the source of weird errors if only one thread
exits while the VMM continues to run.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
We cannot let vhost-user devices connect to the backend when the Block,
Fs or Net object is being created during a restore/migration. The reason
is we can't have two VMs (source and destination) connected to the same
backend at the same time. That's why we must delay the connection with
the vhost-user backend until the restoration is performed.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The code wasn't doing what it was expected to. The '?' was simply
returning the error to the top level function, meaning the Err() case in
the match was never hit. Moving the whole logic to a dedicated function
allows to identify when something got wrong without propagating to the
calling function, so that we can still stop the dirty logging and
unpause the VM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In case the migration succeeds, the destination VM will be correctly
running, with potential vhost-user backends attached to it. We can't let
the source VM trying to reconnect to the same backends, which is why
it's safer to shutdown the source VM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In anticipation for creating vhost-user devices in a different way when
being restored compared to a fresh start, this commit introduces a new
boolean created by the Vm depending on the use case, and passed down to
the DeviceManager. In the future, the DeviceManager will use this flag
to assess how vhost-user devices should be created.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Correct operation of user devices (vfio-user) requires shared memory so
flag this to prevent it from failing in strange ways.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Create the vfio-user / user devices from the config. Currently hotplug
of the devices is not supported nor can they be placed behind the
(virt-)iommu.
Removal of the coldplugged device is however supported.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This allows the user to specify devices that are running in a different
userspace process and communicated with vfio-user.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Allow vsocks to connect to Unix sockets on the host running
cloud-hypervisor with enabled seccomp.
Reported-by: Philippe Schaaf <philippe.schaaf@secunet.com>
Tested-by: Franz Girlich <franz.girlich@tu-ilmenau.de>
Signed-off-by: Markus Theil <markus.theil@tu-ilmenau.de>
The optional Processor Properties Topology Table (PPTT) table is
used to describe the topological structure of processors controlled
by the OSPM, and their shared resources, such as caches. The table
can also describe additional information such as which nodes in the
processor topology constitute a physical package.
The ACPI PPTT table supports topology descriptions for ACPI guests.
Therefore, this commit adds the PPTT table for AArch64 to enable
CPU topology feature for ACPI.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
In an Arm system, the hierarchy of CPUs is defined through three
entities that are used to describe the layout of physical CPUs in
the system:
- cluster
- core
- thread
All these three entities have their own FDT node field. Therefore,
This commit adds an AArch64-specific helper to pass the config from
the Cloud Hypervisor command line to the `configure_system`, where
eventually the `create_fdt` is called.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
Make sure the DeviceManager is triggered for all migration operations.
The dirty pages are merged from MemoryManager and DeviceManager before
to be sent up to the Vmm in lib.rs.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Now that Migratable provides the methods for starting, stopping and
retrieving the dirty pages, we move the existing code to these new
functions.
No functional change intended.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This patch connects the dots between the vm.rs code and each Migratable
device, in order to make sure Migratable methods are correctly invoked
when migration happens.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In anticipation for supporting the merge of multiple dirty pages coming
from multiple devices, this patch factorizes the creation of a
MemoryRangeTable from a bitmap, as well as providing a simple method for
merging the dirty pages regions under a single MemoryRangeTable.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This patch adds all the seccomp rules missing for MSHV.
With this patch MSFT internal CI runs with seccomp enabled.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
This patch adds a fallback path for sending live migration, where it
ensures the following behavior of source VM post live-migration:
1. The source VM will be paused only when the migration is completed
successfully, or otherwise it will keep running;
2. The source VM will always stop dirty pages logging.
Fixes: #2895
Signed-off-by: Bo Chen <chen.bo@intel.com>
This rule is needed to boot windows guest.
This bug was introduced while we tried to boot
windows guest on MSHV.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
This patch modify the existing live migration code
to support MSHV. Adds couple of new functions to enable
and disable dirty page tracking. Add missing IOCTL
to the seccomp rules for live migration.
Adds necessary flags for MSHV.
This changes don't affect KVM functionality at all.
In order to get better performance it is good to
enable dirty page tracking when we start live migration
and disable it when the migration is done.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
Right now, get_dirty_log API has two parameters,
slot and memory_size.
MSHV needs gpa to retrieve the page states. GPA is
needed as MSHV returns the state base on PFN.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
It's totally acceptable to snapshot and restore a virtio-fs device that
has no cache region, since this is a valid mode of functioning for
virtio-fs itself.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
With the new beta version, clippy complains about redundant allocation
when using Arc<Box<dyn T>>, and suggests replacing it simply with
Arc<dyn T>.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
As we are now using an global control to start/stop dirty pages log from
the `hypervisor` crate, we need to explicitly tell the hypervisor (KVM)
whether a region needs dirty page tracking when it is created.
This reverts commit f063346de3.
Signed-off-by: Bo Chen <chen.bo@intel.com>
Following KVM interfaces, the `hypervisor` crate now provides interfaces
to start/stop the dirty pages logging on a per region basis, and asks
its users (e.g. the `vmm` crate) to iterate over the regions that needs
dirty pages log. MSHV only has a global control to start/stop dirty
pages log on all regions at once.
This patch refactors related APIs from the `hypervisor` crate to provide
a global control to start/stop dirty pages log (following MSHV's
behaviors), and keeps tracking the regions need dirty pages log for
KVM. It avoids leaking hypervisor-specific behaviors out of the
`hypervisor` crate.
Signed-off-by: Bo Chen <chen.bo@intel.com>
This patch adds a common function "Vmm::vm_check_cpuid_compatibility()"
to be shared by both live-migration and snapshot/restore.
Signed-off-by: Bo Chen <chen.bo@intel.com>
We now send not only the 'VmConfig' at the 'Command::Config' step of
live migration, but also send the 'common CPUID'. In this way, we can
check the compatibility of CPUID features between the source and
destination VMs, and abort live migration early if needed.
Signed-off-by: Bo Chen <chen.bo@intel.com>
With the support of dynamically turning on/off dirty-pages-log during
live-migration (only for guest RAM regions), we now can create guest
memory regions without dirty-pages-log by default both for guest RAM
regions and other regions backed by file/device.
Signed-off-by: Bo Chen <chen.bo@intel.com>
This patch extends slightly the current live-migration code path with
the ability to dynamically start and stop logging dirty-pages, which
relies on two new methods added to the `hypervisor::vm::Vm` Trait. This
patch also contains a complete implementation of the two new methods
based on `kvm` and placeholders for `mshv` in the `hypervisor` crate.
Fixes: #2858
Signed-off-by: Bo Chen <chen.bo@intel.com>
Whenever a file descriptor is sent through the control message, it
requires fcntl() syscall to handle it, meaning we must allow it through
the list of syscalls authorized for the HTTP thread.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
When running TDX guest, the Guest Physical Address space is limited by
a shared bit that is located on bit 47 for 4 level paging, and on bit 51
for 5 level paging (when GPAW bit is 1). In order to keep things simple,
and since a 47 bits address space is 128TiB large, we ensure to limit
the physical addressable space to 47 bits when runnning TDX.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
When running a TDX guest, we need the virtio drivers to use the DMA API
to share specific memory pages with the VMM on the host. The point is to
let the VMM get access to the pages related to the buffers pointed by
the virtqueues.
The way to force the virtio drivers to use the DMA API is by exposing
the virtio devices with the feature VIRTIO_F_IOMMU_PLATFORM. This is a
feature indicating the device will require some address translation, as
it will not deal directly with physical addresses.
Cloud Hypervisor takes care of this requirement by adding a generic
parameter called "force_iommu". This parameter value is decided based on
the "tdx" feature gate, and then passed to the DeviceManager. It's up to
the DeviceManager to use this parameter on every virtio device creation,
which will imply setting the VIRTIO_F_IOMMU_PLATFORM feature.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This refactoring ensures all CPUID related operations are centralized in
`arch::x86_64` module, and exposes only two related public functions to
the vmm crate, e.g. `generate_common_cpuid` and `configure_vcpu`.
Signed-off-by: Bo Chen <chen.bo@intel.com>
In order to let a separate process open a TAP device and pass the file
descriptor through the control message mechanism, this patch adds the
support for sending a file descriptor over to the Cloud Hypervisor
process along with the add-net HTTP API command.
The implementation uses the NetConfig structure mutably to update the
list of fds with the one passed through control message. The list should
always be empty prior to this, as it makes no sense to provide a list of
fds once the Cloud Hypervisor process has already been started.
It is important to note that reboot is supported since the file
descriptor is duplicated upon receival, letting the VM only use the
duplicated one. The original file descriptor is kept open in order to
support a potential reboot.
Fixes#2525
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The micro-http crate now uses recvmsg() syscall in order to receive file
descriptors through control messages. This means the syscall must be
part of the authorized list in the seccomp filters.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This new option allows the user to define a list of SGX EPC sections
attached to a specific NUMA node.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In order to uniquely identify each SGX EPC section, we introduce a
mandatory option `id` to the `--sgx-epc` parameter.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The guest can see that SGX supports provisioning as it is exposed
through the CPUID. This patch enables the proper backing of this
feature by having the host open the provisioning device and enable
this capability through the hypervisor.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This patch fixes a few things to support TDVF correctly.
The HOB memory resources must contain EFI_RESOURCE_ATTRIBUTE_ENCRYPTED
attribute.
Any section with a base address within the already allocated guest RAM
must not be allocated.
The list of TD_HOB memory resources should contain both TempMem and
TdHob sections as well.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Previously the same function was used to both create and remove regions.
This worked on KVM because it uses size 0 to indicate removal.
MSHV has two calls -- one for creation and one for removal. It also
requires having the size field available because it is not slot based.
Split set_user_memory_region to {create/remove}_user_memory_region. For
KVM they still use set_user_memory_region underneath, but for MSHV they
map to different functions.
This fixes user memory region removal on MSHV.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
The to-be-introduced MSHV rules don't need to contain KVM rules and vice
versa.
Put KVM constants into to a module. This avoids the warnings about
dead code in the future.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
This commit introduces the `ProcessorGiccAffinity` struct for the
AArch64 platform. This struct will be created and included into
the SRAT table to enable AArch64 NUMA setup.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
It ensures all handlers for `ApiRequest` in `control_loop` are
consistent and minimum and should read better.
No functional changes.
Signed-off-by: Bo Chen <chen.bo@intel.com>
It simplifies a bit the `Vmm::control_loop` and reads better to be
consistent with other `ApiRequest` handlers. Also, it removes the
repetitive `ApiError::VmAlreadyCreated` and makes `ApiError::VmCreate`
useful.
No functional changes.
Signed-off-by: Bo Chen <chen.bo@intel.com>
We have been building Cloud Hypervisor with command like:
`cargo build --no-default-features --features ...`.
After implementing ACPI, we donot have to use specify all features
explicitly. Default build command `cargo build` can work.
This commit fixed some build warnings with default build option and
changed github workflow correspondingly.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
error: avoid using `collect()` when not needed
--> vmm/src/vm.rs:630:86
|
630 | let node_id_list: Vec<u32> = configs.iter().map(|cfg| cfg.guest_numa_id).collect();
| ^^^^^^^
...
664 | if !node_id_list.contains(&dest) {
| ---------------------------- the iterator could be used here instead
|
= note: `-D clippy::needless-collect` implied by `-D warnings`
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#needless_collect
Signed-off-by: Bo Chen <chen.bo@intel.com>
Issue from beta verion of clippy:
Error: --> vm-virtio/src/queue.rs:700:59
|
700 | if let Some(used_event) = self.get_used_event(&mem) {
| ^^^^ help: change this to: `mem`
|
= note: `-D clippy::needless-borrow` implied by `-D warnings`
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#needless_borrow
Signed-off-by: Bo Chen <chen.bo@intel.com>
Issue from beta verion of clippy:
error: field is never read: `type`
--> vmm/src/cpu.rs:235:5
|
235 | pub r#type: u8,
| ^^^^^^^^^^^^^^
|
= note: `-D dead-code` implied by `-D warnings`
Signed-off-by: Bo Chen <chen.bo@intel.com>
The Linux kernel expects that any PCI devices that advertise I/O bars
have use an address that is within the range advertised by the bus
itself. Unfortunately we were not advertising any I/O ports associated
with the PCI bus in the ACPI tables.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
In order to allow a hotplugged vCPU to be assigned to the correct NUMA
node in the guest, the DSDT table must expose the _PXM method for each
vCPU. This method defines the proximity domain to which each vCPU should
be attached to.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The _PXM method always return 0, which is wrong since the SRAT might
tell differently. The point of the _PXM method is to be evaluated by the
guest OS when some new memory slot is being plugged, but this will never
happen for Cloud Hypervisor since using NUMA nodes along with memory
hotplug only works for virtio-mem.
Memory hotplug through ACPI will only happen when there's only one NUMA
node exposed to the guest, which means the _PXM method won't be needed
at all.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Make sure the unique PCI bus is tied to the default NUMA node 0, and
update the documentation to let the users know about this special case.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Sometimes we need balloon deflate automatically to give memory
back to guest, especially for some low priority guest processes
under memory pressure. Enable deflate_on_oom to support this.
Usage: --balloon "size=0,deflate_on_oom=on" \
Signed-off-by: Fei Li <lifei.shirley@bytedance.com>
Since using the VIRTIO configuration to expose the virtual IOMMU
topology has been deprecated, the virtio-iommu implementation must be
updated.
In order to follow the latest patchset that is about to be merged in the
upstream Linux kernel, it must rely on ACPI, and in particular the newly
introduced VIOT table to expose the information about the list of PCI
devices attached to the virtual IOMMU.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Implemented an architecture specific function for loading UEFI binary.
Changed the logic of loading kernel image:
1. First try to load the image as kernel in PE format;
2. If failed, try again to load it as formatless UEFI binary.
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
_DSM (Device Specific Method) is a control method that enables devices
to provide device specific control functions. Linux kernel will evaluate
this device then initialize preserve_config in acpi pci initialization.
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
Live migration currently handles guest memory writes from the guest
through the KVM dirty page tracking and sends those dirty pages to the
destination. This patch augments the live migration support with dirty
page tracking of writes from the VMM to the guest memory(e.g. virtio
devices).
Fixes: #2458
Signed-off-by: Bo Chen <chen.bo@intel.com>
Function "GuestMemory::with_regions(_mut)" were mainly temporary methods
to access the regions in `GuestMemory` as the lack of iterator-based
access, and hence they are deprecated in the upstream vm-memory crate [1].
[1] https://github.com/rust-vmm/vm-memory/issues/133
Signed-off-by: Bo Chen <chen.bo@intel.com>
As the first step to complete live-migration with tracking dirty-pages
written by the VMM, this commit patches the dependent vm-memory crate to
the upstream version with the dirty-page-tracking capability. Most
changes are due to the updated `GuestMemoryMmap`, `GuestRegionMmap`, and
`MmapRegion` structs which are taking an additional generic type
parameter to specify what 'bitmap backend' is used.
The above changes should be transparent to the rest of the code base,
e.g. all unit/integration tests should pass without additional changes.
Signed-off-by: Bo Chen <chen.bo@intel.com>
After adding "get_interrupt_controller()" function in DeviceManager,
"enable_interrupt_controller()" became redundant, because the latter
one is the a simple wrapper on the interrupt controller.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
The function used to calculate "gicr-typer" value has nothing with
DeviceManager. Now it is moved to AArch64 specific files.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
We thought we could move the control queue to the backend as it was
making some good sense. Unfortunately, doing so was a wrong design
decision as it broke the compatibility with OVS-DPDK backend.
This is why this commit moves the control queue back to the VMM side,
meaning an additional thread is being run for handling the communication
with the guest.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
On FDT, VMM can allocate IRQ from 0 for devices.
But on ACPI, the lowest range below 32 has to be avoided.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
This commit enables the PSCI (Power State Coordination Interface)
for the AArch64 platform, which allows the VMM to manage the power
status of the guest. Also, multiple vCPUs can be brought up using
PSCI.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
This commit implements the IO Remapping Table (IORT) for AArch64.
The IORT is one of the required ACPI table for AArch64, since
it describes the GICv3ITS node.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
This commit implements an AArch64-required ACPI table: Serial
Port Console Redirection Table (SPCR). The table provides
information about the configuration and use of the serial port
or non-legacy UART interface.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
This commit implements an AArch64-specific ACPI table: Generic
Timer Description Table (GTDT). The GTDT provides OSPM with
information about a system’s Generic Timers configuration.
The Generic Timer (GT) is a standard timer interface implemented
on ARM processor-based systems.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
Added the final PCI bus number in MCFG table. This field is mandatory on
AArch64. On X86 it is optional.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
Simplified definition block of CPU's on AArch64. It is not complete yet.
Guest boots. But more is to do in future:
- Fix the error in ACPI definition blocks (seen in boot messages)
- Implement CPU hot-plug controller
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
In migration, vm object is created by new_from_migration with
NULL kvm clock. so vm.set_clock will not be called during vm resume.
If the guest using kvm-clock, the ticks will be stopped after migration.
As clock was already saved to snapshot, add a method to restore it before
vm resume in migration. after that, guest's kvm-clock works well.
Signed-off-by: Ren Lei <ren.lei4@zte.com.cn>
Connecting a restored KVM clock vm will take long time, as clock
is NOT restored immediately after vm resume from snapshot.
this is because 9ce6c3b incorrectly remove vm_snapshot.clock, and
always pass None to new_from_memory_manager, which will result to
kvm_set_clock() never be called during restore from snapshot.
Fixes: 9ce6c3b
Signed-off-by: Ren Lei <ren.lei4@zte.com.cn>
Now that the control queue is correctly handled by the backend, there's
no need to handle it as well from the VMM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Before this change, the FDT was loaded at the end of RAM. The address of
FDT was not fixed.
While UEFI (edk2 now) requires fixed address to find FDT and RSDP.
Now the FDT is moved to the beginning of RAM, which is a fixed address.
RSDP is wrote to 2 MiB after FDT, also a fixed address.
Kernel comes 2 MiB after RSDP.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
These messages are predominantly during the boot process but will also
occur during events such as hotplug.
These cover all the significant steps of the boot and can be helpful for
diagnosing performance and functionality issues during the boot.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Separate the population of the memory and the HOB from the TDX
initialisation of the memory so that the latter can happen after the CPU
is initialised.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Now all crates use edition = "2018" then the majority of the "extern
crate" statements can be removed. Only those for importing macros need
to remain.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Adding the support for an OVS vhost-user backend to connect as the
vhost-user client. This means we introduce with this patch a new
option to our `--net` parameter. This option is called 'server' in order
to ask the VMM to run as the server for the vhost-user socket.
Fixes#1745
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Create a temporary copy of the config, add the new device and validate
that. This needs to be done separately to adding it to the config to
avoid race conditions that might be result in config changes being
overwritten.
Fixes: #2564
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
To handle that devices are stored in an Option<Vec<T>> and reduce
duplicated code use generic function to add the devices to the the
struct.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The latest kvm-sgx code has renamed sgx_virt_epc device node
to sgx_vepc. Update cloud-hypervisor code and documentation to
follow this.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
Because the http thread no longer needs to create the api socket,
remove the socket, bind and listen syscalls from the seccomp filter.
Signed-off-by: William Douglas <william.douglas@intel.com>
Instead of using the http server's method to have it create the
fd (causing the http thread to need to support the socket, bind and
listen syscalls). Create the socket fd in the vmm thread and use the
http server's new method supporting passing in this fd for the api
socket.
Signed-off-by: William Douglas <william.douglas@intel.com>
To avoid race issues where the api-socket may not be created by the
time a cloud-hypervisor caller is ready to look for it, enable the
caller to pass the api-socket fd directly.
Avoid breaking current callers by allowing the --api-socket path to be
passed as it is now in addition to through the path argument.
Signed-off-by: William Douglas <william.r.douglas@gmail.com>
In order to support using Versionize for state structures it is necessary
to use simpler, primitive, data types in the state definitions used for
snapshot restore.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Windows guests read this field upon PCI device ejection. Let's make sure
we don't return an error as this is valid. We simply return an empty u32
since the ejection is done right away upon write access, which means
there's no pending ejection that might be reported to the guest.
Here is the error that was shown during PCI device removal:
ERROR:vmm/src/device_manager.rs:3960 -- Accessing unknown location at
base 0x7ffffee000, offset 0x8
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Instead of tracking on a block level of 64 pages, we are now collecting
dirty pages one by one. It improves the efficiency of dirty memory
tracking while live migration.
Signed-off-by: Bo Chen <chen.bo@intel.com>
By disabling this KVM feature, we prevent the guest from using APF
(Asynchronous Page Fault) mechanism. The kernel has recently switched to
using interrupts to notify about a page being ready, but for some
reasons, this is causing unexpected behavior with Cloud Hypervisor, as
it will make the vcpu thread spin at 100%.
While investigating the issue, it's better to disable the KVM feature to
prevent 100% CPU usage in some cases.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The original code had a generic type E. It was later replaced by a
concrete type. The code should have been simplified when the replacement
happened.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
The MCRS method returns a 64-bit memory range descriptor. The
calculation is supposed to be done as follows:
max = min + len - 1
However, every operand is represented not as a QWORD but as combination
of two DWORDs for high and low part. Till now, the calculation was done
this way, please see also inline comments:
max.lo = min.lo + len.lo //this may overflow, need to carry over to high
max.hi = min.hi + len.hi
max.hi = max.hi - 1 // subtraction needs to happen on the low part
This calculation has been corrected the following way:
max.lo = min.lo + len.lo
max.hi = min.hi + len.hi + (max.lo < min.lo) // check for overflow
max.lo = max.lo - 1 // subtract from low part
The relevant part from the generated ASL for the MCRS method:
```
Method (MCRS, 1, Serialized)
{
Acquire (MLCK, 0xFFFF)
\_SB.MHPC.MSEL = Arg0
Name (MR64, ResourceTemplate ()
{
QWordMemory (ResourceProducer, PosDecode, MinFixed, MaxFixed, Cacheable, ReadWrite,
0x0000000000000000, // Granularity
0x0000000000000000, // Range Minimum
0xFFFFFFFFFFFFFFFE, // Range Maximum
0x0000000000000000, // Translation Offset
0xFFFFFFFFFFFFFFFF, // Length
,, _Y00, AddressRangeMemory, TypeStatic)
})
CreateQWordField (MR64, \_SB.MHPC.MCRS._Y00._MIN, MINL) // _MIN: Minimum Base Address
CreateDWordField (MR64, 0x12, MINH)
CreateQWordField (MR64, \_SB.MHPC.MCRS._Y00._MAX, MAXL) // _MAX: Maximum Base Address
CreateDWordField (MR64, 0x1A, MAXH)
CreateQWordField (MR64, \_SB.MHPC.MCRS._Y00._LEN, LENL) // _LEN: Length
CreateDWordField (MR64, 0x2A, LENH)
MINL = \_SB.MHPC.MHBL
MINH = \_SB.MHPC.MHBH
LENL = \_SB.MHPC.MHLL
LENH = \_SB.MHPC.MHLH
MAXL = (MINL + LENL) /* \_SB_.MHPC.MCRS.LENL */
MAXH = (MINH + LENH) /* \_SB_.MHPC.MCRS.LENH */
If ((MAXL < MINL))
{
MAXH += One /* \_SB_.MHPC.MCRS.MAXH */
}
MAXL -= One
Release (MLCK)
Return (MR64) /* \_SB_.MHPC.MCRS.MR64 */
}
```
Fixes#1800.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
Fixes the current codebase so that every cargo clippy can be run with
the beta toolchain without any error.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
There are two parts:
- Unconditionally zero the output area. The length of the incoming
vector has been seen from 1 to 4 bytes, even though just the first
byte might need to be handled. But also, this ensures any possibly
unhandled offset will return zeroed result to the caller. The former
implementation used an I/O port which seems to behave differently from
MMIO and wouldn't require explicit output zeroing.
- An access with zero offset still takes place and needs to be handled.
Fixes#2437.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
The following is from the Hyper-V specification v6.0b.
Cpuid leaf 0x40000003 EDX:
Bit 3: Support for physical CPU dynamic partitioning events is
available.
When Windows determines to be running under a hypervisor, it will
require this cpuid bit to be set to support dynamic CPU operations.
Cpuid leaf 0x40000004 EAX:
Bit 5: Recommend using relaxed timing for this partition. If
used, the VM should disable any watchdog timeouts that
rely on the timely delivery of external interrupts.
This bit has been figured out as required after seeing guest BSOD
when CPU hotplug bit is enabled. Race conditions seem to arise after a
hotplug operation, when a system watchdog has expired.
Closes#1799.
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
error: name `GPIOInterruptDisabled` contains a capitalized acronym
Error: --> devices/src/legacy/gpio_pl061.rs:46:5
|
46 | GPIOInterruptDisabled,
| ^^^^^^^^^^^^^^^^^^^^^ help: consider making the acronym lowercase, except the initial letter: `GpioInterruptDisabled`
|
= note: `-D clippy::upper-case-acronyms` implied by `-D warnings`
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#upper_case_acronyms
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
error: name `FinalizeTDX` contains a capitalized acronym
--> vmm/src/vm.rs:274:5
|
274 | FinalizeTDX(hypervisor::HypervisorVmError),
| ^^^^^^^^^^^ help: consider making the acronym lowercase, except the initial letter: `FinalizeTdx`
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#upper_case_acronyms
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
warning: name `AcpiPMTimerDevice` contains a capitalized acronym
--> devices/src/acpi.rs:175:12
|
175 | pub struct AcpiPMTimerDevice {
| ^^^^^^^^^^^^^^^^^ help: consider making the acronym lowercase, except the initial letter: `AcpiPmTimerDevice`
|
= note: `#[warn(clippy::upper_case_acronyms)]` on by default
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#upper_case_acronyms
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
warning: name `IORegion` contains a capitalized acronym
--> pci/src/configuration.rs:320:5
|
320 | IORegion = 0x01,
| ^^^^^^^^ help: consider making the acronym lowercase, except the initial letter (notice the capitalization): `IoRegion`
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#upper_case_acronyms
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
warning: name `LocalAPIC` contains a capitalized acronym
--> vmm/src/cpu.rs:197:8
|
197 | struct LocalAPIC {
| ^^^^^^^^^ help: consider making the acronym lowercase, except the initial letter: `LocalApic`
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#upper_case_acronyms
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
error: name `TYPE_UNKNOWN` contains a capitalized acronym
--> vm-virtio/src/lib.rs:48:5
|
48 | TYPE_UNKNOWN = 0xFF,
| ^^^^^^^^^^^^ help: consider making the acronym lowercase, except the initial letter: `Type_Unknown`
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#upper_case_acronyms
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
error: name `SDT` contains a capitalized acronym
--> acpi_tables/src/sdt.rs:27:12
|
27 | pub struct SDT {
| ^^^ help: consider making the acronym lowercase, except the initial letter: `Sdt`
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#upper_case_acronyms
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Drop the generic type E and use IrqRoutngEntry directly. This allows
dropping a bunch of trait bounds from code.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
Their make_entry functions look the same now. Extract the logic to a
common function.
No functional change.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
There's no need to have the code creating the passthrough_device being
duplicated since we can factorize it in a function used in both cases
(both cold plugged and hot plugged devices VFIO devices).
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Extend and use the existing DeviceTree to retrieve useful information
related to PCI devices. This removes the duplication with pci_devices
field which was internal to the DeviceManager.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Make the code a bit clearer by changing the naming of the structure
holding the list of IRQs reserved for PCI devices. It is also modified
into an array of 32 entries since we know this is the amount of PCI
slots that is supported.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
We define a new enum in order to classify PCI device under virtio or
VFIO. This is a cleaner approach than using the Any trait, and
downcasting it to find the object back.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Introduces a tuple holding both information needed by pci_id_list and
pci_devices.
Changes pci_devices to be a BTreeMap of this new tuple.
Now that pci_devices holds the information needed from pci_id_list,
pci_id_list is no longer needed.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In anticipation for further factorization, the pci_id_list is now a
hashmap of PCI b/d/f leading to each device name.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Instead of relying on a PCI specific device list, we use the DeviceTree
as a reference to determine if a device name is already in use or not.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Only if we have a valid API server path then create the API server. For
now this has no functional change there is a default API server path in
the clap handling but rather prepares to do so optionally.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>