Setting up the SIGWINCH handler requires at least Linux 5.7. However,
this functionality is not required for basic PTY operation.
Fixes: #3456
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Advertise the PCI MMIO config spaces here so that the MMIO config space
is correctly recognised.
Tested by: hotplugging an NVMe vfio-user device with
--platform num_pci_segments=1 or 16; this works correctly with
hypervisor-fw & OVMF and with direct kernel boot.
Fixes: #3432
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
When restoring, replace the internal value of the device tree rather than
replacing the Arc<Mutex<DeviceTree>> itself. This fixes an issue where the
AddressManager has a copy of the original Arc<Mutex<DeviceTree>> from when
the DeviceManager was created. The original restore path only replaced the
DeviceManager's version of the Arc<Mutex<DeviceTree>>. Instead, replace the
contents of the Arc<Mutex<DeviceTree>> so that all users see the updated
version.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Saving the KVM clock and restoring it is key for correct behaviour of
the VM when doing snapshot/restore or live migration. The clock is
restored to the KVM state as part of the Vm::resume() method; prior to
that, it must be extracted from the state object and stored for later use
by this method. This change simplifies the extraction and storage part
so that it is done in the same way for both snapshot/restore and live
migration.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The constant `PCI_MMIO_CONFIG_SIZE` defined in `vmm/pci_segment.rs`
describes the MMIO configuration size for each PCI segment. However,
this name conflicts with the `PCI_MMCONFIG_SIZE` defined in `layout.rs`
in the `arch` crate, which describes the memory size of the PCI MMIO
configuration region.
Therefore, this commit renames the `PCI_MMIO_CONFIG_SIZE` to
`PCI_MMIO_CONFIG_SIZE_PER_SEGMENT` and moves this constant from `vmm`
crate to `arch` crate.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
Currently, a tuple containing the PCI space start address and the PCI
space size is used to pass the PCI space information to the FDT creator.
In order to support multiple PCI segments in the FDT, more information,
such as the PCI segment ID, needs to be passed to the FDT creator.
Keeping all of this information in a tuple would harm the flexibility
and readability of the code.
To address this issue, this commit replaces the tuple containing the PCI
space information with a structure `PciSpaceInfo` and uses a vector of
`PciSpaceInfo` to store the PCI space information for each segment, so
that information for multiple PCI segments can be passed to the FDT
together.
Note that this commit only covers refactoring the existing code; the
actual multiple PCI segment support will come in a following series, and
for now `--platform num_pci_segments` should only be 1.
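For illustration only, a minimal sketch of what such a per-segment
structure could look like (the field names are assumptions, not the
exact definition):

    // Illustrative shape of the per-segment PCI space description.
    pub struct PciSpaceInfo {
        pub pci_segment_id: u16,
        pub mmio_config_address: u64,
        pub pci_device_base: u64,
        pub pci_device_size: u64,
    }

The FDT creator then receives a slice of PciSpaceInfo and can emit one
PCI node per segment instead of unpacking a single (start, size) tuple.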
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
In order to avoid the identity map region conflicting with a possible
firmware placed in the last 4MiB of the 4GiB range, we must set the
address to a chosen location, and it makes the most sense to place this
region right after the TSS region.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Place the 3-page TSS at an explicit location in the 32-bit address space
to avoid conflicting with the loaded raw firmware.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Added fields:
- `Memory address size limit`: the absence of this field triggered
warnings in the guest kernel
- `Node ID`
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
After introducing multiple PCI segments, the `devid` value in
`kvm_irq_routing_entry` exceeds the maximum supported range on AArch64.
This commit restructures the `devid` so that it fits within the allowed
range.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
If the provided binary isn't an ELF binary, assume that it is a firmware
to be loaded directly. In this case we shouldn't program any of the
registers, as KVM already starts in that state.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
That function call can return -1 when it fails. Wrapping -1 in a File
causes the code to panic when the File is dropped.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
I encountered some trouble trying to use a virtio-console hooked up to
a PTY. Reading from the PTY would produce stuff like this
"\n\nsh-5.1# \n\nsh-5.1# " (where I'm just pressing enter at a shell
prompt), and a terminal would render that like this:
----------------------------------------------------------------
sh-5.1#
sh-5.1#
----------------------------------------------------------------
This was because we weren't disabling the ICRNL termios iflag, which
turns carriage returns (\r) into line feeds (\n). Other raw mode
implementations (like QEMU's) set this flag, and don't have this
problem.
Instead of fixing our raw mode implementation to just disable ICRNL, or
copying the flags from QEMU's, I've changed it here to use the raw mode
implementation in libc. It seems to work correctly in my
testing, and means we don't have to worry about what exactly raw mode
looks like under the hood any more.
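For reference, a minimal sketch of switching an fd to raw mode through
libc (assuming the libc crate; this is the general shape, not the exact
code in the console handling):

    use std::os::unix::io::RawFd;

    fn set_raw_mode(fd: RawFd) -> std::io::Result<()> {
        // SAFETY: termios is a plain C struct and tcgetattr fills it in.
        unsafe {
            let mut termios: libc::termios = std::mem::zeroed();
            if libc::tcgetattr(fd, &mut termios) < 0 {
                return Err(std::io::Error::last_os_error());
            }
            // cfmakeraw() clears ICRNL (among other flags), so carriage
            // returns are no longer translated into line feeds.
            libc::cfmakeraw(&mut termios);
            if libc::tcsetattr(fd, libc::TCSANOW, &termios) < 0 {
                return Err(std::io::Error::last_os_error());
            }
        }
        Ok(())
    }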
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Fix seccomp violation when trying to add the out FD to the epoll loop
when the serial buffer needs to be flushed.
0x00007ffff7dc093e in epoll_ctl () at ../sysdeps/unix/syscall-template.S:120
0x0000555555db9b6d in epoll::ctl (epfd=56, op=epoll::ControlOptions::EPOLL_CTL_MOD, fd=55, event=...)
at /home/rob/.cargo/registry/src/github.com-1ecc6299db9ec823/epoll-4.3.1/src/lib.rs:155
0x00005555556f5127 in vmm::serial_buffer::SerialBuffer::add_out_poll (self=0x7fffe800b5d0) at vmm/src/serial_buffer.rs:101
0x00005555556f583d in vmm::serial_buffer::{impl#1}::write (self=0x7fffe800b5d0, buf=...) at vmm/src/serial_buffer.rs:139
0x0000555555a30b10 in std::io::Write::write_all<vmm::serial_buffer::SerialBuffer> (self=0x7fffe800b5d0, buf=...)
at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/io/mod.rs:1527
0x0000555555ab82fb in devices::legacy::serial::Serial::handle_write (self=0x7fffe800b520, offset=0, v=13) at devices/src/legacy/serial.rs:217
0x0000555555ab897f in devices::legacy::serial::{impl#2}::write (self=0x7fffe800b520, _base=1016, offset=0, data=...) at devices/src/legacy/serial.rs:295
0x0000555555f30e95 in vm_device::bus::Bus::write (self=0x7fffe8006ce0, addr=1016, data=...) at vm-device/src/bus.rs:235
0x00005555559406d4 in vmm::vm::{impl#4}::pio_write (self=0x7fffe8009640, port=1016, data=...) at vmm/src/vm.rs:459
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
When running with `--serial pty --console pty --seccomp=false` the
SIGWINCH listener thread would panic as the seccomp filter was empty.
Adopt the mechanism used in the rest of the code and check for a
non-empty filter before trying to apply it.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
With the introduction of a new option `affinity` to the `cpus`
parameter, Cloud Hypervisor can now let the user choose the set
of host CPUs on which each vCPU should run.
This is useful when trying to achieve CPU pinning, as well as making
sure the VM runs on a specific NUMA node.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Give the option parser the ability to handle tuples with inner brackets
containing lists of integers. The following example can now be handled
correctly: "option=[key@[v1-v2,v3,v4]]", which means the option is
assigned a tuple whose key is associated with the list of integers in
the range v1-v2, plus v3 and v4.
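As a rough sketch (hypothetical helper, not the actual parser code),
expanding the inner list into concrete values could look like this:

    fn parse_integer_list(s: &str) -> Result<Vec<usize>, std::num::ParseIntError> {
        let mut values = Vec::new();
        for item in s.split(',') {
            match item.split_once('-') {
                // "v1-v2" expands to every integer in the inclusive range.
                Some((start, end)) => {
                    values.extend(start.parse::<usize>()?..=end.parse::<usize>()?)
                }
                // A single value such as "v3" is pushed as-is.
                None => values.push(item.parse::<usize>()?),
            }
        }
        Ok(values)
    }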
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Because anyhow version 1.0.46 has been yanked, let's move back to the
previous version 1.0.45.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Always properly initialize vectors so that we don't run into undefined
behaviour when the vector gets dropped.
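As a generic illustration of the safe pattern (not the exact vectors
touched by this change):

    use std::io::Read;

    // Zero-initialize the buffer up front instead of creating it with
    // spare capacity and bumping its length, which would expose
    // uninitialized elements.
    fn read_into_new_buffer<R: Read>(
        reader: &mut R,
        len: usize,
    ) -> std::io::Result<Vec<u8>> {
        let mut buf = vec![0u8; len];
        reader.read_exact(&mut buf)?;
        Ok(buf)
    }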
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Creates a new generic type Tuple so that the same implementation of the
FromStr trait can be reused both for parsing a list of two integers and
for parsing a list of one integer associated with a list of integers.
This anticipates the need for retrieving sublists, which will be needed
when trying to describe the host CPU affinity for every vCPU.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The elements of a list should use commas as the delimiter now that this
is supported. Deprecate the use of colons as a delimiter.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This allocator allocates 64-bit MMIO addresses for use with platform
devices, e.g. ACPI control devices, and ensures there is no overlap with
PCI address space ranges, which could cause issues with PCI device
remapping.
Use this allocator for the ACPI platform devices.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Rather than using the system MMIO allocator for RAM, use an allocator
that covers the full RAM range.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This is because the SGX region will be placed between the end of RAM and
the start of the device area.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
With the segment id now encoded in the bdf it is not necessary to have
a separate field for it.
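For illustration, one way such an encoding can work (the exact bit
layout here is an assumption):

    // Segment in the upper 16 bits, then bus, device (5 bits) and
    // function (3 bits), following the standard PCI addressing layout.
    fn encode_segment_bdf(segment: u16, bus: u8, device: u8, function: u8) -> u32 {
        ((segment as u32) << 16)
            | ((bus as u32) << 8)
            | (((device as u32) & 0x1f) << 3)
            | ((function as u32) & 0x7)
    }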
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
In particular, use the accessor for getting the device id from the bdf.
As a side effect, the VIOT table is now segment aware.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Since each segment must have a non-overlapping memory range associated
with it, the device memory must be equally divided amongst all segments.
A new allocator is used for each segment to ensure that BARs are
allocated from the correct address ranges. This requires changes to
PciDevice::allocate/free_bars to take that allocator, and when
reallocating BARs the correct allocator must be identified from the
ranges.
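A minimal sketch of the equal split, with illustrative names:

    // Each segment gets a disjoint, equally sized slice of the device
    // (64-bit MMIO) area for its BAR allocator to hand out addresses from.
    fn segment_mmio_range(
        device_area_start: u64,
        device_area_size: u64,
        num_segments: u16,
        segment_id: u16,
    ) -> (u64, u64) {
        let size_per_segment = device_area_size / num_segments as u64;
        let start = device_area_start + size_per_segment * segment_id as u64;
        (start, size_per_segment)
    }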
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
For all the devices that support being hotplugged (disk, net, pmem, fs
and vsock) add a "pci_segment" option and propagate it through to the
addition onto the PCI buses.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Move the decision on whether to use a 64-bit bar up to the DeviceManager
so that it can use both the device type (e.g. block) and the PCI segment
ID to decide what size bar should be used.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Generate a set of 8 IRQs and round-robin distribute those over all the
slots for a bus. This same set of IRQs is then used for all PCI
segments.
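In other words (sketch with illustrative names):

    const NUM_PCI_IRQS: usize = 8;

    // The same 8 IRQs are shared by the slots of a bus (and reused for
    // every segment), distributed round-robin by slot number.
    fn irq_for_slot(irqs: &[u32; NUM_PCI_IRQS], slot: usize) -> u32 {
        irqs[slot % NUM_PCI_IRQS]
    }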
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The platform config may specify a number of PCI segments to use; if this
is greater than 1 then we add supplemental PCI segments as well as the
default segment.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This currently contains only the number of PCI segments to create. This
is limited to 16 at the moment, which should allow 496 user-specified
PCI devices (31 slots on each of the 16 segments).
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
For bus scanning, the GED AML code now calls into a PSCN method that
scans all buses. This approach was chosen since it correctly handles the
case where one GED interrupt is serviced for two hotplugs on distinct
segments.
The PCIU and PCID field values are now determined by the PSEG field,
which is used to select which segment those values apply to.
Similarly, _EJ0 will notify based on the value of _SEG.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Replace the hardcoded zero PCI segment id when adding devices to the bus
and extend the DeviceTree to hold the PCI segment id.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Since each segment must have a disjoint address space, only advertise
address space in the 32-bit range and the PIO address space on the
default (zero) segment.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Split PciSegment::new_default_segment() into a separate
PciSegment::new() and those parts required only for the default segment
(PIO PCI config device.)
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
For now this still contains just one segment, but it is structured in
preparation for more segments.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This commit moves the code that generates the DSDT data for the PCI bus
into PciSegment making no functional changes to the generated AML.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Variables that start with underscore are used to silence rustc.
Normally those variables are not used in code.
This patch drops the underscore from variables that are used. This is
less confusing to readers.
No functional change intended.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
This makes it much easier to use since the info!() level produces far
fewer messages and thus has less overhead.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Relying on the vm-virtio/virtio-queue crate from rust-vmm, which has been
copied into the Cloud Hypervisor tree, the entire codebase is moved to
the new definition of a Queue and other related structures.
The reason for this move is to follow upstream until we reach some
agreement on the patches that we need on top of it to make it work
properly with Cloud Hypervisor.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This was causing some issues because of the use of 2 different versions
of the vm-memory crate. We'll wait for all dependencies to be properly
resolved before we move to 0.7.0.
This reverts commit 76b6c62d07.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
When allocating PCI MMIO BARs they should always be naturally aligned
(i.e. aligned to the size of the BAR itself.)
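A small sketch of what naturally aligned means for the allocation:

    // A BAR of size `size` (a power of two) must start at an address
    // that is a multiple of `size`.
    fn align_up_to_bar_size(addr: u64, size: u64) -> u64 {
        debug_assert!(size.is_power_of_two());
        (addr + size - 1) & !(size - 1)
    }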
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The allocate_bars method has a side effect which collates the BARs used
for the device and stores them internally. Ensure that any use of this
internal state happens after the state has been created, otherwise no
MMIO regions will be seen and so none will be mapped.
Fixes: #3237
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Instead of creating a MemoryManager from scratch, let's reuse the same
code path used by snapshot/restore, so that memory regions are created
identically to what they were on the source VM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Now that all the pieces are in place, we can restore a VM with the new
codepath that properly restores all memory regions, allowing ACPI memory
hotplug to work properly with the snapshot/restore feature.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Extending the MemoryManager::new() function to be able to create a
MemoryManager from data that has been previously stored, instead of
always creating everything from scratch.
This change brings real added value as it allows a VM to be restored
with the proper memory layout, instead of hoping the regions will be
created the way they were before.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Store several pieces of data coming from the MemoryManager in order to
be able to restore without creating everything from scratch.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This new function will be able to restore memory regions and memory
zones based on the GuestMemoryMapping list that will be provided through
snapshot/restore and migration phases.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This helps identify which zone relates to which memory range.
This is going to be useful when recreating GuestMemory regions from
the previous layout instead of having to recreate everything from
scratch.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Create a dedicated function to factorize the allocation of the memory
ranges, helping simplify the MemoryManager::new() function.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
By updating the list of GuestMemory regions with the virtio-mem ones
before the creation of the MemoryManager, we know the GuestMemory is up
to date and the allocation of memory ranges is simplified afterwards.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In order to simplify the MemoryManager::new() function, let's move the
memory configuration validation to its own function.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Move the PciSegment struct and the associated code to a new file. This
will allow some clearer separation between the core DeviceManager and
PCI handling.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Move the PCI related state from the DeviceManager struct to a PciSegment
struct inside the DeviceManager. This is in preparation for multiple
segment support. Currently this state is just the bus itself, the MMIO
and PIO config devices and hotplug related state.
The main change that this required is using the Arc<Mutex<PciBus>> in
the device addition logic in order to ensure that
the bus could be created earlier.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
When using PVH for booting (which we use for all firmwares and direct
kernel boot) the Linux kernel does not configure LA57 correctly. As such
we need to limit the address space to the maximum 4-level paging address
space.
If the user knows that their guest image can take advantage of the
5-level addressing and they need it for their workload then they can
increase the physical address space appropriately.
This PR removes the TDX-specific handling as the new address space limit
is below the one that code specified.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Whenever running TDX, we must pass the ACPI tables to the TDVF firmware
running in the guest. The proper way to do this is by adding the tables
to the TdHob as a TdVmmData type, so that TDVF will know how to access
these tables and expose them to the guest OS.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Instead of having the ACPI tables created in both the x86_64 and
aarch64 implementations of configure_system(), we can remove the
duplicated code by moving the ACPI table creation into the boot()
function in vm.rs.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The argument `prefault` is provided in MemoryManager, but it can
only be used by SGX and restore.
With prefault (MAP_POPULATE) set, subsequent page faults decrease at
runtime, although it makes boot slower.
This commit adds `prefault` to MemoryConfig and MemoryZoneConfig.
To resolve the conflict between the memory config and restore, the
`prefault` argument has been changed from `bool` to `Option<bool>`: when
its value is None, the config from the memory settings is used,
otherwise the value in the Option is used.
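A one-line sketch of the resolution logic, with illustrative names:

    // None means "no override": fall back to the value from the memory
    // configuration.
    fn effective_prefault(override_prefault: Option<bool>, config_prefault: bool) -> bool {
        override_prefault.unwrap_or(config_prefault)
    }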
Signed-off-by: Yu Li <liyu.yukiteru@bytedance.com>
By using a single file for storing the memory ranges, we simplify the
way snapshot/restore works by avoiding multiple files, but the more
important point is that we now have a way to save only the ranges that
matter. In particular, the ranges related to virtio-mem regions are
not always fully hotplugged, meaning we don't want to save the entire
region. That's where the usage of memory ranges is interesting as it
lets us optimize the snapshot/restore process when one or multiple
virtio-mem regions are involved.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The function memory_range_table() will be reused by the MemoryManager in
a following patch to describe all the ranges that we should snapshot.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Copy only the memory ranges that have been plugged through virtio-mem,
allowing for an interesting optimization regarding the time it takes to
migrate a large virtio-mem device. Even if the hotpluggable space is
very large (say 64GiB), if only 1GiB has been previously added to the
VM, only 1GiB will be sent to the destination VM, avoiding the transfer
of the remaining 63GiB which are unused.
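A rough sketch of how plugged 2MiB blocks could be coalesced into ranges
(illustrative names; the real code builds a MemoryRangeTable):

    const BLOCK_SIZE: u64 = 2 << 20; // 2MiB blocks, as used by virtio-mem

    // Coalesce contiguous plugged blocks into (start, length) ranges so
    // that only the plugged memory is transferred.
    fn plugged_ranges(region_start: u64, plugged: &[bool]) -> Vec<(u64, u64)> {
        let mut ranges: Vec<(u64, u64)> = Vec::new();
        for (i, &is_plugged) in plugged.iter().enumerate() {
            if !is_plugged {
                continue;
            }
            let start = region_start + i as u64 * BLOCK_SIZE;
            // Extend the previous range when this block is contiguous.
            if let Some(last) = ranges.last_mut() {
                if last.0 + last.1 == start {
                    last.1 += BLOCK_SIZE;
                    continue;
                }
            }
            ranges.push((start, BLOCK_SIZE));
        }
        ranges
    }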
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
By creating the BlocksState object in the MemoryManager, we can directly
provide it to the virtio-mem device when it is created. This allows the
MemoryManager, through each VirtioMemZone, to have a handle onto the
blocks that are plugged at any point in time.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This will be helpful to support the creation of a MemoryRangeTable from
virtio-mem, as it uses 2M pages.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Adding snapshot/restore support, along with migration support, allowing
a VM with virtio-mem devices attached to be properly migrated.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The amount of memory plugged in the virtio-mem region should always be
kept up to date in the hotplugged_size field from VirtioMemZone.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
There's no need to duplicate the GuestMemory for snapshot purposes, as we
always have a handle onto the GuestMemory through the guest_memory
field.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Since we only support a single PCI bus right now advertise only a single
bus in the ACPI tables. This reduces the number of VM exits from probing
substantially.
Number of PCI config I/O port exits: 17871 -> 1551 (91% reduction) with
direct kernel boot.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Use a simpler method for extracting the affected slot on the eject
command. Also update the terminology to reflect that this is a slot
rather than a bdf (which is what device id refers to elsewhere).
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Refactor the serial buffer handling so that the serial buffer's output
is written out to a PTY that is connected after the guest has stopped
writing to the serial device.
This change moves the serial buffer initialization inside the serial
manager. That is done to allow the serial buffer to be made aware of
the PTY and epoll fds needed in order to modify the
EpollDispatch::File trigger. These are then used by the serial buffer
to trigger an epoll event when the PTY fd is writable and the buffer
has content in it. They are also used to remove the trigger when the
buffer is emptied in order to avoid unnecessary wake-ups.
Signed-off-by: William Douglas <william.douglas@intel.com>
Both the read_exact_from() and write_all_to() functions from the
GuestMemory trait implementation in vm-memory are buggy. They should
retry until they have read or written the expected amount of data, but
instead they simply return an error on a partial read or write. This
causes the migration to fail when sending a large amount of data through
the migration socket, due to large memory regions.
This should be eventually fixed in vm-memory, and here is the link to
follow up on the issue: https://github.com/rust-vmm/vm-memory/issues/174
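For context, the generic pattern these functions should follow (a plain
std::io illustration of retrying until everything is written, not the
vm-memory API itself):

    use std::io::{self, Write};

    fn write_all<W: Write>(writer: &mut W, mut buf: &[u8]) -> io::Result<()> {
        while !buf.is_empty() {
            match writer.write(buf) {
                // A short write is not an error: retry with the remainder.
                Ok(n) if n > 0 => buf = &buf[n..],
                Ok(_) => return Err(io::ErrorKind::WriteZero.into()),
                Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
                Err(e) => return Err(e),
            }
        }
        Ok(())
    }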
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Implement the infrastructure that lets a virtio-mem device map the guest
memory into the device. This is necessary since, with virtio-mem zones,
memory can be added or removed and the vfio-user device must be
informed.
Fixes: #3025
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
By moving this from the VfioUserPciDevice to DeviceManager the client
can be reused for handling DMA mapping behind an IOMMU.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
For vfio-user the mapping handler is per device and needs to be removed
when the device is unplugged.
For VFIO the mapping handler is for the default VFIO container (used
when no vIOMMU is used; using a vIOMMU does not require mappings with
virtio-mem).
To represent these two use cases, use an enum for the handlers that are
stored.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Looking up devices on the port I/O bus is time consuming during boot,
as there is an O(lg n) tree lookup and the overhead of taking a lock on
the bus contents.
Avoid this by adding a fast path that uses the hardcoded port address
and size and directs PCI config requests straight to the device.
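A hedged sketch of the dispatch (the closures stand in for the real
device and bus calls):

    // Requests on the well-known PCI config I/O ports bypass the bus tree
    // lookup and its lock; everything else takes the normal bus walk.
    fn pio_write_fast_path(
        port: u64,
        data: &[u8],
        write_pci_config_io: impl Fn(u64, &[u8]), // offset within 0xcf8..=0xcff
        write_io_bus: impl Fn(u64, &[u8]),        // full port address
    ) {
        if (0xcf8..=0xcff).contains(&port) {
            write_pci_config_io(port - 0xcf8, data);
        } else {
            write_io_bus(port, data);
        }
    }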
Command line:
target/release/cloud-hypervisor --kernel ~/src/linux/vmlinux --cmdline "root=/dev/vda1 console=ttyS0" --serial tty --console off --disk path=~/workloads/focal-server-cloudimg-amd64-custom-20210609-0.raw --api-socket /tmp/api
PIO exit: 17913
PCI fast path: 17871
Percentage on fast path: 99.8%
perf before:
marvin:~/src/cloud-hypervisor (main *)$ perf report -g | grep resolve
6.20% 6.20% vcpu0 cloud-hypervisor [.] vm_device::bus::Bus::resolve
perf after:
marvin:~/src/cloud-hypervisor (2021-09-17-ioapic-fast-path *)$ perf report -g | grep resolve
0.08% 0.08% vcpu0 cloud-hypervisor [.] vm_device::bus::Bus::resolve
The compromise required to implement this fast path is bringing the
creation of the PciConfigIo device into the DeviceManager::new() so that
it can be used in the VmmOps struct which is created before
DeviceManager::create_devices() is called.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>