This commit adds event fds and the event handler to send/receive
requests and responses from the GDB thread. It also adds `--gdb` flag to
enable GDB stub feature.
Signed-off-by: Akira Moroo <retrage01@gmail.com>
This commit adds `stop_on_boot` to `Vm` so that the VM stops before
starting on boot requested. This change is required to keep the target
VM stopped before a debugger attached as the user expected.
Signed-off-by: Akira Moroo <retrage01@gmail.com>
This commit adds `Vm::debug_request` to handle `GdbRequestPayload`,
which will be sent from the GDB thread.
Signed-off-by: Akira Moroo <retrage01@gmail.com>
This commit adds initial gdb.rs implementation for `Debuggable` trait to
describe a debuggable component. Some part of the trait bound
implementations is based on the crosvm GDB stub code [1].
[1] https://github.com/google/crosvm/blob/main/src/gdb.rs
Signed-off-by: Akira Moroo <retrage01@gmail.com>
This commit adds `VmState::BreakPoint` to handle hardware breakpoint.
The VM will enter this state when a breakpoint hits or a debugger
interrupts the execution.
Signed-off-by: Akira Moroo <retrage01@gmail.com>
Instead of doing the validation of the configuration change as part of
the vm, let's do this in the uper layer, in the Vmm.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Let's move add_to_config to config.rs so it can be used from both inside
and outside of the vm.rs file.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
In order to allow for human readable output for the VM configuration, we
pull it out of the snapshot, which becomes effectively the list of
states from the VM. The configuration is stored through a dedicated file
in JSON format (not including any binary output).
Having the ability to read and modify the VM configuration manually
between the snapshot and restore phases makes debugging easier, as well
as empowers users for extending the use cases relying on the
snapshot/restore feature.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
If a payload is found in the TDVF section, and after it's been copied to
the guest memory, make sure to create the corresponding TdPayload
structure and insert it through the HOB.
Signed-off-by: Jiaqi Gao <jiaqi.gao@intel.com>
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In case of TDX, if a kernel and/or a command line are provided by the
user, they can't be treated the same way as for the non-TDX case. That
is why this patch ensures the function load_kernel() is only invoked for
the non-TDX case.
For the TDX case, whenever TDVF contains a Payload and/or PayloadParam
sections, the file provided through --kernel and the parameters provided
through --cmdline are copied at the locations specified by each TDVF
section.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In order to clearly decouple when the migration is started compared to
when the dirty logging is started, we introduce a new method to the
Migratable trait. This clarifies the semantics as we don't end up using
start_dirty_log() for identifying when the migration has been started.
And similarly, we rely on the already existing complete_migration()
method to know when the migration has been ended.
A bug was reported when running a local migration with a vhost-user-net
device in server mode. The reason was because the migration_started
variable was never set to "true", since the start_dirty_log() function
was never invoked.
Signed-off-by: lizhaoxin1 <Lxiaoyouling@163.com>
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
If a PMU is enabled in a VM, we also need to initialize the PMU
when the VM is restored. Otherwise, vCPUs cannot be started after
the VM is restored.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
Relying on helpers for creating the ACPI tables and to add each table to
the HOB, this patch connects the dot to provide the set of ACPI tables
to the TD firmware.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In case of TDX, we don't want to create the ACPI tables the same way we
do for all the other use cases. That's because the ACPI tables don't
need to be written to guest memory at a specific address, instead they
are passed directly through the HOB.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
It's been decided the ACPI tables will be passed to the firmware in a
different way, rather than using TD_VMM_DATA. Since TD_VMM_DATA was
introduced for this purpose, there's no reason to keep it in our
codebase.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Ensure all pending virtio activations (as triggered by MMIO write on the
vCPU threads leading to a barrier wait) are completed before pausing the
vCPUs as otherwise there will a deadlock with the VMM waiting for the
vCPU to acknowledge it's pause and the vCPU waiting for the VMM to
activate the device and release the barrier.
Fixes: #3585
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
When in local migration mode send the FDs for the guest memory over the
socket along with the slot that the FD is associated with. This removes
the requirement for copying the guest RAM and gives significantly faster
live migration performance (of the order of 3s to 60ms).
Fixes: #3566
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Create the VM using the FDs (wrapped in Files) that have been received
during the migration process.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
It's important to maintain the ability to run in a non-TDX environment
a Cloud Hypervisor binary with the 'tdx' feature enabled.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Saving the KVM clock and restoring it is key for correct behaviour of
the VM when doing snapshot/restore or live migration. The clock is
restored to the KVM state as part of the Vm::resume() method prior to
that it must be extracted from the state object and stored for later use
by this method. This change simplifies the extraction and storage part
so that it is done in the same way for both snapshot/restore and live
migration.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Currently, a tuple containing PCI space start address and PCI space
size is used to pass the PCI space information to the FDT creator.
In order to support the multiple PCI segment for FDT, more information
such as the PCI segment ID should be passed to the FDT creator. If we
still use a tuple to store these information, the code flexibility and
readablity will be harmed.
To address this issue, this commit replaces the tuple containing the
PCI space information to a structure `PciSpaceInfo` and uses a vector
of `PciSpaceInfo` to store PCI space information for each segment, so
that multiple PCI segment information can be passed to the FDT together.
Note that the scope of this commit will only contain the refactor of
original code, the actual multiple PCI segments support will be in
following series, and for now `--platform num_pci_segments` should only
be 1.
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
In order to avoid the identity map region to conflict with a possible
firmware being placed in the last 4MiB of the 4GiB range, we must set
the address to a chosen location. And it makes the most sense to have
this region placed right after the TSS region.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Place the 3 page TSS at an explicit location in the 32-bit address space
to avoid conflicting with the loaded raw firmware.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
If the provided binary isn't an ELF binary assume that it is a firmware
to be loaded in directly. In this case we shouldn't program any of the
registers as KVM starts in that state.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
In particular use the accessor for getting the device id from the bdf.
As a side effect the VIOT table is now segment aware.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Variables that start with underscore are used to silence rustc.
Normally those variables are not used in code.
This patch drops the underscore from variables that are used. This is
less confusing to readers.
No functional change intended.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
This makes it much easier to use since the info!() level produces far
fewer messages and thus has less overhead.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Instead of creating a MemoryManager from scratch, let's reuse the same
code path used by snapshot/restore, so that memory regions are created
identically to what they were on the source VM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Extending the MemoryManager::new() function to be able to create a
MemoryManager from data that have been previously stored instead of
always creating everything from scratch.
This change brings real added value as it allows a VM to be restored
respecting the proper memory layout instead of hoping the regions will
be created the way they were before.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
When using PVH for booting (which we use for all firmwares and direct
kernel boot) the Linux kernel does not configure LA57 correctly. As such
we need to limit the address space to the maximum 4-level paging address
space.
If the user knows that their guest image can take advantage of the
5-level addressing and they need it for their workload then they can
increase the physical address space appropriately.
This PR removes the TDX specific handling as the new address space limit
is below the one that that code specified.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Whenever running TDX, we must pass the ACPI tables to the TDVF firmware
running in the guest. The proper way to do this is by adding the tables
to the TdHob as a TdVmmData type, so that TDVF will know how to access
these tables and expose them to the guest OS.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Instead of having the ACPI tables being created both in x86_64 and
aarch64 implementations of configure_system(), we can remove the
duplicated code by moving the ACPI tables creation in vm.rs inside the
boot() function.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The argument `prefault` is provided in MemoryManager, but it can
only be used by SGX and restore.
With prefault (MAP_POPULATE) been set, subsequent page faults will
decrease during running, although it will make boot slower.
This commit adds `prefault` in MemoryConfig and MemoryZoneConfig.
To resolve conflict between memory and restore, argument
`prefault` has been changed from `bool` to `Option<bool>`, when
its value is None, config from memory will be used, otherwise
argument in Option will be used.
Signed-off-by: Yu Li <liyu.yukiteru@bytedance.com>
By using a single file for storing the memory ranges, we simplify the
way snapshot/restore works by avoiding multiples files, but the main and
more important point is that we have now a way to save only the ranges
that matter. In particular, the ranges related to virtio-mem regions are
not always fully hotplugged, meaning we don't want to save the entire
region. That's where the usage of memory ranges is interesting as it
lets us optimize the snapshot/restore process when one or multiple
virtio-mem regions are involved.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The function memory_range_table() will be reused by the MemoryManager in
a following patch to describe all the ranges that we should snapshot.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Copy only the memory ranges that have been plugged through virtio-mem,
allowing for an interesting optimization regarding the time it takes to
migrate a large virtio-mem device. Even if the hotpluggable space is
very large (say 64GiB), if only 1GiB has been previously added to the
VM, only 1GiB will be sent to the destination VM, avoiding the transfer
of the remaining 63GiB which are unused.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Both read_exact_from() and write_all_to() functions from the GuestMemory
trait implementation in vm-memory are buggy. They should retry until
they wrote or read the amount of data that was expected, but instead
they simply return an error when this happens. This causes the migration
to fail when trying to send important amount of data through the
migration socket, due to large memory regions.
This should be eventually fixed in vm-memory, and here is the link to
follow up on the issue: https://github.com/rust-vmm/vm-memory/issues/174
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Looking up devices on the port I/O bus is time consuming during the
boot at there is an O(lg n) tree lookup and the overhead from taking a
lock on the bus contents.
Avoid this by adding a fast path uses the hardcoded port address and
size and directs PCI config requests directly to the device.
Command line:
target/release/cloud-hypervisor --kernel ~/src/linux/vmlinux --cmdline "root=/dev/vda1 console=ttyS0" --serial tty --console off --disk path=~/workloads/focal-server-cloudimg-amd64-custom-20210609-0.raw --api-socket /tmp/api
PIO exit: 17913
PCI fast path: 17871
Percentage on fast path: 99.8%
perf before:
marvin:~/src/cloud-hypervisor (main *)$ perf report -g | grep resolve
6.20% 6.20% vcpu0 cloud-hypervisor [.] vm_device:🚌:Bus::resolve
perf after:
marvin:~/src/cloud-hypervisor (2021-09-17-ioapic-fast-path *)$ perf report -g | grep resolve
0.08% 0.08% vcpu0 cloud-hypervisor [.] vm_device:🚌:Bus::resolve
The compromise required to implement this fast path is bringing the
creation of the PciConfigIo device into the DeviceManager::new() so that
it can be used in the VmmOps struct which is created before
DeviceManager::create_devices() is called.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Add a virtio-iommu node into FDT if iommu option is turned on. Now we
support only one virtio-iommu device.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
This change switches from handling serial input in the VMM thread to
its own thread controlled by the SerialManager.
The motivation for this change is to avoid the VMM thread being unable
to process events while serial input is happening and vice versa.
The change also makes future work flushing the serial buffer on PTY
connections easier.
Signed-off-by: William Douglas <william.douglas@intel.com>
When a pty is resized (using the TIOCSWINSZ ioctl -- see ioctl_tty(2)),
the kernel will send a SIGWINCH signal to the pty's foreground process
group to notify it of the resize. This is the only way to be notified
by the kernel of a pty resize.
We can't just make the cloud-hypervisor process's process group the
foreground process group though, because a process can only set the
foreground process group of its controlling terminal, and
cloud-hypervisor's controlling terminal will often be the terminal the
user is running it in. To work around this, we fork a subprocess in a
new process group, and set its process group to be the foreground
process group of the pty. The subprocess additionally must be running
in a new session so that it can have a different controlling
terminal. This subprocess writes a byte to a pipe every time the pty
is resized, and the virtio-console device can listen for this in its
epoll loop.
Alternatives I considered were to have the subprocess just send
SIGWINCH to its parent, and to use an eventfd instead of a pipe.
I decided against the signal approach because re-purposing a signal
that has a very specific meaning (even if this use was only slightly
different to its normal meaning) felt unclean, and because it would
have required using pidfds to avoid race conditions if
cloud-hypervisor had terminated, which added complexity. I decided
against using an eventfd because using a pipe instead allows the child
to be notified (via poll(2)) when nothing is reading from the pipe any
more, meaning it can be reliably notified of parent death and
terminate itself immediately.
I used clone3(2) instead of fork(2) because without
CLONE_CLEAR_SIGHAND the subprocess would inherit signal-hook's signal
handlers, and there's no other straightforward way to restore all signal
handlers to their defaults in the child process. The only way to do
it would be to iterate through all possible signals, or maintain a
global list of monitored signals ourselves (vmm:vm::HANDLED_SIGNALS is
insufficient because it doesn't take into account e.g. the SIGSYS
signal handler that catches seccomp violations).
Signed-off-by: Alyssa Ross <hi@alyssa.is>