Storing a strong reference to the AddressManager behind the
DeviceRelocation trait results in a cyclic reference count.
Use a weak reference to break that dependency.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Based on the value being written to the BAR, the implementation can
now detect if the BAR is being moved to another address. If that is the
case, it invokes move_bar() function from the DeviceRelocation trait.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In order to trigger the PCI BAR reprogramming from PciConfigIo and
PciConfigMmmio, we need the PciBus to have a hold onto the trait
implementation of DeviceRelocation.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
By implementing the DeviceRelocation trait for the AddressManager
structure, we now have a way to let the PCI BAR reprogramming happen.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In order to reuse the SystemAllocator later at runtime, it is moved into
the new structure AddressManager. The goal is to have a hold onto the
SystemAllocator and both IO and MMIO buses so that we can use them
later.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In case a VFIO devices is being attached behind a virtual IOMMU, we
should not automatically map the entire guest memory for the specific
device.
A VFIO device attached to the virtual IOMMU will be driven with IOVAs,
hence we should simply wait for the requests coming from the virtual
IOMMU.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
When VFIO devices are created and if the device is attached to the
virtual IOMMU, the ExternalDmaMapping trait implementation is created
and associated with the device. The idea is to build a hash map of
device IDs with their associated trait implementation.
This hash map is provided to the virtual IOMMU device so that it knows
how to properly trigger external mappings associated with VFIO devices.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
With this implementation of the trait ExternalDmaMapping, we now have
the tool to provide to the virtual IOMMU to trigger the map/unmap on
behalf of the guest.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The VFIO container is the object needed to update the VFIO mapping
associated with a VFIO device. This patch allows the device manager
to have access to the VFIO container.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This patch attaches VFIO devices to the virtual IOMMU if they are
identified as they should be, based on the option "iommu=on". This
simply takes care of adding the PCI device ID to the ACPI IORT table.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Having the virtual IOMMU created with --iommu is one thing, but we also
need a way to decide if a VFIO device should be attached to the virtual
IOMMU or not. That's why we introduce an extra option "iommu" with the
value "on" or "off". By default, the device is not attached, which means
"iommu=off".
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Fix invalid type for version:
- VmInfo.version.type string
Change Null value from enum as it has problems to build clients with
openapi tools.
- ConsoleConfig.mode.enum Null -> Nil
Signed-off-by: Jose Carlos Venegas Munoz <jose.carlos.venegas.munoz@intel.com>
We should return an explicit error when the transition from on VM state
to another is invalid.
The valid_transition() routine for the VmState enum essentially
describes the VM state machine.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
In order to pause a VM, we signal all the vCPU threads to get them out
of vmx non-root. Once out, the vCPU thread will check for a an atomic
pause boolean. If it's set to true, then the thread will park until
being resumed.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
So that we don't need to forward an ExitBehaviour up to the VMM thread.
This simplifies the control loop and the VMM thread even further.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
This commit is the glue between the virtio-pci devices attached to the
vIOMMU, and the IORT ACPI table exposing them to the guest as sitting
behind this vIOMMU.
An important thing is the trait implementation provided to the virtio
vrings for each device attached to the vIOMMU, as they need to perform
proper address translation before they can access the buffers.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Having the virtual IOMMU created with --iommu is one thing, but we also
need a way to decide if a virtio-vsock device should be attached to this
virtual IOMMU or not. That's why we introduce an extra option "iommu"
with the value "on" or "off". By default, the device is not attached,
which means "iommu=off".
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Having the virtual IOMMU created with --iommu is one thing, but we also
need a way to decide if a virtio-console device should be attached to
this virtual IOMMU or not. That's why we introduce an extra option
"iommu" with the value "on" or "off". By default, the device is not
attached, which means "iommu=off".
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Having the virtual IOMMU created with --iommu is one thing, but we also
need a way to decide if a virtio-pmem device should be attached to this
virtual IOMMU or not. That's why we introduce an extra option "iommu"
with the value "on" or "off". By default, the device is not attached,
which means "iommu=off".
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Having the virtual IOMMU created with --iommu is one thing, but we also
need a way to decide if a virtio-rng device should be attached to this
virtual IOMMU or not. That's why we introduce an extra option "iommu"
with the value "on" or "off". By default, the device is not attached,
which means "iommu=off".
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Having the virtual IOMMU created with --iommu is one thing, but we also
need a way to decide if a virtio-net device should be attached to this
virtual IOMMU or not. That's why we introduce an extra option "iommu"
with the value "on" or "off". By default, the device is not attached,
which means "iommu=off".
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Having the virtual IOMMU created with --iommu is one thing, but we also
need a way to decide if a virtio-blk device should be attached to this
virtual IOMMU or not. That's why we introduce an extra option "iommu"
with the value "on" or "off". By default, the device is not attached,
which means "iommu=off".
One side effect of this new option is that we had to introduce a new
option for the disk path, simply called "path=".
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Adding a simple iommu boolean field to the VmConfig structure so that we
can later use it to create a virtio-iommu device for the current VM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The virtual IOMMU exposed through virtio-iommu device has a dependency
on ACPI. It needs to expose the device ID of the virtio-iommu device,
and all the other devices attached to this virtual IOMMU. The IDs are
expressed from a PCI bus perspective, based on segment, bus, device and
function.
The guest relies on the topology description provided by the IORT table
to attach devices to the virtio-iommu device.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In case some virtio devices are attached to the virtual IOMMU, their
vring addresses need to be translated from IOVA into GPA. Otherwise it
makes no sense to try to access them, and they would cause out of range
errors.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Adding virtio feature VIRTIO_F_IOMMU_PLATFORM when explicitly asked by
the user. The need for this feature is to be able to attach the virtio
device to a virtual IOMMU.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Adding virtio feature VIRTIO_F_IOMMU_PLATFORM when explicitly asked by
the user. The need for this feature is to be able to attach the virtio
device to a virtual IOMMU.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Adding virtio feature VIRTIO_F_IOMMU_PLATFORM when explicitly asked by
the user. The need for this feature is to be able to attach the virtio
device to a virtual IOMMU.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Adding virtio feature VIRTIO_F_IOMMU_PLATFORM when explicitly asked by
the user. The need for this feature is to be able to attach the virtio
device to a virtual IOMMU.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Adding virtio feature VIRTIO_F_IOMMU_PLATFORM when explicitly asked by
the user. The need for this feature is to be able to attach the virtio
device to a virtual IOMMU.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Adding virtio feature VIRTIO_F_IOMMU_PLATFORM when explicitly asked by
the user. The need for this feature is to be able to attach the virtio
device to a virtual IOMMU.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The virtio specification defines a device can be reset, which was not
supported by this virtio-console implementation. The reason it is needed
is to support unbinding this device from the guest driver, and rebind it
to vfio-pci driver.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
We used to have errors definitions spread across vmm, vm, api,
and http.
We now have a cleaner separation: All API routines only return an
ApiResult. All VM operations, including the VMM wrappers, return a
VmResult. This makes it easier to carry errors up to the HTTP caller.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
In order to support further use cases where a VM configuration could be
modified through the HTTP API, we only store the passed VM config when
being asked to create a VM. The actual creation will happen when booting
a new config for the first time.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
We use the serde crate to serialize and deserialize the VmVConfig
structure. This structure will be passed from the HTTP API caller as a
JSON payload and we need to deserialize it into a VmConfig.
For a convenient use of the HTTP API, we also provide Default traits
implementations for some of the VmConfig fields (vCPUs, memory, etc...).
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
The linux_loader crate Cmdline struct is not serializable.
Instead of forcing the upstream create to carry a serde dependency, we
simply use a String for the passed command line and build the actual
CmdLine when we need it (in vm::new()).
Also, the cmdline offset is not a configuration knob, so we remove it.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
They point to a vm_virtio structure (VhostUserConfig) and in order to
make the whole config serializable (through the serde crate for
example), we'd have to add a serde dependency to the vm_virtio crate.
Instead we use a local, serializable structure and convert it to
VhostUserConfig from the DeviceManager code.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
The kernel path was the only mandatory command line option.
With the addition of the --api-socket option, we can run without a
kernel path and get it later through the API.
Since we can end up with VM configurations that are no longer valid by
default, we need to provide a validation check for it. For now, if the
kernel path is not defined, the VM configuration is invalid.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
The Cloud Hyper HTTP server runs a synchronous, multi-threaded
loop that receives HTTP requests and tries to call the corresponding
endpoint handlers for the requests URIs.
An endpoint handler will parse the HTTP request and potentially
translate it into and IPC request. The handler holds an notifier and an
mspc Sender for respectively notifying and sending the IPC payload to
the VMM API server. The handler then waits for an API server response
and translate it back into an HTTP response.
The HTTP server is responsible for sending the reponse back to the
caller.
The HTTP server uses a static routes hash table that maps URIs to
endpoint handlers.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
The cloud-hypervisor API uses HTTP as a transport and is accessible
through a local UNIX socket.
The API root path is /api/v1 and is a collection of RPC-style methods.
All methods are static, unlike typical REST APIs. Variable (e.g. device
IDs) are passed through the request body.
Fixes: #244
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Based off of crosvm revision b5237bbcf074eb30cf368a138c0835081e747d71
add a CMOS device. This environments that can't use KVM clock to get the
current time (e.g. Windows and EFI.)
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Refactor the PCI datastructures to move the device ownership to a PciBus
struct. This PciBus struct can then be used by both a PciConfigIo and
PciConfigMmio in order to expose the configuration space via both IO
port and also via MMIO for PCI MMCONFIG.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
In order to avoid introducing a dependency on arch in the devices crate
pass the constant in to the IOAPIC device creation.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Using the existing layout module start documenting the major regions of
RAM and those areas that are reserved. Some of the constants have also
been renamed to be more consistent and some functions that returned
constant variables have been replaced.
Future commits will move more constants into this file to make it the
canonical source of information about the memory layout.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
We now start the main VMM thread, which will be listening for VM and IPC
related events.
In order to start the configured VM, we no longer directly call the VM
API but we use the IPC instead, to first create and then start a VM.
Fixes: #303
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Based on the newly defined Cloud Hypervisor IPC, those helpers send
VmCreate and VmStart requests respectively. This will be used by the
main thread to create and start a VM based on the CLI parameters.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
This starts the main, single VMM thread, which:
1. Creates the VMM instance
2. Starts the VMM control loop
3. Manages the VMM control loop exits for handling resets and shutdowns.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Unlike the Vmm structure we removed with commit bdfd1a3f, this new one
is really meant to represent the VM monitoring/management object.
For that, we implement a control loop that will replace the one that's
currently embedded within the Vm structure itself.
This will allow us to decouple the VM lifecycle management from the VM
object itself, by having a constantly running VMM control loop.
Besides the VM specific events (exit, reset, stdin for now), the VMM
control loop also handles all the Cloud Hypervisor IPC requests.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
The VMM thread and control loop will be the sole consumer of the
EpollContext and EpollDispatch API, so let's move it to lib.rs.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Cloud Hypervisor IPC is a simple, mpsc based protocol for threads to
send command to the furture VMM thread. This patch adds the API
definition for that IPC, which will be used by both the main thread
to e.g. start a new VM based on the CLI arguments and the future HTTP
server to relay external requests received from a local Unix domain
socket.
We are moving it to its own "api" module because this is where the
external API (HTTP based) will also be implemented.
The VMM thread will be listening for IPC requests from an mpsc receiver,
process them and send a response back through another mpsc channel.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
As we're going to move the control loop to the VMM thread, the exit and
reset EventFds are no longer going to be owned by the VM.
We pass a copy of them when creating the Vm instead.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
In order to handle the VM STDIN stream from a separate VMM thread
without having to export the DeviceManager, we simply add a console
handling method to the Vm structure.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
In order to transfer the control loop to a separate VMM thread, we want
to shrink the VM control loop to a bare minimum.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Once passed to the VM creation routine, a VmConfig structure is
immutable. We can simply carry a Arc of it instead of a reference.
This also allows us to remove any lifetime bound from our VM.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
The Vmm structure is just a placeholder for the KVM instance. We can
create it directly from the VM creation routine instead.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
We can integrate the kernel loading into the VM start method.
The VM start flow is then: Vm::new() -> vm.start(), which feels more
natural.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Convert Path to PathBuf and remove the associated lifetime.
Now we can remove the VmConfig associated lifetime.
Fixes#298
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Since vhost-user-blk use same error definition with vhost-user-net,
those errors need define to common usage.
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
Probe for the size of the host physical address range and use that to
establish the address range for the VM. This removes the limitation on
the size of the VM RAM and gives more space for the devices.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
After the 32-bit gap the memory is shared between the devices and the
RAM. Ensure that the ACPI tables correctly indicate where the RAM ends
and the device area starts by patching the precompiled tables. We get
the following valid output now from the PCI bus probing (8GiB guest)
[ 0.317757] pci_bus 0000:00: resource 4 [io 0x0000-0x0cf7 window]
[ 0.319035] pci_bus 0000:00: resource 5 [io 0x0d00-0xffff window]
[ 0.320215] pci_bus 0000:00: resource 6 [mem 0x000a0000-0x000bffff window]
[ 0.321431] pci_bus 0000:00: resource 7 [mem 0xc0000000-0xfebfffff window]
[ 0.322613] pci_bus 0000:00: resource 8 [mem 0x240000000-0xfffffffff window]
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Rerrange "use" statements and make rename variables and fields to
indicate they might be unused.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This removes the register_devices() function with all that functionality
spread across the places where the devices are created.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Mark exit_evt with an underscore it may be unused (it is ignored if the
"acpi" feature is not turned on.)
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Add (non-default) support for using MMIO for virtio devices. This can be
tested by:
cargo build --no-default-features --features "mmio"
All necessary options will be included injected into the kernel
commandline.
Fixes: #243
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Rather than calling it at the very start of the VM execution (i.e. when
the VCPUs are created) do it as part of the DeviceManager creation.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Create the virtio devices independently of adding them to the PCI bus.
Instead accrue the devices in a vector and add them to the bus en-masse.
This will allow the virtio device creation to be used independently of
PCI based transport.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Rather than sending a signal to the signal handler used for handling
SIGWINCH calls instead use the crate provided termination method. This
also unregisters the signal handler which also means that there won't be
a leaked signal handler remaining.
This leaked signal handler is what was causing a failure to cleanup up
the thread on subsequent requests breaking two reboots in a row.
Fixes: #252
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
With ACPI disabled there is no way to support both reset and shutdown so
make the VMM exit if the VM is rebootet (via i8042 or triple-fault
reset.)
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This commit relies on the new vsock::unix module to create the backend
that will be used from the virtio-vsock device.
The concept of backend is interesting here as it would allow for a vhost
kernel backend to be plugged if that was needed someday.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Based on previous patch introducing the new flag "--vsock", this commit
creates a new virtio-vsock device based on the presence of this flag.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The new flag vsock is meant to be used in order to create a VM with a
virtio-vsock device attached to it. Two parameters are needed with this
device, "cid" representing the guest context ID, and "sock" representing
the UNIX socket path which can be accessed from the host.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The default number of MSI-X vector allocated was 2, which is the minimum
defined by the virtio specification. The reason for this minimum is that
virtio needs at least one interrupt to signal that configuration changed
and at least one to specify something happened regarding the virtqueues.
But this current implementation is not optimal because our VMM supports
as many MSI-X vectors as allowed by the MSI-X specification (2048 max).
For that reason, the current patch relies on the number of virtqueues
needed by the virtio device to determine the right amount of MSI-X
vectors needed. It's important not to forget the dedicated vector for
any configuration change too.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Refactor out DeviceManager into it's own file. This is part of a bigger
effort to reduce complexity in the vm.rs file but will also allow future
separation to allow making PCI support optional.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
For virtio-fs and virtio-pmem regions of memory are manually mapped into
the address space of the VMM. In order to cleanly reboot we need to
unmap those regions.
Fixes: #223
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Do this by using the same mechanism as the vCPU threads by sending a
signal to the thread. As this is the same mechanism reuse the same code
and rename the "vcpus" member to "threads" to indicate this represents
both the vCPU threads and also the signal handler thread.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Put the ACPI support behind a feature and ensure that the code compiles
without that feature by adding an extra build to Travis.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
As part of the cleanup of the VM shutdown all the vCPU threads. This is
achieved by toggling a shared atomic boolean variable which is checked
in the vCPU loop. To trigger the vCPU code to look at this boolean it is
necessary to send a signal to the vCPU which will interrupt the running
KVM_RUN ioctl.
Fixes: #229
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Being able to reboot requires us to identify all the resources we are
leaking and cleaning those up before we can enable reboot. For now if
the user requests a reboot then shutdown instead.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Sadly only the first few characters of the thread name is preserved so
use a shorter name for the vCPU thread for now. Also give the signal
handling thread a name.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Add an I/O port "device" to handle requests from the kernel to shutdown
or trigger a reboot, borrowing an I/O used for ACPI on the Q35 platform.
The details of this I/O port are included in the FADT
(SLEEP_STATUS_REG/SLEEP_CONTROL_REG/RESET_REG) with the details of the
value to write in the FADT for reset (RESET_VALUE) and in the DSDT for
shutdown (S5 -> 0x05)
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Add a 2nd EventFd to the VM to control resetting (rebooting) the VM this
supplements the EventFd used for managing shutdown of the VM.
The default behaviour on i8042 or triple-fault based reset is currently
unchanged i.e. it will trigger a shutdown.
In order to support restarting the VM it was necessary to make start()
function take a reference to the config.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Only add the ACPI PNP device for the COM1 serial port if it is not
turned off with "--serial off"
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Currently when the VCPU thread exits on an error the VMM continues to
run with no way of shutting down the main thread.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This patch factorizes the existing virtio-fs code by relying onto the
common code part of the vhost_user module in the vm-virtio crate.
In details, it factorizes the vhost-user setup, and reuses the error
types defined by the module instead of defining its own types.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
vhost-user-net introduced a new module vhost_user inside the vm-virtio
crate. Because virtio-fs is actually vhost-user-fs, it belongs to this
new module and needs to be moved there.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The currently directory handling process to open tempfile by
OpenOptions with custom_flags(O_TMPFILE) is workable for tmp
filesystem, but not workable for hugetlbfs, add new directory
handling process which works fine for both tmpfs and hugetlbfs.
Signed-off-by: Cathy Zhang <cathy.zhang@intel.com>
Following the refactoring of the code allowing multiple threads to
access the same instance of the guest memory, this patch goes one step
further by adding RwLock to it. This anticipates the future need for
being able to modify the content of the guest memory at runtime.
The reasons for adding regions to an existing guest memory could be:
- Add virtio-pmem and virtio-fs regions after the guest memory was
created.
- Support future hotplug of devices, memory, or anything that would
require more memory at runtime.
Because most of the time, the lock will be taken as read only, using
RwLock instead of Mutex is the right approach.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The VMM guest memory was cloned (copied) everywhere the code needed to
have ownership of it. In order to clean the code, and in anticipation
for future support of modifying this guest memory instance at runtime,
it is important that every part of the code share the same instance.
Because VirtioDevice implementations need to have access to it from
different threads, that's why Arc must be used in this case.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Latest clippy version complains about our existing code for the
following reasons:
- trait objects without an explicit `dyn` are deprecated
- `...` range patterns are deprecated
- lint `clippy::const_static_lifetime` has been renamed to
`clippy::redundant_static_lifetimes`
- unnecessary `unsafe` block
- unneeded return statement
All these issues have been fixed through this patch, and rustfmt has
been run to cleanup potential formatting errors due to those changes.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
We timestamp the VM creation time, and log the elapsed time between that
instant and the debug ioport events.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
The 0x80 IO port is typically used for BIOS debugging and testing on
bare metal x86 platforms.
We use that port and its dedicated 16 debug codes to time and track the
guest boot process.
Fixes#63
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
When the cache_size parameter from virtio-fs device is not empty, the
VMM creates a dedicated memory region where the shared files will be
memory mapped by the virtio-fs device.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In order to support the more performant version of virtio-fs, that is
the one relying on a shared memory region between host and guest, we
introduce two new parameters to the --fs device.
The "dax" parameter allows the user to choose if he wants to use the
shared memory region with virtio-fs. By default, this parameter is "on".
The "cache_size" parameter allows the user to specify the amount of
memory that should be shared between host and guest. By default, the
value of this parameter is 8Gib as advised by virtio-fs maintainers.
Note that dax=off and cache_size are incompatible.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
One of the features of the virtio console device is its size can be
configured and updated. Our first iteration of the console device
implementation is lack of this feature. As a result, it had a
default fixed size which could not be changed. This commit implements
the console config feature and lets us change the console size from
the vmm side.
During the activation of the device, vmm reads the current terminal
size, sets the console configuration accordinly, and lets the driver
know about this configuration by sending an interrupt. Later, if
someone changes the terminal size, the vmm detects the corresponding
event, updates the configuration, and sends interrupt as before. As a
result, the console device driver, in the guest, updates the console
size.
Signed-off-by: A K M Fazla Mehrab <fazla.mehrab.akm@intel.com>
Poor performance was observed when booting kernels with "console=ttyS0"
and the serial port disabled.
This change introduces a "null" console output mode and makes it the
default for the serial console. In this case the serial port
is advertised as per other output modes but there is no input and any
output is dropped.
Fixes: #163
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The structure of the vmm-sys-util crate has changed with lots of code
moving to submodules.
This change adjusts the use of the imported structs to reference the
submodules.
Fixes: #145
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The existing code taking care of the epoll loop was too restrictive as
it was propagating the error returned from the epoll_wait() syscall, no
matter what was the error. This causes the epoll loop to be broken,
leading to the VM termination.
This patch enforces the parsing of the returned error and prevent from
the error propagation in case it is EINTR, which stands for Interrupted.
In case the epoll loop is interrupted, it is appropriate to retry.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In order to fix the clippy error complaining about the number of
arguments passed to a function exceeding the maximum of 7 arguments,
this patch factorizes those parameters into a more global one called
VmInfo.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Since virtio-pmem uses a KVM user memory region, it needs to increment
the slot index in use to prevent from any conflict with further VFIO
allocations (used for mapping mappable memory BARs).
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
We add a Reserved region type at the end of the memory hole to prevent
32-bit devices allocations to overlap with architectural address ranges
like IOAPIC, TSS or APIC ones.
Eventually we should remove that reserved range by allocating all the
architectural ranges before letting 32-bit devices use the memory hole.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
We want to be able to differentiate between memory regions that must be
managed separately from the main address space (e.g. the 32-bit memory
hole) and ones that are reserved (i.e. from which we don't want to allow
the VMM to allocate address ranges.
We are going to use a reserved memory region for restricting the 32-bit
memory hole from expanding beyond the IOAPIC and TSS addresses.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
In order to correctly support multiple VFIO devices, we need to
increment the memory slot index every time it is being used to set some
user memory region through KVM. That's why the mem_slot parameter is
made mutable.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The memory slot index provided to the DeviceManager was wrong since
only the RAM memory regions are set as user memory regions to KVM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
KVM does not support multiple KVM VFIO devices to be created when
trying to support multiple VFIO devices. This commit creates one
global KVM VFIO device being shared with every VFIO device, which
makes possible the support for passing several devices through the
VM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
With the VFIO crate, we can now support directly assigned PCI devices
into cloud-hypervisor guests.
We support assigning multiple host devices, through the --device command
line parameter. This parameter takes the host device sysfs path.
Fixes: #60
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Our DeviceManager::new() routine is reaching north of 250 lines.
For simplicity and readbility sake, extract all virtio devices creation
code into their own routines.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
To use the implemented virtio console device, the users can select one
of the three options ("off", "tty" or "file=/path/to/the/file") with
the command line argument "--console". By default, the console is
enabled as a device named "hvc0" (option: tty). When "off" option is
used, the console device is not added to the VM configuration at all.
Signed-off-by: A K M Fazla Mehrab <fazla.mehrab.akm@intel.com>
With this new AddressAllocator as part of the SystemAllocator, the
VMM can now decide with finer granularity where to place memory.
By allocating the RAM and the hole into the MMIO address space, we
ensure that no memory will be allocated by accident where the RAM or
where the hole is.
And by creating the new MMIO hole address space, we create a subset
of the entire MMIO address space where we can place 32 bits BARs for
example.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
GSI (Global System Interrupt) is an extension of just a linear array of
IRQs. It takes IOAPICs into account for example.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
There is alignment support for AddressAllocator but there are occations
that the alignment is known only when we call allocate(). One example
is PCI BAR which is natually aligned, means for which we have to align
the base address to its size.
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
This is only for allocating the port IO address range.
If a platform does not have PIO devices at all, the address
range will simply be unused.
So, simplify the vm-allocator data structure by making both
MMIO and PIO mandatory.
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
This patch adds the support for both IO and Memory BARs by expecting
the function allocate_bars() to identify the type of each BAR.
Based on the type, register_mapping() insert the address range on the
appropriate bus (PIO or MMIO).
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Add a "--serial" command line that takes as input either "off", "tty"
(default and current behaviour) and "file=/path/to/file".
When "--serial off" is used the serial device is not added to the VM
configuration at all.
Integration tests added that check for interrupts present (or not) and
that when sending to a file the file contains the expected serial
output.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Now that cloud-hypervisor VMM supports virtio-pmem, it can directly
boot a VM from an image exposed as a persistent memory block device.
That's why there is no need to force the --disk option as being
mandatory.
Fixes#90
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Until now, the VMM was only accepting a single instance of virtio-net
device. This commit extends the virtio-net support by allowing several
devices to be created for a single VM.
Fixes#71
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
For every parameter dealing with a size as option, such as memory or
virtio-pmem, the CLI can now parse sizes with the suffixes K, M or G.
Fixes#70
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
PciConfigIo is a legacy pci bus dispatcher, which manages all pci
devices including a pci root bridge. However, it is unnecessary to
design a complex hierarchy which redirects every access by PciRoot.
Since pci root bridge is also a pci device instance, and only contains
easy config space read/write, and PciConfigIo actually acts as a pci bus
to dispatch resource based resolving when VMExit, we re-arrange to make
the pci hierarchy clean.
Signed-off-by: Jing Liu <jing2.liu@linux.intel.com>
Until now, the VMM was only accepting a single instance of virtio-pmem
device. This commit extend the virtio-pmem support by allowing several
devices to be created for a single VM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This patch plumbs the virtio-pmem device to the VMM. By adding a new
command line option "--pmem", we can now expose some persistent memory
to the guest OS, backed by the provided source.
The point of having such support in cloud-hypervisor is to be able to
share some memory between the host and the guest as DAXable.
One interesting use case is to boot directly from an image passed
through virtio-pmem, instead of going through virtio-blk. This can
allow good performances while avoiding the guest cache, which would
prevent the VM memory footprint from growing too much.
Fixes#68
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Until now, the VMM was only accepting a single instance of a virtio-fs
device. This commit extend the virtio-fs support by allowing several
devices to be created for a single VM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In the context of vhost-user, we need the guest RAM to be backed by
a file in order to be accessed by an external process. This patch
adds the new flag "file=" to the "--memory" option so that we can
specify from the command line if the memory needs to be backed, and
by which specific file.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The user can now share some files and directories with the guest by
providing the corresponding vhost-user socket. The virtiofsd daemon
should be started by the user before to start the VM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
BusDevice includes two methods which are only for PCI devices, which should
be as members of PciDevice trait for a better clean high level APIs.
Signed-off-by: Jing Liu <jing2.liu@linux.intel.com>
The previous commit introduced a userspace implementation of an IOAPIC
and this commits aims to plumb it into the cloud-hypervisor VMM.
Here is the list of new things brought by this patch:
- Update the rust-vmm/kvm-ioctls dependency to benefit from latest
patches including the support for split irqchip, and the vector
being returned when a VM exit is caused by an EOI.
- Enable the split irqchip (which means no IOAPIC or PIC is emulated
in kernel). This is done conditionally based on the support of the
TSC_DEADLINE_TIMER from both KVM and the underlying CPU. The
dependency on TSC_DEADLINE_TIMER is related to KVM which does not
support creating the in kernel PIT if it has a split irqchip.
- Rely on callbacks to handle the following use cases:
- in kernel IOAPIC + serial IRQ (pin based)
- in kernel IOAPIC + virtio-pci MSI-X
- in kernel IOAPIC + virtio-pci IRQ (pin based)
- userspace IOAPIC + serial IRQ (pin based)
- userspace IOAPIC + virtio-pci MSI-X
- userspace IOAPIC + virtio-pci IRQ (pin based)
Fixes#13
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This commit anticipate the future need from having support for both
in kernel and userspace IOAPIC. The way to signal an interrupt from
the serial device will vary depending on the use case, but this should
be independent from the serial implementation itself.
That's why this patch provides a generic trait for the serial device
to call from, so that it can trigger interrupts independently from the
IOAPIC type chosen (in kernel vs userspace).
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
VMM may load different format kernel image to start guest, we currently
only have elf loader support, so add bzimage loader support in case
that VMM would like to load bzimage.
Signed-off-by: Cathy Zhang <cathy.zhang@intel.com>
As more CPUID handling and CpuidPatch common code being added, it's
reasonable to move all the common code to the same place and in the
future we may consider move it to individual file when neccesary.
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
KVM exposes CPUID 0BH when host supports that, but the APIC ID that KVM
provides is the host APIC ID so we need replace that with ours.
Without this Linux guest reports something like:
[Firmware Bug]: CPU1: APIC id mismatch. Firmware: 1 APIC: 21
Fixes#42
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
As mentioned in the KVM documentation, TSC_DEADLINE_TIMER feature
needs some special checks to validate that it is supported as the
cpuid will always report it as disabled.
We need to use the KVM_CHECK_EXTENSION ioctl to request the value
of KVM_CAP_TSC_DEADLINE_TIMER. In case it is supported through
the local APIC emulation provided by the CREATE_IRQCHIP in KVM,
we have to set manually this feature by patching the cpuid.
Here quoted from the KVM documentation:
```
The TSC deadline timer feature (CPUID leaf 1, ecx[24]) is always
returned as false, since the feature depends on KVM_CREATE_IRQCHIP
for local APIC support. Instead it is reported via
ioctl(KVM_CHECK_EXTENSION, KVM_CAP_TSC_DEADLINE_TIMER)
if that returns true and you use KVM_CREATE_IRQCHIP, or if you
emulate the feature in userspace, then you can enable the feature
for KVM_SET_CPUID2.
```
This patch implements the behavior described above, and this allows
the VMM to remove the emulated Programmable Interval Timer (PIT) when
the TSC_DEADLINE_TIMER feature can be enabled.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
When the KVM capability KVM_CAP_SIGNAL_MSI is not present, the VMM
falls back from MSI-X onto pin based interrupts. Unfortunately, this
was not working as expected because the VirtioPciDevice object was
always creating an MSI-X capability structure in the PCI configuration
space. This was causing the guest drivers to expect MSI-X interrupts
instead of the pin based generated ones.
This patch takes care of avoiding the creation of a dedicated MSI-X
capability structure when MSI is not supported by KVM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In order to factorize the complexity brought by closures, this commit
merges IrqClosure and MsixClosure into a generic InterruptDelivery one.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In order to allow virtio-pci devices to use MSI-X messages instead
of legacy pin based interrupts, this patch implements the MSI-X
support for cloud-hypervisor. The VMM code and virtio-pci bits have
been modified based on the "msix" module previously added to the pci
crate.
Fixes#12
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In order to have access to the newly added signal_msi() function
from the kvm-ioctls crate, this commit updates the version of the
kvm-ioctls to the latest one.
Because set_user_memory_region() has been swtiched to "unsafe", we
also need to handle this small change in our cloud-hypervisor code
directly.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Because we cannot always assume the irq fd will be the way to send
an IRQ to the guest, this means we cannot make the assumption that
every virtio device implementation should expect an EventFd to
trigger an IRQ.
This commit organizes the code related to virtio devices so that it
now expects a Rust closure instead of a known EventFd. This lets the
caller decide what should be done whenever a device needs to trigger
an interrupt to the guest.
The closure will allow for other type of interrupt mechanism such as
MSI to be implemented. From the device perspective, it could be a
pin based interrupt or an MSI, it does not matter since the device
will simply call into the provided callback, passing the appropriate
Queue as a reference. This design keeps the device model generic.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
When not running on a tty (tested with libc's isatty()) disable stdin
and do not reconfigure the terminal.
This is required to ensure that the VM responds correctly when running
in a headless environment such as Jenkins.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Instead return from the control_loop() and calling function cleanly.
This is helpful for the testing framework as that means we can launch
multiple VMs in a row.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The command line parsing of the user input was not properly
abstracted from the vmm specific code. In the case of --net,
the parsing was done when the device manager was adding devices.
In order to fix this confusion, this patch introduces a new
module "config" dedicated to the translation of a VmParams
structure into a VmCfg structure. The former is built based
on the input provided by the user, while the latter is the
result of the parsing of every options.
VmCfg is meant to be consumed by the vmm specific code, and
it is also a fully public structure so that it can directly
be built from a testing environment.
Fixes#31
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Store the list of disks in a Vec<PathBuf> and then iterate over that
when creating the block devices.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Use a catchall case for all reasons that we do not handle, and
move the vCPU run switch into its own function.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
In order to get meaningful error messages, we want to make sure all
errors are passed up the call stack. This patch fixes this previous
limitation by separating errors related to the DeviceManager from
errors related to the Vm.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Until now, the only way to get some networking with cloud-hypervisor
was to let the user create a TAP interface first, and then to provide
the name of this interface to the VMM.
This patch extend the previous behavior by adding the support for the
creation of a brand new TAP interface from the VMM itself. In case no
interface name is provided through "tap=<if_name>", we will assume
the user wants the VMM to create and set the interface on its behalf,
no matter the value of other parameters (ip, mask, and mac).
In this same scenario, because the user expects the VMM to create the
TAP interface, he can also provide the associated IP address and subnet
mask associated with it. In case those values are not provided, some
default ones will be picked.
No matter the value of "tap", the MAC address will always be set, and
if no value is provided, the VMM will come up with a default value for
it.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Most of the code is taken from crosvm(bbd24c5) but is modified to
be adapted to the current VirtioDevice definition and epoll
implementation.
A new command option '--rng' is provided and it gives one the option
to override the entropy source which is /dev/urandom by default.
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Since more virtio devices will be added and this code can be reused
for any type of virtio device.
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
This patch expand the device registration to add a new virtio-net
device in case the user provide the appropriate flag --net from the
command line.
If the flag is provided, the code will parse the TAP interface name
and the expected MAC address from the command line. The VM will be
connected to the provided TAP interface, and it will communicate the
MAC address to the virtio-net driver.
If the flag is not provided, the VM will not register any virtio-net
device, therefore it will not have any connectivity with the host.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Add the BSD and Apache license.
Make all crosvm references point to the BSD license.
Add the right copyrights and identifier to our VMM code.
Add Intel copyright to the vm-virtio and pci crates.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
In order to have proper output from the serial, we need to setup the
terminal in raw mode. When the VM is shutting down, it is also the
VMM responsibility to set the terminal back into canonical mode if we
don't want to get any weird behavior from the shell.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In order to let the user choose which kernel parameters to append, the
kernel boot parameters can be now specified from the command line.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Based on the new virtio-blk support, this commit allows any user to
specify a --disk option in order to select the rootfs it wants to
use for the VM.
For now it assumes the partition 3 /dev/vd3 is the one where we can
find the rootfs.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
After the virtio-blk device support has been introduced in the
previous commit, the vmm need to rely on this new device to boot
from disk images instead of initrd built into the kernel.
In order to achieve the proper support of virtio-blk, this commit
had to handle a few things:
- Register an ioevent fd for each virtqueue. This important to be
notified from the virtio driver that something has been written
on the queue.
- Fix the retrieval of 64bits BAR address. This is needed to provide
the right address which need to be registered as the notification
address from the virtio driver.
- Fix the write_bar and read_bar functions. They were both assuming
to be provided with an address, from which they were trying to
find the associated offset. But the reality is that the offset is
directly provided by the Bus layer.
- Register a new virtio-blk device as a virtio-pci device from the
vm.rs code. When the VM is started, it expects a block device to
be created, using this block device as the VM rootfs.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Introduce emulation of i8042 device to allow the guest to stop the
VM by issuing a reset event.
The device has been copied over from the Crosvm code base, relying on
the commit 0268e26e1ac9e09aa51d733482c5df139cd8d588.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
An exit event is required to be created and handled for the purpose
of letting the guest kernel stop the VM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Instead of handling stdin in its own separate loop, we use a generic
one that can be reused for other events handling.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
After starting all vCPUs, we loop for STDIN input.
We need a more scalable eventfd control loop, obviously.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Based on the Firecracker devices crate from commit 9cdb5b2.
It is a trimmed down version compared to the Firecracker one, to remove
a bunch of pulled dependencies (logger, metrics, rate limiter, etc...).
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>