In anticipation of inserting the DeviceManager on the IO/MMIO buses,
the DeviceManager must implement the BusDevice trait.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Create a small method that will perform both hotplug of all the devices
identified by PCIU bitmap, and then perform the hotunplug of all the
devices identified by the PCID bitmap.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The _EJ0 method provides the guest OS a way to notify the VMM that the
device has been properly ejected from the guest OS. Only after this
point, the VMM can fully remove the device.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This new PHPR device in the DSDT table introduces some specific
operation regions and the associated fields.
PCIU stands for "PCI up", which identifies PCI devices that must be
added.
PCID stands for "PCI down", which identifies PCI devices that must be
removed.
B0EJ stands for "Bus 0 eject", which identifies which device on the bus
has been ejected by the guest OS.
Thanks to these fields, the VMM and the guest OS can communicate while
performing hotplug/hotunplug operations.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Adds the DVNT method to the PCI0 device in the DSDT table. This new
method is responsible for checking each slot and notify the guest OS if
one of the slots is supposed to be added or removed.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This commit introduces the ACPI support for describing the 32 device
slots attached to the main PCI host bridge.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In anticipation of the support for device hotplug, this commit moves the
DeviceManager object into an Arc<Mutex<>> when the DeviceManager is
being created. The reason is, we need the DeviceManager to implement the
BusDevice trait and then provide it to the IO bus, so that IO accesses
related to device hotplug can be handled correctly.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
We want to prevent from losing interrupts while they are masked. The
way they can be lost is due to the internals of how they are connected
through KVM. An eventfd is registered to a specific GSI, and then a
route is associated with this same GSI.
The current code adds/removes a route whenever a mask/unmask action
happens. Problem with this approach, KVM will consume the eventfd but
it won't be able to find an associated route and eventually it won't
be able to deliver the interrupt.
That's why this patch introduces a different way of masking/unmasking
the interrupts, simply by registering/unregistering the eventfd with the
GSI. This way, when the vector is masked, the eventfd is going to be
written but nothing will happen because KVM won't consume the event.
Whenever the unmask happens, the eventfd will be registered with a
specific GSI, and if there's some pending events, KVM will trigger them,
based on the route associated with the GSI.
Suggested-by: Liu Jiang <gerry@linux.alibaba.com>
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Recently, vhost_user_block gained the ability of actively polling the
queue, a feature that can be disabled with the poll_queue property.
This change adds this property to DiskConfig, so it can be used
through the "disk" argument.
For the moment, it can only be used when vhost_user=true, but this
will change once virtio-block gets the poll_queue feature too.
Fixes: #787
Signed-off-by: Sergio Lopez <slp@redhat.com>
Fix "readonly" and "wce" defaults in cloud-hypervisor.yaml to match
their respective defaults in config.rs:DiskConfig.
Signed-off-by: Sergio Lopez <slp@redhat.com>
It's missing a few knobs (readonly, vhost, wce) that should be exposed
through the rest API.
Fixes: #790
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
The kernel does not adhere to the ACPI specification (probably to work
around broken hardware) and rather than busy looping after requesting an
ACPI reset it will attempt to reset by other mechanisms (such as i8042
reset.)
In order to trigger a reset the devices write to an EventFd (called
reset_evt.) This is used by the VMM to identify if a reset is requested
and make the VM reboot. As the reset_evt is part of the VMM and reused
for both the old and new VM it is possible for the newly booted VM to
immediately get reset as there is an old event sitting in the EventFd.
The simplest solution is to "drain" the reset_evt EventFd on reboot to
make sure that there is no spurious events in the EventFd.
Fixes: #783
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Relying on the latest vm-memory version, including the freshly
introduced structure GuestMemoryAtomic, this patch replaces every
occurrence of Arc<ArcSwap<GuestMemoryMmap> with
GuestMemoryAtomic<GuestMemoryMmap>.
The point is to rely on the common RCU-like implementation from
vm-memory so that we don't have to do it from Cloud-Hypervisor.
Fixes#735
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
If no socket is supplied when enabling "vhost_user=true" on "--disk"
follow the "exe" path in the /proc entry for this process and launch the
network backend (via the vmm_path field.)
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
When a virtio-fs device is created with a dedicated shared region, by
default the region should be mapped as PROT_NONE so that no pages can be
faulted in.
It's only when the guest performs the mount of the virtiofs filesystem
that we can expect the VMM, on behalf of the backend, to perform some
new mappings in the reserved shared window, using PROT_READ and/or
PROT_WRITE.
Fixes#763
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
If no socket is supplied when enabling "vhost_user=true" on "--net"
follow the "exe" path in the /proc entry for this process and launch the
network backend (via the vmm_path field.)
Currently this only supports creating a new tap interface as the network
backend also only supports that.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
It is necessary to do this at the start of the VMM execution rather than
later as it must be done in the main thread in order to satisfy the
checks required by PTRACE_MODE_READ_FSCREDS (see proc(5) and
ptrace(2))
The alternative is to run as CAP_SYS_PTRACE but that has its
disadvantages.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
If the ioctl syscall KVM_CREATE_VM gets interrupted while creating the
VM, it is expected that we should retry since EINTR should not be
considered a standard error.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Having the InterruptManager trait depend on an InterruptType forces
implementations into supporting potentially very different kind of
interrupts from the same code base. What we're defining through the
current, interrupt type based create_group() method is a need for having
different interrupt managers for different kind of interrupts.
By associating the InterruptManager trait to an interrupt group
configuration type, we create a cleaner design to support that need as
we're basically saying that one interrupt manager should have the single
responsibility of supporting one kind of interrupt (defined through its
configuration).
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
We create 2 different interrupt managers for separately handling
creation of legacy and MSI interrupt groups.
Doing so allows us to have a cleaner interrupt manager and IOAPIC
initialization path. It also prepares for an InterruptManager trait
design improvement where we remove the interrupt source type dependency
by associating an interrupt configuration type to the trait.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
A reference to the VmFd is stored on the AddressManager so it is not
necessary to pass in the VmInfo into all methods that need it as it can
be obtained from the AddressManager.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The DeviceManager has a reference to the MemoryManager so use that to
get the GuestMemoryMmap rather than the version stored in the VmInfo
struct.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Remove the use of vm_info in methods to get the config and instead use
the config stored on the DeviceManager itself.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Remove some in/out parameters and instead rely on them as members of the
&mut self parameter. This prepares the way to more easily store state on
the DeviceManager.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Remove some in/out parameters and instead rely on them as members of the
&mut self parameter. A follow-up commit will change the callee functions
that create the devices themselves.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Modify these functions to take an &mut self and become methods on
DeviceManager. This allows the removal of some in/out parameters and
leads the way to further refactoring and simplification.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The MemoryManager should only be included on the I/O bus when doing ACPI
builds as that is the only time it will be interrogated.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Currently the MemoryManager is only used on the ACPI code paths after
the DeviceManager has been created. This will change in a future commit
as part of the refactoring so for now always include it but name it with
underscore prefix to indicate it might not always be used.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Now that devices attached to the virtual IOMMU are described through
virtio configuration, there is no need for the DeviceManager to store
the list of IDs for all these devices. Instead, things are handled
locally when PCI devices are being added.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Instead of relying on the ACPI tables to describe the devices attached
to the virtual IOMMU, let's use the virtio topology, as the ACPI support
is getting deprecated.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Add a socket and vhost_user parameter to this option so that the same
configuration option can be used for both virtio-block and
vhost-user-block. For now it is necessary to specify both vhost_user
and socket parameters as auto activation is not yet implemented. The wce
parameter for supporting "Write Cache Enabling" is also added to the
disk configuration.
The original command line parameter is still supported for now and will
be removed in a future release.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Add a socket and vhost_user parameter to this option so that the same
configuration option can be used for both virtio-net and vhost-user-net.
For now it is necessary to specify both vhost_user and socket parameters
as auto activation is not yet implemented. The original command line
parameter is still supported for now.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This commit improves the existing virtio-blk implementation, allowing
for better I/O performance. The cost for the end user is to accept
allocating more vCPUs to the virtual machine, so that multiple I/O
threads can run in parallel.
One thing to notice, the amount of vCPUs must be egal or superior to the
amount of queues dedicated to the virtio-blk device.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The number of queues and the size of each queue were not configurable.
In anticipation for adding multiqueue support, this commit introduces
some new parameters to let the user decide about the number of queues
and the queue size.
Note that the default values for each of these parameters are identical
to the default values used for vhost-user-blk, that is 1 for the number
of queues and 128 for the queue size.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Devices like virtio-pmem and virtio-fs require some dedicated memory
region to be mapped. The memory mapping from the DeviceManager is being
replaced by the usage of MmapRegion from the vm-memory crate.
The unmap will happen automatically when the MmapRegion will be dropped,
which should happen when the DeviceManager gets dropped.
Fixes#240
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Move GED device reporting of required device type to scan into an MMIO
region rather than an I/O port.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Rather than have the MemoryManager device sit on the I/O bus allocate
space for MMIO and add it to the MMIO bus.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This commit relies on the interrupt manager and the resulting interrupt
source group to abstract the knowledge about KVM and how interrupts are
updated and delivered.
This allows the entire "devices" crate to be freed from kvm_ioctls and
kvm_bindings dependencies.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The interrupt manager is passed to the IOAPIC creation, and the IOAPIC
now creates an InterruptSourceGroup for MSI interrupts based on it.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
By introducing a new InterruptManager dedicated to the IOAPIC, we don't
have to solve the chicken and eggs problem about which of the
InterruptManager or the Ioapic should be created first. It's also
totally fine to have two interrupt manager instances as they both share
the same list of GSI routes and the same allocator.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
vhost_user_blk already has it, so it's only fair to give it to
virtio-blk too. Extend DiskConfig with a 'direct' property, honor
it while opening the file backing the disk image, and pass it to
vm_virtio::RawFile.
Fixes#631
Signed-off-by: Sergio Lopez <slp@redhat.com>
vhost_user_blk already has it, so it's only fair to give it to
virtio-blk too. Extend DiskConfig with a 'readonly' properly, and pass
it to vm_virtio::Block.
Signed-off-by: Sergio Lopez <slp@redhat.com>
The build is run against "--all-features", "pci,acpi", "pci" and "mmio"
separately. The clippy validation must be run against the same set of
features in order to validate the code is correct.
Because of these new checks, this commit includes multiple fixes
related to the errors generated when manually running the checks.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
There's no need for assign_irq() or assign_msix() functions from the
PciDevice trait, as we can see it's never used anywhere in the codebase.
That's why it's better to remove these methods from the trait, and
slightly adapt the existing code.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Since the InterruptManager is never stored into any structure, it should
be passed as a reference instead of being cloned.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This commit replaces the way legacy interrupts were handled with the
brand new implementation of the legacy InterruptSourceGroup for KVM.
Additionally, since it removes the last bit relying on the Interrupt
trait, the trait and its implementation can be removed from the
codebase.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This commit replaces the way legacy interrupts were handled with the
brand new implementation of the legacy InterruptSourceGroup for KVM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This commit replaces the way legacy interrupts were handled with the
brand new implementation of the legacy InterruptSourceGroup for KVM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Relying on the previous commits, the legacy interrupt implementation can
be completed. The IOAPIC handler is used to deliver the interrupt that
will be triggered through the trigger() method.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
By having a reference to the IOAPIC, the KvmInterruptManager is going
to be able to initialize properly the legacy interrupt source group.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In order to be able to use the InterruptManager abstraction with
virtio-mmio devices, this commit introduces InterruptSourceGroup's
skeleton for legacy interrupts.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
When KvmInterruptManager initializes a new InterruptSourceGroup, it's
only for PCI_MSI_IRQ case that it needs to allocate the GSI and create a
new InterruptRoute. That's why this commit moves the general code into
the specific use case.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
When the base InterruptIndex is different from 0, the loop allocating
GSI and HashMap entries won't work as expected. The for loop needs to
start from base, but the limit must be base+count so that we allocate
a number of "count" entries.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In order to let the InterruptManager be shared across both PCI and MMIO
devices, this commit moves the initialization earlier in the code.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
After refactoring a common function is used to setup these slots and
that function takes care of allocating a new slot so it is not necessary
to reserve the initial region slots.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Based on all the previous changes, we can at this point replace the
entire interrupt management with the implementation of InterruptManager
and InterruptSourceGroup traits.
By using KvmInterruptManager from the DeviceManager, we can provide both
VirtioPciDevice and VfioPciDevice a way to pick the kind of
InterruptSourceGroup they want to create. Because they choose the type
of interrupt to be MSI/MSI-X, they will be given a MsiInterruptGroup.
Both MsixConfig and MsiConfig are responsible for the update of the GSI
routes, which is why, by passing the MsiInterruptGroup to them, they can
still perform the GSI route management without knowing implementation
details. That's where the InterruptSourceGroup is powerful, as it
provides a generic way to manage interrupt, no matter the type of
interrupt and no matter which hypervisor might be in use.
Once the full replacement has been achieved, both SystemAllocator and
KVM specific dependencies can be removed.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
After the skeleton of InterruptManager and InterruptSourceGroup traits
have been implemented, this new commit takes care of fully implementing
the content of KvmInterruptManager (InterruptManager trait) and
MsiInterruptGroup (InterruptSourceGroup).
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This commit introduces an empty implementation of both InterruptManager
and InterruptSourceGroup traits, as a proper basis for further
implementation.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Callbacks are not the most Rust idiomatic way of programming. The right
way is to use a Trait to provide multiple implementation of the same
interface.
Additionally, a Trait will allow for multiple functions to be defined
while using callbacks means that a new callback must be introduced for
each new function we want to add.
For these two reasons, the current commit modifies the existing
VirtioInterrupt callback into a Trait of the same name.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Because MsixConfig will be responsible for updating KVM GSI routes at
some point, it is necessary that it can access the list of routes
contained by gsi_msi_routes.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Because MsixConfig will be responsible for updating the KVM GSI routes
at some point, it must have access to the VmFd to invoke the KVM ioctl
KVM_SET_GSI_ROUTING.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The point here is to let MsixConfig take care of the GSI allocation,
which means the SystemAllocator must be passed from the vmm crate all
the way down to the pci crate.
Once this is done, the GSI allocation and irq_fd creation is performed
by MsixConfig directly.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Because we will need to share the same list of GSI routes across
multiple PCI devices (virtio-pci, VFIO), this commit moves the creation
of such list to a higher level location in the code.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Use RawFile as backend instead of File. This allows us to abstract
the access to the actual image with a specialized layer, so we have a
place where we can deal with the low-level peculiarities.
Signed-off-by: Sergio Lopez <slp@redhat.com>
Doing I/O on an image opened with O_DIRECT requires to adhere to
certain restrictions, requiring the following elements to be aligned:
- Address of the source/destination memory buffer.
- File offset.
- Length of the data to be read/written.
The actual alignment value depends on various elements, and according
to open(2) "(...) there is currently no filesystem-independent
interface for an application to discover these restrictions (...)".
To discover such value, we iterate through a list of alignments
(currently, 512 and 4096) calling pread() with each one and checking
if the operation succeeded.
We also extend RawFile so it can be used as a backend for QcowFile,
so the later can be easily adapted to support O_DIRECT too.
Signed-off-by: Sergio Lopez <slp@redhat.com>
Update the common part in net_util.rs under vm-virtio to add mq
support, meanwhile enable mq for virtio-net device, vhost-user-net
device and vhost-user-net backend. Multiple threads will be created,
one thread will be responsible to handle one queue pair separately.
To gain the better performance, it requires to have the same amount
of vcpus as queue pair numbers defined for the net device, due to
the cpu affinity.
Multiple thread support is not added for vhost-user-net backend
currently, it will be added in future.
Signed-off-by: Cathy Zhang <cathy.zhang@intel.com>
Add num_queues and queue_size for virtio-net device to make them configurable,
while add the associated options in command line.
Update cloud-hypervisor.yaml with the new options for NetConfig.
Signed-off-by: Cathy Zhang <cathy.zhang@intel.com>
Add support to allow VMMs to open the same tap device many times, it will
create multiple file descriptors meanwhile.
Signed-off-by: Cathy Zhang <cathy.zhang@intel.com>
Since the common parts are put into net_util.rs under vm-virtio,
refactoring code for virtio-net device, vhost-user-net device
and backend to shrink the code size and improve readability
meanwhile.
Signed-off-by: Cathy Zhang <cathy.zhang@intel.com>