For all VFIO devices, map all non-emulated MMIO regions to
the vfio container to allow PCIe P2P between all VFIO devices
on the virtual machine. This is required for a wide variety of
advanced GPU workloads such as GPUDirect P2P (DMA between two
GPUs), GPUDirect RDMA (DMA between a GPU and an IB device).
Signed-off-by: Thomas Barrett <tbarrett@crusoeenergy.com>
This fixes all typos found by the typos utility with respect to the config file.
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
The fixup_msix_region() function added in
a718716831
made the assumption that MSI-X was always available. This is the case
with many VFIO devices and all our virtio devices but created regression
with MSI devices.
Simply return the existing region size if MSI-X is not supported by the
device.
Fixes: #5649
Signed-off-by: Rob Bradford <rbradford@rivosinc.com>
Currently, vfio device fails to initialize as the msix-cap region in BAR
is mapped as RW region.
To resolve the initialization issue, this commit avoids mapping the
msix-cap region in the BAR. However, this solution introduces another
problem where aligning the msix table offset in the BAR to the page
size may cause overlap with the MMIO RW region, leading to reduced
performance. By enlarging the entire region in the BAR and relocating
the msix table to achieve page size alignment, this problem can be
overcomed effectively.
Fixes: #5292
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
In current implementation, memory region used in vfio is assumed to
align to 4k which may cause error when the PAGE_SIZE is not 4k, like on
Arm, it can be 16k and 64k.
Remove this assumption and align memory resource used by vfio to
PAGE_SIZE then vfio can run on host with 64k PAGE_SIZE.
Fixes: #5292
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
The information about the identifier related to a Snapshot is only
relevant from the BTreeMap perspective, which is why we can get rid of
the duplicated identifier in every Snapshot structure.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
There's no reason to carry a HashMap of SnapshotDataSection per
Snapshot. And given we now provide at most one SnapshotDataSection per
Snapshot, there's no need to keep the id part of the SnapshotDataSection
structure.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This is some preliminatory work for moving both VfioUser and Vfio to the
new restore design.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The code for restoring a VirtioPciDevice has been updated, including the
dependencies VirtioPciCommonConfig, MsixConfig and PciConfiguration.
It's important to note that both PciConfiguration and MsixConfig still
have restore() implementations because Vfio and VfioUser devices still
rely on the old way for restore.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
There are PCI extended capabilities that can't be passed through the VM
as they would be unusable from a guest perspective. That's why we
introduce a way to patch what is returned to the guest when the PCI
configuration space is accessed. The list of patches is created from the
parsing of the extended capabilities in that case, and particularly
based on the presence of the SRIOV and Resizable BAR capabilities.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Signed-off-by: Steven Dake <sdake@lambdal.com>
If the BAR for the VFIO device is marked as prefetchable on the
underlying device ensure that the BAR exposed through PciConfiguration
is also marked as prefetchable.
Fixes problem where NVIDIA devices are not usable with PCI VFIO
passthrough. See related NVIDIA kernel driver bug:
https://github.com/NVIDIA/open-gpu-kernel-modules/issues/344.
Fixes: #4451
Signed-off-by: Steven Dake <sdake@lambdal.com>
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
When restoring a VM, the restore codepath will take care of mapping the
MMIO regions based on the information from the snapshot, rather than
having the mapping being performed during device creation.
When the device is created, information such as which BARs contain the
MSI-X tables are missing, preventing to perform the mapping of the MMIO
regions.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Based on the VfioCommon implementation, the VfioPciDevice now implements
the Migratable trait.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Introduces the common code to handle one aspect of the migration
support. Particularly, the ability to store VMM internal states related
to such device. The internal state of the device will happen later in a
dedicated patchset that will implement the VFIO migration API.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
We can return prematurely from 'map_mmio_regions()' (e.g. when a mmap call
failed for vfio or 'create_user_memory_region()' failed for vfio-user)
without updating the 'MmioRegion::user_memory_regions' with the
information of previous successful mmaps, which in turn would cause mmap
leaks particularly for the case of hotplug where the 'vmm' thread will
keep running. To fix the issue, let's keep 'MmioRegion::user_memory_regions'
updated right after successful mmap calls.
Fixes: #4068
Signed-off-by: Bo Chen <chen.bo@intel.com>
Reorganizing the code to leverage the same mechanics implemented for
vfio-user and aimed at supporting sparse memory mappings for a single
region.
Relying on the capabilities returned by the vfio-ioctls crate, we create
a list of sparse areas depending if we get SPARSE_MMAP or MSIX_MAPPABLE
capability, or a single sparse area in case we couldn't find any
capability.
The list of sparse areas is then used to create both the memory mappings
in the Cloud Hypervisor address space and the hypervisor user memory
regions.
This allowed for the simplification of the MmioRegion structure.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Similar to what's being supported for vfio devices, vfio-user devices
may also have BARs that need multiple kvm user memory regions,
e.g. device regions with `VFIO_REGION_INFO_CAP_SPARSE_MMAP`.
Signed-off-by: Bo Chen <chen.bo@intel.com>
Extend VfioCommon to simplify the overall code, and also in preparation
for supporting the restore code path.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Extend VfioCommon structure to own the legacy interrupt manager. This
will be useful for implementing the restore code path.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Extend VfioCommon structure to own the MSI interrupt manager. This will
be useful for implementing the restore code path.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
We need to split the parsing functions into one function dedicated to
the actual parsing and a second function for initializing the interrupt
type. This will be useful on the restore path as the parsing won't be
needed.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In case a list of resources is provided to allocate_bars(), it directly
means we're restoring some existing BARs. That's why we shouldn't share
the codepath that creates BARs from scratch as we don't need to interact
with the device to retrieve the information.
Whenever resources are provided, we simply iterate over the list of
possible BAR indexes and create the BARs if the resource could be found.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Instead of defining some very generic resources as PioAddressRange or
MmioAddressRange for each PCI BAR, let's move to the new Resource type
PciBar in order to make things clearer. This allows the code for being
more readable, but also removes the need for hard assumptions about the
MMIO and PIO ranges. PioAddressRange and MmioAddressRange types can be
used to describe everything except PCI BARs. BARs are very special as
they can be relocated and have special information we want to carry
along with them.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In order to make the code more consistent and easier to read, we remove
the former tuple that was used to describe a BAR, replacing it with the
existing structure PciBarConfiguration.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The code was quite unclear regarding the type of index that was being
used regarding a BAR. This is improved by differenciating register
indexes and BAR indexes more clearly.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
By adding a new method id() to the PciDevice trait, we allow the caller
to retrieve a unique identifier. This is used in the context of BAR
relocation to identify the device being relocated, so that we can update
the DeviceTree resources for all PCI devices (and not only
VirtioPciDevice).
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Relying on the list of resources, VfioCommon is now able to allocate the
BARs at specific addresses. This will be useful for restoring VFIO and
vfio-user devices.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Updating the way of restoring BAR addresses for virtio-pci by providing
a more generic approach that will be reused for other PciDevice
implementations (i.e VfioPcidevice and VfioUserPciDevice).
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
AArch64 does not use IOBAR, and current code of panics the whole VMM if
we need to allocate the IOBAR.
This commit checks if IOBAR is enabled before the arch conditional code
of IOBAR allocation and if the IOBAR is not enabled, we can just skip
the IOBAR allocation and do nothing.
Fixes: https://github.com/cloud-hypervisor/cloud-hypervisor/issues/3479
Signed-off-by: Henry Wang <Henry.Wang@arm.com>
Now that vfio-ioctls correctly exposes the list of capabilities related
to each region, Cloud Hypervisor can decide to mmap a region based on
the presence or absence of MSIX_MAPPABLE. Instead of blindly mmap'ing
the region, we check if the MSI-X table or PBA is present on the BAR,
and if that's the case, we look for MSIX_MAPPABLE.
If MSIX_MAPPABLE is present, we can go ahead and mmap the entire region.
If MSIX_MAPPABLE is not present, we simply ignore the mmap'ing of this
region as it wouldn't be supported.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Since each segment must have a non-overlapping memory range associated
with it the device memory must be equally divided amongst all segments.
A new allocator is used for each segment to ensure that BARs are
allocated from the correct address ranges. This requires changes to
PciDevice::allocate/free_bars to take that allocator and when
reallocating BARs the correct allocator must be identified from the
ranges.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
In order to get proper performance out of hardware that might be passed
through the VM with VFIO, we decide to try mapping any region marked as
mappable, ignoring the failure and moving on to the next region. When
the mapping succeeds, we establish a list of user memory regions so that
we enable only subparts of the global mapping through the hypervisor.
This allows MSI-X table and PBA to keep trapping accesses from the guest
so that the VMM can properly emulate MSI-X.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Taking advantage of the refactored VFIO code implement a new
VfioUserPciDevice that wraps the client for vfio-user and exposes the
BusDevice and PciDevice into the VMM.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
After the refactoring to split the common VFIO code out for vfio-user
there were some inconsistencies in the error handling. Correct this so
that the error is independent of the transport (hardware vs user) VFIO
and migrate to anyhow/thiserror in the process. Some unused errors from
earlier refactoring have also been removed.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Returning a reference is not possible for the vfio-user code as it is
constructed for the function.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Rename the wrapper trait and structs since this will be used for more
than reading the PCI config.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>