Commit Graph

8033 Commits

Author SHA1 Message Date
Sebastien Boeuf
49ef201cd1 vfio: pci: Provide the right MSI-X table offset
When reading from or writing to the MSI-X table, the function provided
by the PCI crate expects the offset to start from the beginning of the
table. That's why it is VFIO specific code to be responsible for
providing the right offset, which means it needs to be the offset
substracted by the beginning of the MSI-X table offset.

This bug was not discovered until we tested VFIO with some device where
the MSI-X table was placed on a BAR at an offset different from 0x0.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2019-08-02 09:45:20 +02:00
Sebastien Boeuf
a548a01423 pci: Fix MSI-X table and PBA offsets
The offsets returned by the table_offset() and pba_offset() function
were wrong as they were shifting the value by 3 bits. The MSI-X spec
defines the MSI-X table and PBA offsets as being defined on 3-31 bits,
but this does not mean it has to be shifted. Instead, the address is
still on 32 bits and assumes the LSB bits 0-2 are set to 0.

VFIO was working fine with devices were the MSI-X offset were 0x0, but
the bug was found on a device where the offset was non-null.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2019-08-02 09:45:20 +02:00
Sebastien Boeuf
baec27698e vm-virtio: Don't break from epoll loop on EINTR
The existing code taking care of the epoll loop was too restrictive as
it was propagating the error returned from the epoll_wait() syscall, no
matter what was the error. This causes the epoll loop to be broken,
leading to a non-functional virtio device.

This patch enforces the parsing of the returned error and prevent from
the error propagation in case it is EINTR, which stands for Interrupted.
In case the epoll loop is interrupted, it is appropriate to retry.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2019-08-02 08:37:34 +01:00
Sebastien Boeuf
1a484a82f9 vmm: Don't break from epoll loop on EINTR
The existing code taking care of the epoll loop was too restrictive as
it was propagating the error returned from the epoll_wait() syscall, no
matter what was the error. This causes the epoll loop to be broken,
leading to the VM termination.

This patch enforces the parsing of the returned error and prevent from
the error propagation in case it is EINTR, which stands for Interrupted.
In case the epoll loop is interrupted, it is appropriate to retry.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2019-08-02 08:37:34 +01:00
Sebastien Boeuf
532f6a96f3 vmm: Factorize VM related information into a structure
In order to fix the clippy error complaining about the number of
arguments passed to a function exceeding the maximum of 7 arguments,
this patch factorizes those parameters into a more global one called
VmInfo.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2019-08-02 08:35:16 +01:00
Sebastien Boeuf
c0756c429d vmm: Increase memory slot from virtio-pmem
Since virtio-pmem uses a KVM user memory region, it needs to increment
the slot index in use to prevent from any conflict with further VFIO
allocations (used for mapping mappable memory BARs).

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2019-08-02 08:35:16 +01:00
Sebastien Boeuf
8c4c162109 arch: x86_64: Set MTRR default memory type as WB
In the context of VFIO, we use Vt-d, which means we rely on an IOMMU.
Depending on the IOMMU capability, and in particular if it is not able
to perform SC (Snooping Control), the memory will not be tagged as WB
by KVM, but instead the vCPU will rely on its MTRR/PAT MSRs to find the
appropriate way of interact with specific memory regions.

Because when Vt-d is not involved KVM sets the memory as WB (write-back)
the VMM should set the memory default as WB. That's why this patch sets
the MSR MTRRdefType with the default memory type being WB.

One thing that it is worth noting is that we might have to specifically
create some UC (uncacheable) regions if we see some issues with the
ranges corresponding to the MMIO ranges that should trap.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2019-08-01 20:14:46 +02:00
Rob Bradford
d52684450f tests: Add Ubuntu Bionic version of test_simple_launch
Introduce a new DiskConfig implementation for Ubuntu Bionic with
different cloud init preparation details and use this when testing with
test_simple_launch.

Adjust the memory expectation for downwards as the EFI boot results in a
slightly different memory map. Also enable serial port as Ubuntu does
not support a virtio-console based boot.

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2019-08-01 14:37:04 +02:00
Rob Bradford
facc3b303a tests: Add Bionic to integration test script
Download Bionic image and convert to raw (the QCOW version of the file
doesn't currently work) for running the integration tests.

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2019-08-01 14:37:04 +02:00
Rob Bradford
09aced9ed1 tests: Use logical name for disk paths
Rather than embed the disks in a vector and have integer indicies into
the vector for the different disks instead abstract this through an enum
type used on the DiskConfig trait.

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2019-08-01 14:37:04 +02:00
Rob Bradford
56c4b7000a tests: Refactor integration tests to support different distributions
Extract the network configuration into its own struct and also extract
the prepare_files() and prepare_cloudinit() functions into a struct for
the Clear Linux distribution.

This struct is behind a trait that is used by the Guest implementation
to prepare the files.

This will allow a different implementation to be used for the Ubuntu
disk files.

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2019-08-01 14:37:04 +02:00
Sebastien Boeuf
d18c8d4c8c vfio: pci: Add support for expansion ROM BAR
Relying on the newly added code in the pci crate, the vfio crate can now
properly expose an expansion ROM BAR if the device has one.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2019-07-31 09:28:29 +02:00
Sebastien Boeuf
d217089b54 pci: Add support for expansion ROM BAR
The expansion ROM BAR can be considered like a 32-bit memory BAR with a
slight difference regarding the amount of reserved bits at the beginning
of its 32-bit value. Bit 0 indicates if the BAR is enabled or disabled,
while bits 1-10 are reserved. The remaining upper 21 bits hold the BAR
address.

This commit extends the pci crate in order to support expansion ROM BAR.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2019-07-31 09:28:29 +02:00
Sebastien Boeuf
347f8a036b vfio: pci: Mask multi function device bit
In order to support VFIO for devices supporting multiple functions,
we need to mask the multi-function bit (bit 7 from Header Type byte).
Otherwise, the guest kernel ends up tryng to enumerate those devices.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2019-07-31 09:28:29 +02:00
Sebastien Boeuf
b6ae2ccda4 pci: Disable multiple functions
Every PCI device is exposed as a separate device, on a specific PCI
slot, and we explicitely don't support to expose a device as a multi
function device. For this reason, this patch makes sure the enumeration
of the PCI bus will not find some multi function device.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2019-07-31 09:28:29 +02:00
Rob Bradford
f86b9dd95e scripts: Add Ubuntu cloud-init data
Add cloud-init data for Ubuntu and introduce a convenience script that
can be used to generate cloud-init disk images for manual testing.

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2019-07-30 14:26:26 +02:00
Rob Bradford
be199e5560 tests: Move Clear Linux cloud-init files to subdirectory
And rename the default user from "admin" to "cloud" as the admin user
clashes with a standard user and group name on Ubuntu.

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2019-07-30 14:26:26 +02:00
Sebastien Boeuf
98d7955e34 vm-virtio: Add support for notifying about virtio config update
As per the VIRTIO specification, every virtio device configuration can
be updated while the guest is running. The guest needs to be notified
when this happens, and it can be done in two different ways, depending
on the type of interrupt being used for those devices.

In case the device uses INTx, the allocated IRQ pin is shared between
queues and configuration updates. The way for the guest to differentiate
between an interrupt meant for a virtqueue or meant for a configuration
update is tied to the value of the ISR status field. This field is a
simple 32 bits bitmask where only bit 0 and 1 can be changed, the rest
is reserved.

In case the device uses MSI/MSI-X, the driver should allocate a
dedicated vector for configuration updates. This case is much simpler as
it only requires the device to send the appropriate MSI vector.

The cloud-hypervisor codebase was not supporting the update of a virtio
device configuration. This patch extends the existing VirtioInterrupt
closure to accept a type that can be Config or Queue, so that based on
this type, the closure implementation can make the right choice about
which interrupt pin or vector to trigger.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2019-07-29 15:34:37 +01:00
Samuel Ortiz
93b77530c7 release-notes: Add v0.1.0 notes
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2019-07-25 17:58:33 +02:00
Samuel Ortiz
fa41ddd94f arch: Add a Reserved memory region to the memory hole
We add a Reserved region type at the end of the memory hole to prevent
32-bit devices allocations to overlap with architectural address ranges
like IOAPIC, TSS or APIC ones.

Eventually we should remove that reserved range by allocating all the
architectural ranges before letting 32-bit devices use the memory hole.

Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2019-07-25 11:45:38 +01:00
Samuel Ortiz
299d887856 arch: Add SubRegion memory type
We want to be able to differentiate between memory regions that must be
managed separately from the main address space (e.g. the 32-bit memory
hole) and ones that are reserved (i.e. from which we don't want to allow
the VMM to allocate address ranges.

We are going to use a reserved memory region for restricting the 32-bit
memory hole from expanding beyond the IOAPIC and TSS addresses.

Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2019-07-25 11:45:38 +01:00
Samuel Ortiz
792cc27435 vfio: Propagate the KVM routes setting error
This will trigger a logged error once we have an actual logger.

Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2019-07-25 11:45:38 +01:00
Sebastien Boeuf
421b896ab7 vfio: Don't expose an Interrupt Pin
Since our VFIO code does not support pin based interrupt, but only MSI
and MSI-X, it is cleaner to not expose any Interrupt Pin to the guest by
setting its value to 0.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2019-07-25 11:45:38 +01:00
Sebastien Boeuf
2f802880c0 vfio: Disable the ROM expansion BAR
Until the codebase can properly expose the ROM BAR into the guest, it is
better to disable it for now, returning always 0 when the register is
being read.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2019-07-25 11:45:38 +01:00
Sebastien Boeuf
e18052120a vfio: Fix Memory BAR alignment
The IO BAR alignment was already set to 4 bytes, this patch simply added
a comment for it.

The Memory BAR alignment was also set to the right value, but it was not
explained why 0x1000 was needed, and also why 0x10 could sometimes be
used as correct alignment.
A Memory BAR must be aligned at least on 16 bytes since the first 4 bits
are dedicated to some specific information about the BAR itself. But in
case a BAR is identified as mappable from VFIO, this means our VMM might
memory map it into the VMM address space, and set KVM accordingly using
the ioctl KVM_SET_USER_MEMORY_REGION. In case of KVM, we have to take
into account that it expects addresses to be page aligned, which means
4K in this case.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2019-07-25 11:45:38 +01:00
Sebastien Boeuf
d92d797896 vfio: Update memory slot index to support multiple VFIO devices
In order to correctly support multiple VFIO devices, we need to
increment the memory slot index every time it is being used to set some
user memory region through KVM. That's why the mem_slot parameter is
made mutable.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2019-07-25 11:45:38 +01:00
Sebastien Boeuf
b9f677c46c vmm: Fix the memory slot index
The memory slot index provided to the DeviceManager was wrong since
only the RAM memory regions are set as user memory regions to KVM.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2019-07-25 11:45:38 +01:00
Sebastien Boeuf
b5eab43aa5 vfio: Create a global KVM VFIO device for all VFIO devices
KVM does not support multiple KVM VFIO devices to be created when
trying to support multiple VFIO devices. This commit creates one
global KVM VFIO device being shared with every VFIO device, which
makes possible the support for passing several devices through the
VM.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2019-07-25 11:45:38 +01:00
Sebastien Boeuf
0ff074d2b8 vm-allocator: Fix potential allocation errors
There is one corner case which was not properly handled by the current
code from our AddressAllocator. If both the address start (from the
next range) and the requested region size are already aligned on the
same value as "alignment", when doing the substract of the requested
size + alignment, the returned address is already aligned. The problem
is that we might end up overlapping with an existing range since the
check between the available delta and the requested size does not take
into account a full extra alignment.

By substracting the requested size + alignment - 1 from the address
start of the next range, we ensure this kind of corner case would not
happen since the address won't be naturally aligned and after some
adjustment from the function align_address(), the correct start address
will be returned.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2019-07-25 11:45:38 +01:00
Sebastien Boeuf
927861ced2 pci: Fix end of address space check
The check performed on the end address was wrong since the end address
was actually the address right after the end. To get the right end
address, we need to add (region size - 1) to the start address.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2019-07-25 11:45:38 +01:00
Rob Bradford
1971c94e4e tests: Adjust down entropy expectation
The newer kernel is resulting in entropy being slightly lower than
previously. Adjust the expected entropy downwards.

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2019-07-24 16:26:33 +02:00
Rob Bradford
ebe04f6db9 tests: Use custom kernel for all tests
This should reduce the integration testing time considerably. When a
custom kernel is no longer required we can pull kernel from tarball
again.

Fixes: #100

Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2019-07-24 16:26:33 +02:00
Samuel Ortiz
3cc6f48c31 docs: Add VFIO usage example
Fixes: #117

Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2019-07-24 07:17:03 -07:00
Samuel Ortiz
46eaea1627 README: Fix kernel command line console argument
We use the virtio console device now.

Fixes: #116

Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2019-07-24 07:16:48 -07:00
Rob Bradford
1f6f52249e build: Upload release binary on tag
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
2019-07-24 12:49:35 +01:00
Samuel Ortiz
5ae3144f5b tests: Add VFIO integration test
The VFIO integration test first boots a QEMU guest and then assigns the
QEMU virtio-pci networking device into a nested cloud-hypervisor guest.
We then check that we can ssh into the nested guest and verify that it's
running with the right kernel command line.

Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2019-07-24 11:55:08 +02:00
Samuel Ortiz
4d16ca8ae7 vmm: Support direct device assignment
With the VFIO crate, we can now support directly assigned PCI devices
into cloud-hypervisor guests.

We support assigning multiple host devices, through the --device command
line parameter. This parameter takes the host device sysfs path.

Fixes: #60

Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2019-07-24 11:55:08 +02:00
Chao Peng
b746dd7116 vfio: Map MMIO regions into the guest
VFIO explictly tells us if a MMIO region can be mapped into the guest
address space or not. Except for MSI-X table BARs, we try to map them
into the guest whenever VFIO allows us to do so. This avoids unnecessary
VM exits when the guest tries to access those regions.

Signed-off-by: Zhang, Xiong Y <xiong.y.zhang@intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2019-07-24 11:55:08 +02:00
Sebastien Boeuf
c93d5361b8 vfio: pci: Build the KVM routes
We track all MSI and MSI-X capabilities changes, which allows us to also
track all MSI and MSI-X table changes.

With both pieces of information we can build kvm irq routing tables and
map the physical device MSI/X vectors to the guest ones. Once that
mapping is in place we can toggle the VFIO IRQ API accordingly and
enable disable MSI or MSI-X interrupts, from the physical device up to
the guest.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2019-07-24 11:55:08 +02:00
Sebastien Boeuf
20f0116111 vfio: pci: Track MSI and MSI-X capabilities
In order to properly manage the VFIO device interrupt settings, we need
to keep track of both MSI and MSI-X PCI config capabilities changes.

When the guest programs the device for interrupt delivery, it writes to
the MSI and MSI-X capabilities. This information must be trapped and
cached in order to map the physical device interrupt delivery path to
the guest one. In other words, tracking MSI and MSI-X capabilites will
allow us to accurately build the KVM interrupt routes.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2019-07-24 11:55:08 +02:00
Samuel Ortiz
db5b4763c2 vfio: Initial PCI support
This brings the initial PCI support to the VFIO crate.
The VfioPciDevice is the main structure and holds an inner VfioDevice.

VfioPciDevice implements the PCI trait, leaving the IRQ assignments
empty as this will be driven by both the guest and the VFIO PCI device,
not by the VMM.

As we must trap BAR programming from the guest (We don't want to program
the actual device with guest addresses), we use our local PCI
configuration cache to read and write BARs.

Signed-off-by: Zhang, Xiong Y <xiong.y.zhang@intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2019-07-24 11:55:08 +02:00
Samuel Ortiz
2cec3aad7f vfio: VFIO API wrappers and helpers
The Virtual Function I/O (VFIO) kernel subsystem exposes a vast and
relatively complex userspace API. This commit abstracts and simplifies
this API into both an internal and external API.

The external API is to be consumed by VFIO device implementation through
the VfioDevice structure. A VfioDevice instance can:

- Enable and disable all interrupts (INTX, MSI and MSI-X) on the
  underlying VFIO device.
- Read and write all of the VFIO device memory regions.
- Set the system's IOMMU tables for the underlying device.

Signed-off-by: Zhang, Xiong Y <xiong.y.zhang@intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2019-07-24 11:55:08 +02:00
Samuel Ortiz
5372554ed4 vfio-bindings: Initial commit
The default bindings are generated from the 5.0.0 Linux userspace API.

Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2019-07-24 11:55:08 +02:00
Samuel Ortiz
4e48309660 vm: Factorize all virtio devices creation routines
Our DeviceManager::new() routine is reaching north of 250 lines.
For simplicity and readbility sake, extract all virtio devices creation
code into their own routines.

Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2019-07-23 08:41:37 +01:00
fazlamehrab
8ba54af71d vm-virtio: Add integration test for virtio console device
Two integration tests are added for testing the implemented virtio
console device for single port operation. One checks the presence
and the simple stdout operation. The other test checks the stdout
on file (option: file) using virtio console.

Signed-off-by: A K M Fazla Mehrab <fazla.mehrab.akm@intel.com>
2019-07-22 23:08:56 +01:00
fazlamehrab
24438e0390 vm-virtio: Enable the vmm support for virtio-console
To use the implemented virtio console device, the users can select one
of the three options ("off", "tty" or "file=/path/to/the/file") with
the command line argument "--console". By default, the console is
enabled as a device named "hvc0" (option: tty). When "off" option is
used, the console device is not added to the VM configuration at all.

Signed-off-by: A K M Fazla Mehrab <fazla.mehrab.akm@intel.com>
2019-07-22 23:08:56 +01:00
fazlamehrab
577d44c8eb vm-virtio: Add virtio console device for single port operation
The virtio console device is a console for the communication between
the host and guest userspace. It has two parts: the device and the
driver. The console device is implemented here as a virtio-pci device
to the guest. On the other side, the guest OS expected to have a
character device driver which provides an interface to the userspace
applications.

The console device can have multiple ports where each port has one
transmit queue and one receive queue. The current implementation only
supports one port. For data IO communication, one or more empty
buffers are placed in the receive queue for incoming data, and
outgoing characters are placed in the transmit queue. Details spec
can be found from the following link.

https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.pdf#e7

Apart from the console, for the communication between guest and host,
the Cloud Hypervisor has a legacy serial device implemented. However,
the implementation of a console device lets us be independent of legacy
pin-based interrupts without losing the logs and access to the VM.

Signed-off-by: A K M Fazla Mehrab <fazla.mehrab.akm@intel.com>
2019-07-22 23:08:56 +01:00
Sebastien Boeuf
f98a69f42e vm-allocator: Introduce an MMIO hole address allocator
With this new AddressAllocator as part of the SystemAllocator, the
VMM can now decide with finer granularity where to place memory.

By allocating the RAM and the hole into the MMIO address space, we
ensure that no memory will be allocated by accident where the RAM or
where the hole is.
And by creating the new MMIO hole address space, we create a subset
of the entire MMIO address space where we can place 32 bits BARs for
example.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2019-07-22 09:51:16 -07:00
Sebastien Boeuf
a761b820c7 vm-allocator: Fix the aligned address check
The requested address for a range can be the base of the entire
address space, this is a valid use case.

In particular, when creating an MMIO address space of 0-64GiB, we
might want to create a range of 0-1GiB if the RAM of our VM is 1G.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2019-07-22 09:51:16 -07:00
Sebastien Boeuf
709148803e vm-allocator: Fix free range allocation
This patch fixes the function first_available_range() responsible
for finding the first range that could fit the requested size.

The algorithm was working, that is allocating ranges from the end
of the address space because we created an empty region right at the
end. But the problem is, the VMM might request for some specific
allocations at fixed address to allocate the RAM for example. In this
case, the RAM range could be 0-1GiB, which means with the previous
algorithm, the new available range would have been found right after
1GiB.

This is not the intended behavior, and that's why the algorithm has
been fixed by this patch, making sure to walk down existing ranges
starting from the end.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2019-07-22 09:51:16 -07:00