Now that we rely on pop_descriptor_chain() rather than iter() to iterate
over a queue, there's no more borrow on the queue itself, meaning we can
invoke add_used() directly for the iteration loop. This simplifies the
processing of the queues for each virtio device, and bring some possible
performance improvement given we don't have to iterate twice over the
list of descriptors to invoke add_used().
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Using pop_descriptor_chain() is much more appropriate than iter() since
it recreates the iterator every time, avoiding the queue to be borrowed
and allowing the virtio-net implementation to match all the other ones.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The new virtio-queue version introduced some breaking changes which need
to be addressed so that Cloud Hypervisor can still work with this
version.
The most important change is about removing a handle to the guest memory
from the Queue, meaning the caller has to provide the guest memory
handle for multiple methods from the QueueT trait.
One interesting aspect is that QueueT has been widely extended to
provide every getter and setter we need to access and update the Queue
structure without having direct access to its internal fields.
This patch ports all the virtio and vhost-user devices to this new crate
definition. It also updates both vhost-user-block and vhost-user-net
backends based on the updated vhost-user-backend crate. It also updates
the fuzz directory.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Whenever a virtio reset happens, the vhost-user backend should be
notified that the vring should be stopped. This is performed by calling
GET_VRING_BASE on the appropriate queue indexes.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Rather than relying on the amount of queues to enable or disable the
queue that have been activated, we rely on the actual queue indexes
provided through the tuple including the queue index, the Queue and the
EventFd. By storing the list of indexes, we simplify the code and also
make it more accurate in case some queues aren't activated.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Instead of passing separately a list of Queues and the equivalent list
of EventFds, we consolidate these two through a tuple along with the
queue index.
The queue index can be very useful if looking for the actual index
related to the queue, no matter if other queues have been enabled or
not.
It's also convenient to have the EventFd associated with the Queue so
that we don't have to carry two lists with the same amount of items.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
When preparing the activator, we must provide the correct queue index to
clone the right EventFd associated with the queue.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
It's not mandatory for the virtio-fs driver to enable all virtqueues
provided by the backend since all it needs is one request queue to work
correctly. Therefore we lower the minimal amount of enabled queues to 1.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
And along with virtio-queue, we must also bump vhost-user-backend from
0.3.0 to 0.5.0 (since it relies on virtio-queue 0.4.0).
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The vhost-user backend was always provided the maximum queue size but
this is incorrect. Instead it must be informed of the actual queue size
that has been negotiated with the virtio driver running in the guest.
This ensures proper functioning of vhost-user-block with the Rust
Hypervisor Firmware, which uses a hardcoded queue size of 16.
Partially fixes#4285
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The latest vhost-user specification describes VHOST_USER_RESET_OWNER
command as deprecated with the following explanation:
This is no longer used. Used to be sent to request disabling all
rings, but some back-ends interpreted it to also discard connection
state (this interpretation would lead to bugs). It is recommended that
back-ends either ignore this message, or use it to disable all rings.
Also, it's been observed that when using either Rust Hypervisor Firmware
or EDK2 OVMF firmware with SPDK (using the block device as the boot
disk), the virtio reset that happens when the firmware no longer needs
to access the block device caused a failure by triggering the command
VHOST_USER_RESET_OWNER.
For all these reasons, this patch simplifies the virtio reset
implementation by simply disabling the virtqueues and no longer calling
into VHOST_USER_RESET_OWNER.
Partially fixes#4285
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This check is new in the beta version of clippy and exists to avoid
potential deadlocks by highlighting when the test in an if or for loop
is something that holds a lock. In many cases we would need to make
significant refactorings to be able to pass this check so disable in the
affected crates.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
warning: accessing first element with `data.get(0)`
--> virtio-devices/src/transport/pci_device.rs:1055:34
|
1055 | if let Some(v) = data.get(0) {
| ^^^^^^^^^^^ help: try: `data.first()`
|
= note: `#[warn(clippy::get_first)]` on by default
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#get_first
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
warning: you are deriving `PartialEq` and can implement `Eq`
--> vmm/src/serial_manager.rs:59:30
|
59 | #[derive(Debug, Clone, Copy, PartialEq)]
| ^^^^^^^^^ help: consider deriving `Eq` as well: `PartialEq, Eq`
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#derive_partial_eq_without_eq
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
In order to ensure that the virtio device thread is spawned from the vmm
thread we use an asynchronous activation mechanism for the virtio
devices. This change optimises that code so that we do not need to
iterate through all virtio devices on the platform in order to find the
one that requires activation. We solve this by creating a separate short
lived VirtioPciDeviceActivator that holds the required state for the
activation (e.g. the clones of the queues) this can then be stored onto
the device manager ready for asynchronous activation.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This reverts commit f160572f9d.
There has been increased flakiness around the live migration tests since
this was merged. Speculatively reverting to see if there is increased
stability.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
In order to ensure that the virtio device thread is spawned from the vmm
thread we use an asynchronous activation mechanism for the virtio
devices. This change optimises that code so that we do not need to
iterate through all virtio devices on the platform in order to find the
one that requires activation. We solve this by creating a separate short
lived VirtioPciDeviceActivator that holds the required state for the
activation (e.g. the clones of the queues) this can then be stored onto
the device manager ready for asynchronous activation.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Latest cargo beta version raises warnings about unused macro rules.
Simply remove them to fix the beta build.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
There is no need to include serde_derive separately,
as it can be specified as serde feature instead.
Signed-off-by: Maksym Pavlenko <pavlenko.maksym@gmail.com>
Rely on the newly added helper from vm-virtio crate to keep cloning the
list of Queue structures.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Improve the request parsing/handling code by allowing an error status to
be returned back to the guest driver before we return an error
internally.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Extend the Domain structure to store the information about each domain
being in bypass mode or not. Based on this new information, the address
translation of the virtio devices is performed according to the bypass
mode of each domain. And both MAP/UNMAP requests are generating errors
in case the domain has been previously set to bypass mode.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In anticipation for associating more than mappings with a domain, we
factorize the list of mappings associated with a domain behind a
dedicated Domain structure. We also update the field name so that it
reads better in the code.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Exposing the VIRTIO_IOMMU_F_BYPASS_CONFIG feature to the guest, which
allows to update the bypass global knob through virtio configuration.
Based on the value of this global knob, the address translations for
endpoints that have not been added to a domain is allowed with a simple
identity mapping.
By default, we enable the bypass mode for all endpoints that are not
attached to any domain.
Fixes#3987
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Based on the VIRTIO specification, we must be able to support multiple
endpoints per domain. This is fixed along with the introduction of some
simplification regarding how we can retrieve the external mapping
directly based on the endpoint.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
If the guest has not activated the virtio-mem device then reject an
attempt to resize using it.
Fixes: #4001
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Instead of defining some very generic resources as PioAddressRange or
MmioAddressRange for each PCI BAR, let's move to the new Resource type
PciBar in order to make things clearer. This allows the code for being
more readable, but also removes the need for hard assumptions about the
MMIO and PIO ranges. PioAddressRange and MmioAddressRange types can be
used to describe everything except PCI BARs. BARs are very special as
they can be relocated and have special information we want to carry
along with them.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In order to make the code more consistent and easier to read, we remove
the former tuple that was used to describe a BAR, replacing it with the
existing structure PciBarConfiguration.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The code was quite unclear regarding the type of index that was being
used regarding a BAR. This is improved by differenciating register
indexes and BAR indexes more clearly.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
By adding a new method id() to the PciDevice trait, we allow the caller
to retrieve a unique identifier. This is used in the context of BAR
relocation to identify the device being relocated, so that we can update
the DeviceTree resources for all PCI devices (and not only
VirtioPciDevice).
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Updating the way of restoring BAR addresses for virtio-pci by providing
a more generic approach that will be reused for other PciDevice
implementations (i.e VfioPcidevice and VfioUserPciDevice).
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Rust 2021 edition has a few improvements over the 2018 edition. Migrate
the project to 2021 edition by following recommended migration steps.
Luckily, the code itself doesn't require fixing.
Bump MSRV to 1.56 as it is required by the 2021 edition. Also fix the
clap build dependency to make Cloud Hypervisor build again.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
It doesn't matter if we're trying to translate a GVA or a GPA address,
but in both cases we must error out if the address couldn't be
translated.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Whenever a virtio device is placed behind a vIOMMU, we have some code in
pci_common_config.rs to translate the queue addresses (descriptor table,
available ring and used ring) from GVA to GPA, so that they can be used
correctly.
But in case of vDPA, we also need to provide the queue addresses to the
vhost backend. And since the vhost backend deals with consistent IOVAs,
all addresses being provided should be GVAs if the device is placed
being a vIOMMU. For that reason, we perform a translation of the queue
addresses back from GPA to GVA if necessary, and only to be provided to
the vhost backend.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In case an external mapping would have been added after the virtio-iommu
device has been activated, it would have simply be ignored because the
code wasn't using a shared object between the vmm thread and the iommu
thread. This behavior is only triggered on the hotplug codepath, and
only if the hotplugged device is placed behind the virtual IOMMU.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In anticipation for the vDPA need to translate a GPA back into a GVA, we
extend the existing trait DmaRemapping and AccessPlatform to perform
such operation.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Renaming translate() to translate_gva() to clarify we want to translate
a GVA address into a GPA.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
By enabling the VIRTIO feature VIRTIO_F_IOMMU_PLATFORM for all
vhost-user devices when needed, we force the guest to use the DMA API,
making these devices compatible with TDX. By using DMA API, the guest
triggers the TDX codepath to share some of the guest memory, in
particular the virtqueues and associated buffers so that the VMM and
vhost-user backends/processes can access this memory.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Now that we rely on vhost v0.4.0, which contains the fix for
get_iova_range(), we don't need the workaround anymore, and we can
actually call into the dedicated function.
Fixes#3861
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Rely on newly released versions of the vhost and vhost-user-backend
crates from rust-vmm.
The new vhost version includes the fixes needed for vDPA.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The file descriptor provided to fs_slave_map() and fs_slave_io() is
passed as a AsRawFd trait, meaning the caller owns it. For that reason,
there's no need for these functions to close the file descriptor as it
will be closed later on anyway.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
vDPA is a kernel framework introduced fairly recently in order to handle
devices complying with virtio specification on their datapath, while the
control path is vendor specific. For the datapath, that means the
virtqueues are handled through DMA directly between the hardware and the
guest, while the control path goes through the vDPA framework,
eventually exposed through a vhost-vdpa device.
vDPA, like VFIO, aims at achieving baremetal performance for devices
that are passed into a VM. But unlike VFIO, it provides a simpler/better
framework for achieving migration. Because the DMA accesses between the
device and the guest are going through virtio queues, migration can be
achieved way more easily, and doesn't require each device driver to
implement the migration support. In the VFIO case, each vendor is
expected to provide an implementation of the VFIO migration framework,
which makes things harder as it must be done for each and every device.
So to summarize the point is to support migration for hardware devices
through which we can achieve baremetal performances.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Given that some virtio device might need some DMA handling, we provide a
way to store this through the VirtioPciDevice layer, so that it can be
accessed when the PCI device is removed.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
With the VIRTIO_F_EVENT_IDX handling now conducted inside the
virtio-queue crate it is necessary to activate the functionality on
every queue if it is negotiatated. Otherwise this leads to a failure of
the guest to signal to the host that there is something in the available
queue as the queue's internal state has not been configured correctly.
Fixes: #3829
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Move to release version v0.2.0 for both vm-virtio and vhost-user-backend
crates rather than relying on their main branch, as they might be
subject to breaking changes.
Fixes#3800
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
After writing to an address, Windows 11 on ARM64 unconditionally reads
it back. It is harmless. Drop the error message to avoid spamming.
Fixes: #3732
Signed-off-by: Wei Liu <liuwe@microsoft.com>
error: writing `&mut Vec` instead of `&mut [_]` involves a new object
where a slice will do
--> virtio-devices/src/transport/pci_common_config.rs:93:17
|
93 | queues: &mut
Vec<Queue<GuestMemoryAtomic<GuestMemoryMmap>>>,
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
help: change this to: `&mut [Queue<GuestMemoryAtomic<GuestMemoryMmap>>]`
|
= note: `-D clippy::ptr-arg` implied by `-D warnings`
= help: for further information visit
https://rust-lang.github.io/rust-clippy/master/index.html#ptr_arg
Signed-off-by: Akira Moroo <retrage01@gmail.com>
For vhost-user devices, we don't want to loose the vhost-user protocol
feature through the negotiation between guest and device. Since we know
VIRTIO has no knowledge of the vhost-user protocol feature, there is no
way it would ever be acknowledged by the guest. For that reason, we
create each vhost-user device with the set of acked features containing
the vhost-user protocol feature is this one was part of the available
list.
Having the set of acked features containing this bit allows for solving
a bug that was happening through the migration process since the
vhost-user protocol feature wasn't explicitely enabled.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Implement the VIRTIO_BALLOON_F_REPORTING feature, indicating to the
guest it can report set of free pages. A new virtqueue dedicated for
receiving the information about the free pages is created. The VMM
releases the memory by punching holes with fallocate() if the guest
memory is backed by a file, and madvise() the host about the ranges of
memory that shouldn't be needed anymore.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Adding a new parameter free_page_reporting=on|off to the balloon device
so that we can enable the corresponding feature from virtio-balloon.
Running a VM with a balloon device where this feature is enabled allows
the guest to report pages that are free from guest's perspective. This
information is used by the VMM to release the corresponding pages on the
host.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Improving the existing code for better readability and in anticipation
for adding an additional virtqueue for the free page reporting feature.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This should not occur as ioeventfd is used for notification. Such an
error message would have made the discovery of the underlying cause of
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
In order to clearly decouple when the migration is started compared to
when the dirty logging is started, we introduce a new method to the
Migratable trait. This clarifies the semantics as we don't end up using
start_dirty_log() for identifying when the migration has been started.
And similarly, we rely on the already existing complete_migration()
method to know when the migration has been ended.
A bug was reported when running a local migration with a vhost-user-net
device in server mode. The reason was because the migration_started
variable was never set to "true", since the start_dirty_log() function
was never invoked.
Signed-off-by: lizhaoxin1 <Lxiaoyouling@163.com>
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Now that all the preliminary work has been merged to make Cloud
Hypervisor work with the upstream crate virtio-queue from
rust-vmm/vm-virtio repository, we can move the whole codebase and remove
the local copy of the virtio-queue crate.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This new trait simplifies the address translation of a GuestAddress by
having GuestAddress implementing it.
The three crates virtio-devices, block_util and net_util have been
updated accordingly to rely on this new trait, helping with code
readability and limiting the amount of duplicated code.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Moving the whole codebase to rely on the AccessPlatform definition from
vm-virtio so that we can fully remove it from virtio-queue crate.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Moving away from the virtio-queue mechanism for descriptor address
translation. Instead, we enable the new mechanism added to every
VirtioDevice implementation, by setting the AccessPlatform trait if one
can be found.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Since we're trying to move away from the translation happening in the
virtio-queue crate, the device itself is performing the address
translation when needed.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Since we're trying to move away from the translation happening in the
virtio-queue crate, the device itself is performing the address
translation when needed.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Since we're trying to move away from the translation happening in the
virtio-queue crate, the device itself is performing the address
translation when needed.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Since we're trying to move away from the translation happening in the
virtio-queue crate, the device itself is performing the address
translation when needed.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Since we're trying to move away from the translation happening in the
virtio-queue crate, the device itself is performing the address
translation when needed.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Since we're trying to move away from the translation happening in the
virtio-queue crate, the device itself is performing the address
translation when needed.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Add a new method set_access_platform() to the VirtioDevice trait in
order to allow an AccessPlatform trait to be setup on any virtio device.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Upon the enablement of the queue by the guest, we perform a translation
of the descriptor table, the available ring and used ring addresses
prior to enabling the device itself. This only applies to the case where
the device is placed behind a vIOMMU, which is the reason why the
translation is needed. Indeed, the addresses allocated by the guest are
IOVAs which must be translated into GPAs before we can access the queue.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Instead of relying on the virtio-queue crate to store the information
about the MSI-X vectors for each queue, we handle this directly from the
PCI transport layer.
This is the first step in getting closer to the upstream version of
virtio-queue so that we can eventually move fully to the upstream
version.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
When freeing memory sometimes glibc will attempt to read
"/proc/sys/vm/overcommit_memory" to find out how it should release the
blocks. This happens sporadically with Cloud Hypervisor but has been
seen in use. It is not necessary to add the read() syscall to the list
as it is already included in the virtio devices common set. Similarly
the vCPU and vmm threads already have both these in the allowed list.
Fixes: #3609
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Whenever the backing file of our virtio-block device is opened with
O_DIRECT, there's a requirement about the buffer address and size to be
aligned to the sector size.
We know virtio-block requests are sector aligned in terms of size, but
we must still check if the buffer address is. In case it's not, we
create an intermediate buffer that will be passed through the system
call. In case of a write operation, the content of the non-aligned
buffer must be copied beforehand, and in case of a read operation, the
content of the aligned buffer must be copied to the non-aligned one
after the operation has been completed.
Fixes#3587
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This crate contains up to date definition of the Queue, AvailIter,
DescriptorChain and Descriptor structures forked from the upstream
crate rust-vmm/vm-virtio 27b18af01ee2d9564626e084a758a2b496d2c618.
The following patches have been applied on top of this base in order to
make it work correctly with Cloud Hypervisor requirements:
- Add MSI vector field to the Queue
In order to help with MSI/MSI-X support, it is convenient to store the
value of the interrupt vector inside the Queue directly.
- Handle address translations
For devices with access to data in memory being translated, we add to
the Queue the ability to translate the address stored in the
descriptor.
It is very helpful as it performs the translation right after the
untranslated address is read from memory, avoiding any errors from
happening from the consumer's crate perspective. It also allows the
consumer to reduce greatly the amount of duplicated code for applying
the translation in many different places.
- Add helpers for Queue structure
They are meant to help crate's consumers getting/setting information
about the Queue.
These patches can be found on the 'ch' branch from the Cloud Hypervisor
fork: https://github.com/cloud-hypervisor/vm-virtio.git
This patch takes care of updating the Cloud Hypervisor code in
virtio-devices and vm-virtio to build correctly with the latest version
of virtio-queue.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
When forwarding an epoll event from the unix muxer to the
targeted connection event handler, the eventset the connection
registered is forwarded instead of the actual epoll
operation (IN/OUT).
For example, if the connection was registered for EPOLLIN,
and receives an EPOLLOUT, the connection will actually handle
an EPOLLOUT.
This is the root cause of previous bug, which caused the
introduction of some workarounds (i.e: handling ewouldblock
when reading after receiving EPOLLIN, which should never happen).
When matching the connection, we retrieve and use the evset of
the connection instead of the one passed as a parameter.
The compiler does not complain for an unused variable because
it was first logged in a debug! statement.
This is an unfortunate naming mistake that caused a lot of problems.
Fixes#3497
Signed-off-by: Eduard Kyvenko <eduard.kyvenko@gmail.com>
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
If the disk is backed by a block device on the host a non-default
topology will be available and that topology can be advertised by virtio
block.
Fixes: #3262
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This reverts commit 58d25b3ccc.
This change introduced a regression when running iperf with the guest
running as the server:
marvin:~/src/cloud-hypervisor ((58d25b3c...))$ iperf -c 192.168.249.2
------------------------------------------------------------
Client connecting to 192.168.249.2, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[ 1] local 192.168.249.1 port 47078 connected with 192.168.249.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 1] 0.00-10.40 sec 14.0 MBytes 11.3 Mbits/sec
marvin:~/src/cloud-hypervisor ((58d25b3c...))$ iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 128 KByte (default)
------------------------------------------------------------
[ 1] local 192.168.249.1 port 5001 connected with 192.168.249.2 port 42866
[ ID] Interval Transfer Bandwidth
[ 1] 0.00-10.01 sec 51.2 GBytes 44.0 Gbits/sec
Fixes: #3450
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Because anyhow version 1.0.46 has been yanked, let's move back to the
previous version 1.0.45.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
By merging receive buffers through the VIRTIO_NET_F_MRG_RXBUF feature,
as well as enabling the use of indirect descriptors through
VIRTIO_RING_F_INDIRECT_DESC feature, we achieve better throughput for
the virtio-net device without hurting its latency.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Since each segment must have a non-overlapping memory range associated
with it the device memory must be equally divided amongst all segments.
A new allocator is used for each segment to ensure that BARs are
allocated from the correct address ranges. This requires changes to
PciDevice::allocate/free_bars to take that allocator and when
reallocating BARs the correct allocator must be identified from the
ranges.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Move the decision on whether to use a 64-bit bar up to the DeviceManager
so that it can use both the device type (e.g. block) and the PCI segment
ID to decide what size bar should be used.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Relying on the vm-virtio/virtio-queue crate from rust-vmm which has been
copied inside the Cloud Hypervisor tree, the entire codebase is moved to
the new definition of a Queue and other related structures.
The reason for this move is to follow the upstream until we get some
agreement for the patches that we need on top of that to make it
properly work with Cloud Hypervisor.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This was causing some issues because of the use of 2 different versions
for the vm-memmory crate. We'll wait for all dependencies to be properly
resolved before we move to 0.7.0.
This reverts commit 76b6c62d07.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
These structs are not read on the VMM side but are used in communication
with the guest.
As identified by the new beta clippy.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
There's no need to patch the vhost crate anymore since the fixes we were
looking for have been released as part of 0.2.0 on crates.io.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Setting the reply_ack should depend on the set of acknowledged features
containing the REPLY_ACK flag.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In order to support correctly the snapshot/restore and migration use
cases, we must be careful with the ranges that we discard by punching
holes. On restore, there might be some ranges already plugged in,
meaning they should not be discarded. That's why we loop over the list
of blocks to discard only the ranges that are marked as unplugged.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
By creating the BlocksState object in the MemoryManager, we can directly
provide it to the virtio-mem device when being created. This will allow
the MemoryManager through each VirtioMemZone to have a handle onto the
blocks that are plugged at any point in time.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This is going to be useful to let virtio-mem report the list of ranges
that are currently plugged, so that both snapshot/restore and migration
will copy only what is needed.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This will be helpful to support the creation of a MemoryRangeTable from
virtio-mem, as it uses 2M pages.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Adding the snapshot/restore support along with migration as well,
allowing a VM with virtio-mem devices attached to be properly
migrated.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Implement the infrastructure that lets a virtio-mem device map the guest
memory into the device. This is necessary since with virtio-mem zones
memory can be added or removed and the vfio-user device must be
informed.
Fixes: #3025
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
For vfio-user the mapping handler is per device and needs to be removed
when the device in unplugged.
For VFIO the mapping handler is for the default VFIO container (used
when no vIOMMU is used - using a vIOMMU does not require mappings with
virtio-mem)
To represent these two use cases use an enum for the handlers that are
stored.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Adding the snapshot/restore support along with migration as well,
allowing a VM with a virtio-balloon device attached to be properly
migrated.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The MSI IOVA address on X86 and AArch64 is different.
This commit refactored the code to receive the MSI IOVA address and size
from device_manager, which provides the actual IOVA space data for both
architectures.
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
For most use cases, there is no need to create multiple VFIO containers
as it causes unwanted behaviors. Especially when passing multiple
devices from the same IOMMU group, we need to use the same container so
that it can properly list the groups that have been already opened. The
correct logic was already there in vfio-ioctls, but it was incorrectly
used from our VMM implementation.
For the special case where we put a VFIO device behind a vIOMMU, we must
create one container per device, as we need to control the DMA mappings
per device, which is performed at the container level. Because we must
keep one container per device, the vIOMMU use case prevents multiple
devices attached to the same IOMMU group to be passed through the VM.
But this is a limitation that we are fine with, especially since the
vIOMMU doesn't let us group multiple devices in the same group from a
guest perspective.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
When a pty is resized (using the TIOCSWINSZ ioctl -- see ioctl_tty(2)),
the kernel will send a SIGWINCH signal to the pty's foreground process
group to notify it of the resize. This is the only way to be notified
by the kernel of a pty resize.
We can't just make the cloud-hypervisor process's process group the
foreground process group though, because a process can only set the
foreground process group of its controlling terminal, and
cloud-hypervisor's controlling terminal will often be the terminal the
user is running it in. To work around this, we fork a subprocess in a
new process group, and set its process group to be the foreground
process group of the pty. The subprocess additionally must be running
in a new session so that it can have a different controlling
terminal. This subprocess writes a byte to a pipe every time the pty
is resized, and the virtio-console device can listen for this in its
epoll loop.
Alternatives I considered were to have the subprocess just send
SIGWINCH to its parent, and to use an eventfd instead of a pipe.
I decided against the signal approach because re-purposing a signal
that has a very specific meaning (even if this use was only slightly
different to its normal meaning) felt unclean, and because it would
have required using pidfds to avoid race conditions if
cloud-hypervisor had terminated, which added complexity. I decided
against using an eventfd because using a pipe instead allows the child
to be notified (via poll(2)) when nothing is reading from the pipe any
more, meaning it can be reliably notified of parent death and
terminate itself immediately.
I used clone3(2) instead of fork(2) because without
CLONE_CLEAR_SIGHAND the subprocess would inherit signal-hook's signal
handlers, and there's no other straightforward way to restore all signal
handlers to their defaults in the child process. The only way to do
it would be to iterate through all possible signals, or maintain a
global list of monitored signals ourselves (vmm:vm::HANDLED_SIGNALS is
insufficient because it doesn't take into account e.g. the SIGSYS
signal handler that catches seccomp violations).
Signed-off-by: Alyssa Ross <hi@alyssa.is>
This prepares us to be able to handle console resizes in the console
device's epoll loop, which we'll have to do if the output is a pty,
since we won't get SIGWINCH from it.
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Musl often uses mmap to allocate memory where Glibc would use brk.
This has caused seccomp violations for me on the API and signal
handling threads.
Signed-off-by: Alyssa Ross <hi@alyssa.is>
As well as reducing the amount of code this also improves the binary
size slightly:
cargo bloat --release -n 2000 --bin cloud-hypervisor | grep virtio_devices::seccomp_filters::get_seccomp_rules
Before:
0.1% 0.2% 7.8KiB virtio_devices virtio_devices::seccomp_filters::get_seccomp_rules
After:
0.0% 0.1% 3.0KiB virtio_devices virtio_devices::seccomp_filters::get_seccomp_rules
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Introduce a common solution for spawning the virtio threads which will
make it easier to add the panic handling.
During this effort I discovered that there were no seccomp filters
registered for the vhost-user-net thread nor the vhost-user-block
thread. This change also incorporates basic seccomp filters for those as
part of the refactoring.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Move the processing of the input from stdin, PTY or file from the VMM
thread to the existing virtio-console thread. The handling of the resize
of a virtio-console has not changed but the name of the struct used to
support that has been renamed to reflect its usage.
Fixes: #3060
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
vhdx_sync.rs in block_util implements traits to represent the vhdx
crate as a supported block device in the cloud hypervisor. The vhdx
is added to the block device list in device_manager.rs at the vmm
crate so that it can automatically detect a vhdx disk and invoke the
corresponding crate.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Signed-off-by: Fazla Mehrab <akm.fazla.mehrab@intel.com>
We are relying on applying empty 'seccomp' filters to support the
'--seccomp false' option, which will be treated as an error with the
updated 'seccompiler' crate. This patch fixes this issue by explicitly
checking whether the 'seccomp' filter is empty before applying the
filter.
Signed-off-by: Bo Chen <chen.bo@intel.com>
Introducing a new structure VhostUserCommon allowing to factorize a lot
of the code shared between the vhost-user devices (block, fs and net).
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
We cannot let vhost-user devices connect to the backend when the Block,
Fs or Net object is being created during a restore/migration. The reason
is we can't have two VMs (source and destination) connected to the same
backend at the same time. That's why we must delay the connection with
the vhost-user backend until the restoration is performed.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Introducing a new function to factorize a small part of the
initialization that is shared between a full reinitialization and a
restoration.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In order to prevent the vhost-user devices from reconnecting to the
backend after the migration has been successfully performed, we make
sure to kill the thread in charge of handling the reconnection
mechanism.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
During a migration, the vhost-user device talks to the backend to
retrieve the dirty pages. Once done with this, a snapshot will be taken,
meaning there's no need to communicate with the backend anymore. Closing
the communication is needed to let the destination VM being able to
connect to the same backend.
That's why we shutdown the communication with the backend in case a
migration has been started and we're asked for a snapshot.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This anticipates the need for creating a new Blk, Fs or Net object
without having performed the connection with the vhost-user backend yet.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
It was incorrect to call Vec::from_raw_parts() on the address pointing
to the shared memory log region since Vec is a Rust specific structure
that doesn't directly translate into bytes. That's why we use the same
function from std::slice in order to create a proper slice out of the
memory region, which is then copied into a Vec.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Now that the common vhost-user code can handle logging dirty pages
through shared memory, we need to advertise it to the vhost-user
backends with the protocol feature VHOST_USER_PROTOCOL_F_LOG_SHMFD.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Allow vsocks to connect to Unix sockets on the host running
cloud-hypervisor with enabled seccomp.
Reported-by: Philippe Schaaf <philippe.schaaf@secunet.com>
Tested-by: Franz Girlich <franz.girlich@tu-ilmenau.de>
Signed-off-by: Markus Theil <markus.theil@tu-ilmenau.de>
This doesn't really affect the build as we ship a Cargo.lock with fixed
versions in. However for clarity it makes sense to use fixed versions
throughout and let dependabot update them.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Adding the common vhost-user code for starting logging dirty pages when
the migration is started, and its counterpart for stopping, as well as
the code in charge of retrieving the bitmap of the dirty pages that have
been logged.
All these functions are meant to be leveraged from vhost-user devices.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Adding a simple field `migration_support` to VhostUserHandle in order to
store the information about the device supporting migration or not. The
value of this flag depends on the feature set negotiated with the
backend. It's considered as supporting migration if VHOST_F_LOG_ALL is
present in the virtio features and if VHOST_USER_PROTOCOL_F_LOG_SHMFD is
present in the vhost-user protocol features.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
backend like SPDK required to know how many virt queues to be handled
before gets VHOST_USER_SET_INFLIGHT_FD message.
fix dpdk core dump while processing vhost_user_set_inflight_fd:
#0 0x00007fffef47c347 in vhost_user_set_inflight_fd (pdev=0x7fffe2895998, msg=0x7fffe28956f0, main_fd=545) at ../lib/librte_vhost/vhost_user.c:1570
#1 0x00007fffef47e7b9 in vhost_user_msg_handler (vid=0, fd=545) at ../lib/librte_vhost/vhost_user.c:2735
#2 0x00007fffef46bac0 in vhost_user_read_cb (connfd=545, dat=0x7fffdc0008c0, remove=0x7fffe2895a64) at ../lib/librte_vhost/socket.c:309
#3 0x00007fffef45b3f6 in fdset_event_dispatch (arg=0x7fffef6dc2e0 <vhost_user+8192>) at ../lib/librte_vhost/fd_man.c:286
#4 0x00007ffff09926f3 in rte_thread_init (arg=0x15ee180) at ../lib/librte_eal/common/eal_common_thread.c:175
Signed-off-by: Arafatms <arafatms@outlook.com>
This patch adds all the seccomp rules missing for MSHV.
With this patch MSFT internal CI runs with seccomp enabled.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
The vhost-user-net backend needs to prepare all queues before enabling vring.
For example, DPVGW will report 'the RX queue can't find' error, if we enable
vring immediately after kicking it out.
Signed-off-by: Arafatms <arafatms@outlook.com>
Adding the support for snapshot/restore feature for all supported
vhost-user devices.
The complexity of vhost-user-fs device makes it only partially
compatible with the feature. When using the DAX feature, there's no way
to store and remap what was previously mapped in the DAX region. And
when not using the cache region, if the filesystem is mounted, it fails
to be properly restored as this would require a special command to let
the backend know that it must remount what was already mounted before.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This patch moves all vhost-user common functions behind a new structure
VhostUserHandle. There is no functional changes intended, the only goal
being to prepare for storing information through this new structure,
limiting the amount of parameters that are needed for each function.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
With the new beta version, clippy complains about redundant allocation
when using Arc<Box<dyn T>>, and suggests replacing it simply with
Arc<dyn T>.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
There are some seccomp rules needed for MSHV
in virtio-devices but not for KVM. We only want to
add those rules based on MSHV feature guard.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
The cloud hypervisor tells the VM and the backend to support the PACKED_RING feature,
but it actually processes various variables according to the split ring logic, such
as last_avail_index. Eventually it will cause the following error (SPDK as an example):
vhost.c: 516:vhost_vq_packed_ring_enqueue: *ERROR*: descriptor has been used before
vhost_blk.c: 596:process_blk_task: *ERROR*: ====== Task 0x200113784640 req_idx 0 failed ======
vhost.c: 629:vhost_vring_desc_payload_to_iov: *ERROR*: gpa_to_vva((nil)) == NULL
Signed-off-by: Arafatms <arafatms@outlook.com>
If writing to the TAP returns EAGAIN then listen for the TAP to be
writable. When the TAP becomes writable attempt to process the TX queue
again.
Fixes: #2807
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This dependency bump needed some manual handling since the API changed
quite a lot regarding some RawFd being changed into either File or
AsRawFd traits.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>