We are relying on applying empty 'seccomp' filters to support the
'--seccomp false' option, which will be treated as an error with the
updated 'seccompiler' crate. This patch fixes this issue by explicitly
checking whether the 'seccomp' filter is empty before applying the
filter.
Signed-off-by: Bo Chen <chen.bo@intel.com>
Introducing a new structure VhostUserCommon allowing to factorize a lot
of the code shared between the vhost-user devices (block, fs and net).
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
We cannot let vhost-user devices connect to the backend when the Block,
Fs or Net object is being created during a restore/migration. The reason
is we can't have two VMs (source and destination) connected to the same
backend at the same time. That's why we must delay the connection with
the vhost-user backend until the restoration is performed.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Introducing a new function to factorize a small part of the
initialization that is shared between a full reinitialization and a
restoration.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In order to prevent the vhost-user devices from reconnecting to the
backend after the migration has been successfully performed, we make
sure to kill the thread in charge of handling the reconnection
mechanism.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
During a migration, the vhost-user device talks to the backend to
retrieve the dirty pages. Once done with this, a snapshot will be taken,
meaning there's no need to communicate with the backend anymore. Closing
the communication is needed to let the destination VM being able to
connect to the same backend.
That's why we shutdown the communication with the backend in case a
migration has been started and we're asked for a snapshot.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This anticipates the need for creating a new Blk, Fs or Net object
without having performed the connection with the vhost-user backend yet.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
It was incorrect to call Vec::from_raw_parts() on the address pointing
to the shared memory log region since Vec is a Rust specific structure
that doesn't directly translate into bytes. That's why we use the same
function from std::slice in order to create a proper slice out of the
memory region, which is then copied into a Vec.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Now that the common vhost-user code can handle logging dirty pages
through shared memory, we need to advertise it to the vhost-user
backends with the protocol feature VHOST_USER_PROTOCOL_F_LOG_SHMFD.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Allow vsocks to connect to Unix sockets on the host running
cloud-hypervisor with enabled seccomp.
Reported-by: Philippe Schaaf <philippe.schaaf@secunet.com>
Tested-by: Franz Girlich <franz.girlich@tu-ilmenau.de>
Signed-off-by: Markus Theil <markus.theil@tu-ilmenau.de>
Adding the common vhost-user code for starting logging dirty pages when
the migration is started, and its counterpart for stopping, as well as
the code in charge of retrieving the bitmap of the dirty pages that have
been logged.
All these functions are meant to be leveraged from vhost-user devices.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Adding a simple field `migration_support` to VhostUserHandle in order to
store the information about the device supporting migration or not. The
value of this flag depends on the feature set negotiated with the
backend. It's considered as supporting migration if VHOST_F_LOG_ALL is
present in the virtio features and if VHOST_USER_PROTOCOL_F_LOG_SHMFD is
present in the vhost-user protocol features.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
backend like SPDK required to know how many virt queues to be handled
before gets VHOST_USER_SET_INFLIGHT_FD message.
fix dpdk core dump while processing vhost_user_set_inflight_fd:
#0 0x00007fffef47c347 in vhost_user_set_inflight_fd (pdev=0x7fffe2895998, msg=0x7fffe28956f0, main_fd=545) at ../lib/librte_vhost/vhost_user.c:1570
#1 0x00007fffef47e7b9 in vhost_user_msg_handler (vid=0, fd=545) at ../lib/librte_vhost/vhost_user.c:2735
#2 0x00007fffef46bac0 in vhost_user_read_cb (connfd=545, dat=0x7fffdc0008c0, remove=0x7fffe2895a64) at ../lib/librte_vhost/socket.c:309
#3 0x00007fffef45b3f6 in fdset_event_dispatch (arg=0x7fffef6dc2e0 <vhost_user+8192>) at ../lib/librte_vhost/fd_man.c:286
#4 0x00007ffff09926f3 in rte_thread_init (arg=0x15ee180) at ../lib/librte_eal/common/eal_common_thread.c:175
Signed-off-by: Arafatms <arafatms@outlook.com>
This patch adds all the seccomp rules missing for MSHV.
With this patch MSFT internal CI runs with seccomp enabled.
Signed-off-by: Muminul Islam <muislam@microsoft.com>
The vhost-user-net backend needs to prepare all queues before enabling vring.
For example, DPVGW will report 'the RX queue can't find' error, if we enable
vring immediately after kicking it out.
Signed-off-by: Arafatms <arafatms@outlook.com>
Adding the support for snapshot/restore feature for all supported
vhost-user devices.
The complexity of vhost-user-fs device makes it only partially
compatible with the feature. When using the DAX feature, there's no way
to store and remap what was previously mapped in the DAX region. And
when not using the cache region, if the filesystem is mounted, it fails
to be properly restored as this would require a special command to let
the backend know that it must remount what was already mounted before.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This patch moves all vhost-user common functions behind a new structure
VhostUserHandle. There is no functional changes intended, the only goal
being to prepare for storing information through this new structure,
limiting the amount of parameters that are needed for each function.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
With the new beta version, clippy complains about redundant allocation
when using Arc<Box<dyn T>>, and suggests replacing it simply with
Arc<dyn T>.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The cloud hypervisor tells the VM and the backend to support the PACKED_RING feature,
but it actually processes various variables according to the split ring logic, such
as last_avail_index. Eventually it will cause the following error (SPDK as an example):
vhost.c: 516:vhost_vq_packed_ring_enqueue: *ERROR*: descriptor has been used before
vhost_blk.c: 596:process_blk_task: *ERROR*: ====== Task 0x200113784640 req_idx 0 failed ======
vhost.c: 629:vhost_vring_desc_payload_to_iov: *ERROR*: gpa_to_vva((nil)) == NULL
Signed-off-by: Arafatms <arafatms@outlook.com>
If writing to the TAP returns EAGAIN then listen for the TAP to be
writable. When the TAP becomes writable attempt to process the TX queue
again.
Fixes: #2807
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This dependency bump needed some manual handling since the API changed
quite a lot regarding some RawFd being changed into either File or
AsRawFd traits.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Issue from beta verion of clippy:
Error: --> vm-virtio/src/queue.rs:700:59
|
700 | if let Some(used_event) = self.get_used_event(&mem) {
| ^^^^ help: change this to: `mem`
|
= note: `-D clippy::needless-borrow` implied by `-D warnings`
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#needless_borrow
Signed-off-by: Bo Chen <chen.bo@intel.com>
Now that vhost crate allows the caller to set the header flags, we can
set NEED_REPLY whenever the REPLY_ACK protocol feature is supported from
both ends.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
For vhost-user devices, memory should be shared between CLH and
vhost-user backend. However, madvise DONTNEED doesn't working in
this case. So, let's use fallocate PUNCH_HOLE to discard those
memory regions instead.
Signed-off-by: Li Hangjing <lihangjing@bytedance.com>
Sometimes we need balloon deflate automatically to give memory
back to guest, especially for some low priority guest processes
under memory pressure. Enable deflate_on_oom to support this.
Usage: --balloon "size=0,deflate_on_oom=on" \
Signed-off-by: Fei Li <lifei.shirley@bytedance.com>
Since using the VIRTIO configuration to expose the virtual IOMMU
topology has been deprecated, the virtio-iommu implementation must be
updated.
In order to follow the latest patchset that is about to be merged in the
upstream Linux kernel, it must rely on ACPI, and in particular the newly
introduced VIOT table to expose the information about the list of PCI
devices attached to the virtual IOMMU.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Since the reconnection thread took on the responsibility to handle
backend initiated requests as well, the variable naming should reflect
this by avoiding the "reconnect" prefix.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Vhost user INFLIGHT_SHMFD protocol feature supports inflight I/O
tracking, this commit implement the vhost-user device (master) support
of the feature. Till this commit, specific vhost-user devices (blk, fs,
or net) have not enable this feature.
Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Add the support for reconnecting the backend request handler after a
disconnection/crash happened.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Since the slave request handler is common to all vhost-user devices, the
same way the reconnection is, it makes sense to handle the requests from
the backend through the same thread.
The reconnection thread now handles both a reconnection as well as any
request coming from the backend.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This commit enables socket reconnection for vhost-user-fs backends. Note
that, till this commit:
- The re-establish of the slave communication channel is no supported. So
the socket reconnection does not support virtiofsd with DAX enabled.
- Inflight I/O tracking and restoring is not supported. Therefore, only
virtio-fs daemons that are not processing inflight requests can work
normally after reconnection.
- To make the restarted virtiofsd work normally after reconnection, the
internal status of virtiofsd should also be recovered. This is not the
work of cloud-hypervisor. If the virtio-fs daemon does not support
saving or restoring its internal status, then a re-mount in guest after
socket reconnection should be performed.
Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
This commit enables socket reconnection for vhost-user-blk backends.
Note that, till this commit, inflight I/O trakcing and restoring is not
supported. Therefore, only vhost-user-blk backend that are not processing
inflight requests can work normally after reconnection.
Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
We should try to read the last avail index from the vring memory aera. This
is necessary when handling vhost-user socket reconnection.
Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Function "GuestMemory::with_regions(_mut)" were mainly temporary methods
to access the regions in `GuestMemory` as the lack of iterator-based
access, and hence they are deprecated in the upstream vm-memory crate [1].
[1] https://github.com/rust-vmm/vm-memory/issues/133
Signed-off-by: Bo Chen <chen.bo@intel.com>
As the first step to complete live-migration with tracking dirty-pages
written by the VMM, this commit patches the dependent vm-memory crate to
the upstream version with the dirty-page-tracking capability. Most
changes are due to the updated `GuestMemoryMmap`, `GuestRegionMmap`, and
`MmapRegion` structs which are taking an additional generic type
parameter to specify what 'bitmap backend' is used.
The above changes should be transparent to the rest of the code base,
e.g. all unit/integration tests should pass without additional changes.
Signed-off-by: Bo Chen <chen.bo@intel.com>
Add a helper to VirtioCommon which returns duplicates of the EventFds
for kill and pause event.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The control queue was missing rt_sigprocmask syscall, which was causing
a crash when the VM was shutdown.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
When in client mode, the VMM will retry connecting the backend for a
minute before giving up.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The reconnection code is moved to the vhost-user module as it is a
common place to be shared across all vhost-user devices.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This commit implements the reconnection feature for vhost-user-net in
case the connection with the backend is shutdown.
The guest has no knowledge about what happens when a disconnection
occurs, which simplifies the code to handle the reconnection. The
reconnection happens with the backend, and the VMM goes through the
vhost-user setup once again.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
We thought we could move the control queue to the backend as it was
making some good sense. Unfortunately, doing so was a wrong design
decision as it broke the compatibility with OVS-DPDK backend.
This is why this commit moves the control queue back to the VMM side,
meaning an additional thread is being run for handling the communication
with the guest.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
A lot of the VIRTIO reserved features should be supported or not by the
vhost-user backend. That means on the VMM side, these features should be
available, so that they don't get lost during the negotiation.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The VIRTIO features should not be set before they are acked from the
guest. This code was only present to overcome a vhost crate limitation
that was expecting the VIRTIO features to be set before we could fetch
and set the protocol features.
The vhost crate has been recently fixed by removing the limitation,
therefore there's no need for this workaround in the Cloud Hypervisor
codebase anymore.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Everything that was shared in the net_util.rs file has been now moved to
the net_util crate. The only remaining bit was only used by the
virtio-net implementation, that is why this commit moves this code to
virtio-net, and since there's nothing left in net_util.rs, it can be
removed.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Since the net_util crate contains the common code needed for processing
the control queue, let's use it and remove the duplicate from inside the
virtio-devices crate.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Moving helpers to the net_util crate since we don't want virtio-net
common code to be split between two places. The net_util crate should be
the only place to host virtio-net common code.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Factorize the virtio features and vhost-user protocol features
negotiation through a common function that blk, fs and net
implementations can directly rely on.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Make sure the virtio features are set upon device activation. At the
time the device is activated, we know the guest acknowledged the
features, which mean it's safe to set them back to the backend.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The virtio features are negotiated and set at the time the device is
created, hence there's no need to set the features again while going
through the vhost-user setup that is performed upon queue activation.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Now that the control queue is correctly handled by the backend, there's
no need to handle it as well from the VMM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Some refactoring is performed in order to always expect the irqfd to be
provided by VirtioInterrupt trait. In case no irqfd is available, we
simply fail initializing the vhost-user device. This allows for further
simplification since we can assume the interrupt will always be
triggered directly by the vhost-user backend without proxying through
the VMM. This allows for complete removal of the dedicated thread for
both block and net.
vhost-user-fs is a bit more complex as it requires the slave request
protocol feature in order to support DAX. That's why we still need the
VMM to interfere and therefore run a dedicated thread for it.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Now all crates use edition = "2018" then the majority of the "extern
crate" statements can be removed. Only those for importing macros need
to remain.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
error: all if blocks contain the same code at the start
--> virtio-devices/src/mem.rs:508:9
|
508 | / if plug {
509 | | let handlers =
self.dma_mapping_handlers.lock().unwrap();
|
|_____________________________________________________________________^
|
= note: `-D clippy::branches-sharing-code` implied by `-D warnings`
= help: for further information visit
https://rust-lang.github.io/rust-clippy/master/index.html#branches_sharing_code
help: consider moving the start statements out like this
|
508 | let handlers = self.dma_mapping_handlers.lock().unwrap();
509 | if plug {
|
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
error: usage of `contains_key` followed by `insert` on a `BTreeMap`
--> virtio-devices/src/iommu.rs:439:17
|
439 | / if !mappings.contains_key(&domain) {
440 | | mappings.insert(domain, BTreeMap::new());
441 | | }
| |_________________^ help: try this:
`mappings.entry(domain).or_insert_with(|| BTreeMap::new());`
|
= note: `-D clippy::map-entry` implied by `-D warnings`
= help: for further information visit
https://rust-lang.github.io/rust-clippy/master/index.html#map_entry
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Issue from beta version of clippy:
--> virtio-devices/src/vsock/csm/txbuf.rs:69:34
|
69 | Box::new(unsafe {mem::MaybeUninit::<[u8;
Self::SIZE]>::uninit().assume_init()}));
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
= note: `#[deny(clippy::uninit_assumed_init)]` on by default
= help: for further information visit
https://rust-lang.github.io/rust-clippy/master/index.html#uninit_assumed_init
Fix backported Firecracker a8c9dffad439557081f3435a7819bf89b87870e7.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
error: reference to packed field is unaligned
--> virtio-devices/src/vhost_user/fs.rs:85:21
|
85 | fs.flags[i].bits() as i32,
| ^^^^^^^^^^^
|
= note: `-D unaligned-references` implied by `-D warnings`
= warning: this was previously accepted by the compiler but is being
phased out; it will become a hard error in a future release!
= note: for more information, see issue #82523
<https://github.com/rust-lang/rust/issues/82523>
= note: fields of packed structs are not properly aligned, and
creating a misaligned reference is undefined behavior (even if that
reference is never dereferenced)
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Adding the support for an OVS vhost-user backend to connect as the
vhost-user client. This means we introduce with this patch a new
option to our `--net` parameter. This option is called 'server' in order
to ask the VMM to run as the server for the vhost-user socket.
Fixes#1745
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In some situations (booting with OVMF) fewer queues will be enabled
therefore we should iterate over the number of enabled queues (as passed
into VirtioDevice::activate()) rather than the number of create tap
devices.
Fixes: #2578
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This allows the guest to reprogram the offload settings and mitigates
issues where the Linux kernel tries to reprogram the queues even when
the feature is not advertised.
Fixes: #2528
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Rather than erroring out and stalling the queue instead report an error
message if the command is invalid and return an error to the guest via
the status field.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Cleanup the control queue handling in preparation for supporting
alternative commands.
Note that this change does not make the MQ handling spec compliant.
According to the specification MQ should only be enabled once the number
of queue pairs the guest would like to use has been specified. The only
improvement towards the specication in this change is correct error
handling if the guest specifies an inappropriate number of queues (out
of range.)
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Configure the tap offload features to match those that the guest has
acknowledged. The function for converting virtio to tap features came
from crosvm:
4786cee521/devices/src/virtio/net.rs (115)
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
In order to support using Versionize for state structures it is necessary
to use simpler, primitive, data types in the state definitions used for
snapshot restore.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Duplicate the fd that is specified in the config so that be used again
after a reboot. When rebooting we destroy all VM state and restore from
the config.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
With current serde_derive it is possible to #[derive(Serialize)] on
packed structures if they implement Copy. This allows the removal of the
manual equivalent code.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Simplify snapshot & restore code by using generics to specify helper
functions that take / make a Serialize / Deserialize struct
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This patch moves out the actual processing on the TX queue from the
`handle_tx_event()` function into a separate function,
e.g. `process_tx()`. This allows us to resume the TX queue processing
without reading from the TX queue EventFd, which is needed for rate
limiting support.
No functional change.
Signed-off-by: Bo Chen <chen.bo@intel.com>
To support I/O throttling on virt-net devices, we need to use the
'rate_limiter' module from the 'net_utils' crate. Given the
'virtio-devices' crate has dependency on the 'net_utils', we will need
to move the 'rate_limiter' module out of the 'virtio-devices' crate to
avoid circular dependency issue. Considering the 'rate_limiter' is not
virtio specific and could be reused for non virtio devices, we move it
to its own crate.
Signed-off-by: Bo Chen <chen.bo@intel.com>
warning: name `IORegion` contains a capitalized acronym
--> pci/src/configuration.rs:320:5
|
320 | IORegion = 0x01,
| ^^^^^^^^ help: consider making the acronym lowercase, except the initial letter (notice the capitalization): `IoRegion`
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#upper_case_acronyms
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
warning: name `ConvertFromUTF8` contains a capitalized acronym
--> virtio-devices/src/vsock/unix/mod.rs:32:5
|
32 | ConvertFromUTF8(std::str::Utf8Error),
| ^^^^^^^^^^^^^^^ help: consider making the acronym lowercase, except the initial letter: `ConvertFromUtf8`
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#upper_case_acronyms
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
error: name `TYPE_UNKNOWN` contains a capitalized acronym
--> vm-virtio/src/lib.rs:48:5
|
48 | TYPE_UNKNOWN = 0xFF,
| ^^^^^^^^^^^^ help: consider making the acronym lowercase, except the initial letter: `Type_Unknown`
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#upper_case_acronyms
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
In case of the virtio frontend driver doesn't need interupts for
certain queue event, it may explicitly write VIRTIO_MSI_NO_VECTOR
to the virtio common configuration, or it may doesn't configure
the event type vector at all.
This patch initializes both MSI-X configuration vector and queue vector
with VIRTIO_MSI_NO_VECTOR, so that the backend drivers won't trigger
unexpected interrupts to the guest.
Signed-off-by: Zide Chen <zide.chen@intel.com>
Now that virtio devices can be updated with add_memory_region(), there's
no need to keep update_memory() around.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Assuming vhost-user devices support CONFIGURE_MEM_SLOTS protocol
feature, we introduce a new method to the VirtioDevice trait in order to
update one single memory at a time.
In case CONFIGURE_MEM_SLOTS is not supported by the backend (feature not
acked), we fallback onto the current way of updating the memory
mappings, that is with SET_MEM_TABLE.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
On x86_64 architecture, multiple syscalls were missing when shutting
down the vhost-user-net device along with the VM. This was causing the
usual crash related to seccomp filters.
This commit adds these missing syscalls to fix the issue.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
There is no need to get the vring base when resetting the vhost-user
device. This was mostly ignored, but in some cases, it was causing some
actual errors.
A reset must simply be a combination of disabling the vrings along with
the reset of the owner.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Originally, VhostUserSetup is only used by vhost-user-fs. While
vhost-user-blk and vhost-user-net have their own error messages,
we rename VhostUserSetup to VhostUserFsSetup.
Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
Create two functions for registering/unregistering DMA mapping handlers,
each handler being associated with a VFIO device.
Whenever the plugged_size is modified (which means triggered by the
virtio-mem driver in the guest), the virtio-mem backend is responsible
for updating the DMA mappings related to every VFIO device through the
handler previously provided.
It's important to update the map when the handler is either registered
or unregistered as well, as we don't want to miss some plugged memory
that would have been added before the VFIO device is added to the VM.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Moving to the latest version of the rust-vmm/vhost crate, before it gets
published on crates.io.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
In particular update for the vmm-sys-util upgrade and all the other
dependent packages. This requires an updated forked version of
kvm-bindings (due to updated vfio-ioctls) but allowed the removal of our
forked version of kvm-ioctls.
The changes to the API from kvm-ioctls and vmm-sys-util required some
other minor changes to the code.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The vhost crate from rust-vmm is ready, which is why we do the switch
from the Cloud Hypervisor fork to the upstream crate.
At the same time, we rename the crate from vhost_rs to vhost.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Now that ExternalDmaMapping is defined in vm-device, let's use it from
there.
This commit also defines the function get_host_address_range() to move
away from the vfio-ioctls dependency.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The main idea behind this commit is to remove all the complexity
associated with TX/RX handling for virtio-net. By using writev() and
readv() syscalls, we could get rid of intermediate buffers for both
queues.
The complexity regarding the TAP registration has been simplified as
well. The RX queue is only processed when some data are ready to be
read from TAP. The event related to the RX queue getting more
descriptors only serves the purpose to register the TAP file if it's not
already.
With all these simplifications, the code is more readable but more
performant as well. We can see an improvement of 10% for a single
queue device.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
If the function can never return an error this is now a clippy failure:
error: this function's return value is unnecessarily wrapped by `Result`
--> virtio-devices/src/watchdog.rs:215:5
|
215 | / fn set_state(&mut self, state: &WatchdogState) -> io::Result<()> {
216 | | self.common.avail_features = state.avail_features;
217 | | self.common.acked_features = state.acked_features;
218 | | // When restoring enable the watchdog if it was previously enabled. We reset the timer
... |
223 | | Ok(())
224 | | }
| |_____^
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#unnecessary_wraps
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
In anticipation for supporting the notifier function for the legacy
interrupt source group, we need this function to return an EventFd
instead of a reference to this same EventFd.
The reason is we can't return a reference when there's an Arc<Mutex<>>
involved in the call chain.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This thread is virtio-net specific, so it is not handled in the common
virtio device code.
The non-vhost implementation resumes the thread itself. Do the same
thing for vhost-user-net.
Signed-off-by: Wei Liu <liuwe@microsoft.com>
This commit introduces a new information to the VirtioMemZone structure
in order to know if the memory zone is backed by hugepages.
Based on this new information, the virtio-mem device is now able to
determine if madvise(MADV_DONTNEED) should be performed or not. The
madvise documentation specifies that MADV_DONTNEED advice will fail if
the memory range has been allocated with some hugepages.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Signed-off-by: Hui Zhu <teawater@antfin.com>
This commit performs some refactoring to make all functions a method
from a specific object, and in particular methods for MemEpollHandler.
The point is to simplify the code to make it more readable.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Adjust the code to comply better with the virtio-mem specification by
adding some validation for the virtio-mem configuration, but also by
updating the virtio-mem configuration itself.
Nowhere in the virtio-mem specification is stated the usable region size
must be adjusted everytime the plugged size changes. For simplification
reason, and without going against the specification, the usable region
size is now kept static, setting its value to the size of the whole
region.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
By introducing a ResizeSender object, we avoid having a Resize clone
with a different content than the original Resize object.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Using directly preadv and pwritev, we can simply use a RawFd instead of
a file, and we don't need to use the more complex implementation from
the qcow crate.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This commit adds the asynchronous support for fixed VHD disk files.
It introduces FixedVhd as a new ImageType, moving the image type
detection to the block_util crate (instead of qcow crate).
It creates a new vhd module in the block_util crate in order to handle
VHD footer, following the VHD specification.
It creates a new fixed_vhd_async module in the block_util crate to
implement the asynchronous version of fixed VHD disk file. It relies on
io_uring.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This patch enables multi-queue support for creating virtio-net devices by
accepting multiple TAP fds, e.g. '--net fds=3:7'.
Fixes: #2164
Signed-off-by: Bo Chen <chen.bo@intel.com>
This helper can open a TAP device and configure the interface on it. If
the device needs to be opened multiple times for MQ then it also handles
that correctly.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Now that BlockIoUring is the only implementation of virtio-block,
handling both synchronous and asynchronous backends based on the
AsyncIo trait, we can rename it to Block.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Now that both synchronous and asynchronous backends rely on the
asynchronous version of virtio-block (namely BlockIoUring), we can
get rid of the synchronous version (namely Block).
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Based on the synchronous QCOW file implementation present in the qcow
crate, we created a new qcow_sync module in block_util that ports this
synchronous implementation to the AsyncIo trait.
The point is to reuse virtio-blk asynchronous implementation for both
synchronous and asynchronous backends.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Based on the synchronous RAW file implementation present in the qcow
crate, we created a new raw_sync module in block_util that ports this
synchronous implementation to the AsyncIo trait.
The point is to reuse virtio-blk asynchronous implementation for both
synchronous and asynchronous backends.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Based on the new DiskFile and AsyncIo traits, the implementation of
asynchronous block support does not have to be tied to io_uring anymore.
Instead, the only thing the virtio-blk implementation knows is that it
is using an asynchronous implementation of the underlying disk file.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Even though the driver can provide fewer queues than those advertised
for some device types their is a minimum number that is required for
operation.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
It is permissable for the driver to program fewer queues than offered by
the device. Filter the queues so that only the ready ones are included
and check that they have valid addresses configured.
Fixes: #2136
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Don't assume that the number of queues provided match the number of
queues offered. The virtio spec allows the driver to program fewer
queues.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Rather than having to give and return the ioeventfd used for a device
clone them each time. This will make it simpler when we start handling
the driver enabling fewer queues than advertised by the device.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
We have killed the device thread by writing to the exit EventFd but we
should wait for them to quit to ensure consistency.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
In order to make the thread naming more useful derive their name from
the device id (which can be supplied by the user) and a device specific
suffix that has details of the individual queue (or queue pair.)
e.g.
rob@artemis:~$ pstree -p -c -l -t `pidof cloud-hypervisor`
cloud-hyperviso(27501)─┬─{_console}(27525)
├─{_disk0_q0}(27529)
├─{_disk0_q1}(27532)
├─{_net1_ctrl}(27533)
├─{_net1_qp0}(27534)
├─{_net1_qp1}(27535)
├─{_rng}(27526)
├─{http-server}(27504)
├─{seccomp_signal_}(27502)
├─{signal_handler}(27523)
├─{vcpu0}(27520)
├─{vcpu1}(27522)
└─{vmm}(27503)
Fixes: #2077
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Sometimes when running under the CI tests fail due to a barrier not
being released and the guest blocks on an MMIO write. Add further
debugging to try and identify the issue.
See: #2118
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
On aarch64, the openat() syscall was missing from the seccomp filters
list, preventing the test_watchdog from running properly.
Fixes#2103
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
error: field assignment outside of initializer for an instance created with Default::default()
--> virtio-devices/src/mem.rs:496:9
|
496 | resp.resp_type = resp_type;
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
note: consider initializing the variable with `mem::VirtioMemResp { resp_type: resp_type, ..Default::default() }` and removing relevant reassignments
--> virtio-devices/src/mem.rs:495:9
|
495 | let mut resp = VirtioMemResp::default();
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#field_reassign_with_default
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
On the CI we are seeing issues with the activation barriers not being released:
cloud-hypervisor: 12.452434193s: INFO:vmm/src/vm.rs:413 -- Waiting for barrier
cloud-hypervisor: 12.452499794s: INFO:virtio-devices/src/block.rs:382 -- Changing cache mode to writeback
cloud-hypervisor: 12.452605195s: INFO:vmm/src/vm.rs:413 -- Waiting for barrier
cloud-hypervisor: 12.452684596s: INFO:virtio-devices/src/transport/pci_device.rs:671 -- Waiting for barrier
cloud-hypervisor: 12.452708196s: INFO:virtio-devices/src/transport/pci_device.rs:673 -- Barrier released
cloud-hypervisor: 12.452717596s: INFO:vmm/src/vm.rs:415 -- Barrier released
Add some debugging to try and identify the vause of this issue.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Add support for creating virtio-net device from existing TAP fd.
Currently only a single fd and thus no-more than 2 queues (one pair) is
suppored.
Fixes: #2052
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
When a device is ready to be activated signal to the VMM thread via an
EventFd that there is a device to be activated. When the VMM receives a
notification on the EventFd that there is a device to be activated
notify the device manager to attempt to activate any devices that have
not been activated.
As a side effect the VMM thread will create the virtio device threads.
Fixes: #1863
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
We need to be able to return the barrier from the code that prepares to
activate the virtio device. This triggered by a write to the
configuration fields stored in the PCI BAR. Since bars can be accessed
by both memory mapping and through PCI config I/O several prototypes
must be changed.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This can be uses to indicate to the caller that it should wait on the
barrier before returning as there is some asynchronous activity
triggered by the write which requires the KVM exit to block until it's
completed.
This is useful for having vCPU thread wait for the VMM thread to proceed
to activate the virtio devices.
See #1863
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
When a total ordering between multiple atomic variables is not required
then use Ordering::Acquire with atomic loads and Ordering::Release with
atomic stores.
This will improve performance as this does not require a memory fence
on x86_64 which Ordering::SeqCst will use.
Add a comment to the code in the vCPU handling code where it operates on
multiple atomics to explain why Ordering::SeqCst is required.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
This device operates a single virtq. When the driver offers a descriptor
to the device it is interpreted as a "ping" to indicate that the guest
is alive. A periodic timer fires and if when the timer is fired there
has not been a "ping" from the guest then the device will reset the VM.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
While using the virtio-iommu device involving L2 scenario, and tearing
things down all the way from L2 back to L0 exposed some bad syscalls
that were not part of the authorized list.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The goal here is to replace anywhere possible a virtio structure
with a "C, packed" representation by a "C" representation. Some
virtio structures are not expected to be packed, therefore there's
no reason for using the more restrictive "C, packed" representation.
This is important since "packed" representation can still cause
undefined behaviors with Rust 2018.
By removing the need for "packed" representation, we can simplify a
bit of code by deriving the Serialize trait without writing the
implementation ourselves.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Small patch creating a dedicated `block_io_uring_is_supported()`
function for the non-io_uring case, so that we can simplify the
code in the DeviceManager.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The virtio-balloon change the memory size is asynchronous.
VirtioBalloonConfig.actual of balloon device show current balloon size.
This commit add memory_actual_size to vm.info to show memory actual size.
Signed-off-by: Hui Zhu <teawater@antfin.com>
Misspellings were identified by https://github.com/marketplace/actions/check-spelling
* Initial corrections suggested by Google Sheets
* Additional corrections by Google Chrome auto-suggest
* Some manual corrections
Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>
The missing syscall rt_sigprocmask(2) was triggered for the musl build
upon rebooting the VM, and was causing the VM to be killed.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
This commit gives the possibility to create a virtio-mem device with
some memory already plugged into it. This is preliminary work to be
able to reboot a VM with the virtio-mem region being already resized.
Signed-off-by: Hui Zhu <teawater@antfin.com>
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The virtio-mem driver is generating some warnings regarding both size
and alignment of the virtio-mem region if not based on 128MiB:
The alignment of the physical start address can make some memory
unusable.
The alignment of the physical end address can make some memory
unusable.
For these reasons, the current patch enforces virtio-mem regions to be
128MiB aligned and checks the size provided by the user is a multiple of
128MiB.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Implement support for associating a virtio-mem device with a specific
guest NUMA node, based on the ACPI proximity domain identifier.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
By testing manually the memory resizing through virtio-mem, several
missing syscalls have been identified.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The Windows virtio block driver puts multiple data descriptors between
the header and the status footer. To handle this when parsing iterate
over the descriptor chain until the end is reached accumulating the
address and length pairs in a vector. For execution iterate over the
vector and make sequential reads from the disk for each data descriptor.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
We observed CI instability for the past couple of days. This
instability is confirmed to be a result of incomplete seccomp
filters. Given the filter on 'virtio_vsock' is recently added and
is missing 'brk', it is likely to be the root cause of the
instability.
Signed-off-by: Bo Chen <chen.bo@intel.com>
This removes the dependency of the pci crate on the devices crate which
now only contains the device implementations themselves.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Split the block device implementation into code that be used in common
between multiple different virtio device implementations.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
In order to simplify the transition to VirtioCommon and to avoid needing
to set empty fields derive Default for VirtioCommon.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Rearrange the code to match other devices which makes it easier to prep
for sharing this between other devices.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Move the if-let for the taps later which makes the earlier activation
code identical to other devices.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>