We want to prevent from losing interrupts while they are masked. The
way they can be lost is due to the internals of how they are connected
through KVM. An eventfd is registered to a specific GSI, and then a
route is associated with this same GSI.
The current code adds/removes a route whenever a mask/unmask action
happens. Problem with this approach, KVM will consume the eventfd but
it won't be able to find an associated route and eventually it won't
be able to deliver the interrupt.
That's why this patch introduces a different way of masking/unmasking
the interrupts, simply by registering/unregistering the eventfd with the
GSI. This way, when the vector is masked, the eventfd is going to be
written but nothing will happen because KVM won't consume the event.
Whenever the unmask happens, the eventfd will be registered with a
specific GSI, and if there's some pending events, KVM will trigger them,
based on the route associated with the GSI.
Suggested-by: Liu Jiang <gerry@linux.alibaba.com>
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
We should not assume the offset produced by ECAM is identical to the
CONFIG_ADDRESS register of legacy PCI port io enumeration.
Signed-off-by: Qiu Wenbo <qiuwenbo@phytium.com.cn>
This option improves the security of the guest by randomising the start
address of the kernel in physical memory. We should turn this on so as
to ensure all our functionality such as memory hotplug and kernel
loading works as this is an option used widely in production.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Recently, vhost_user_block gained the ability of actively polling the
queue, a feature that can be disabled with the poll_queue property.
This change adds this property to DiskConfig, so it can be used
through the "disk" argument.
For the moment, it can only be used when vhost_user=true, but this
will change once virtio-block gets the poll_queue feature too.
Fixes: #787
Signed-off-by: Sergio Lopez <slp@redhat.com>
Fix "readonly" and "wce" defaults in cloud-hypervisor.yaml to match
their respective defaults in config.rs:DiskConfig.
Signed-off-by: Sergio Lopez <slp@redhat.com>
This is a perfectly acceptable situation as it causes the backend to
exit because the VMM has closed the connection. This addresses the
rather ugly reporting of errors from the backend that appears
interleaved with the output from the VMM.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Return an error wen recvmsg() returns without a message using the
libc::ECONNRESET error so that the upper levels will correctly
interpret this as the connection being broken.
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
It's missing a few knobs (readonly, vhost, wce) that should be exposed
through the rest API.
Fixes: #790
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
The kernel does not adhere to the ACPI specification (probably to work
around broken hardware) and rather than busy looping after requesting an
ACPI reset it will attempt to reset by other mechanisms (such as i8042
reset.)
In order to trigger a reset the devices write to an EventFd (called
reset_evt.) This is used by the VMM to identify if a reset is requested
and make the VM reboot. As the reset_evt is part of the VMM and reused
for both the old and new VM it is possible for the newly booted VM to
immediately get reset as there is an old event sitting in the EventFd.
The simplest solution is to "drain" the reset_evt EventFd on reboot to
make sure that there is no spurious events in the EventFd.
Fixes: #783
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
Now that vhost_user_backend and vm-virtio do support EVENT_IDX, use it
in vhost_user_block to reduce the number of notifications sent between
the driver and the device.
This is specially useful when using active polling on the virtqueue,
as it'll be implemented by a future patch.
This is a snapshot of kvm_stat while generating ~60K IOPS with fio on
the guest without EVENT_IDX:
Event Total %Total CurAvg/s
kvm_entry 393454 20.3 62494
kvm_exit 393446 20.3 62494
kvm_apic_accept_irq 378146 19.5 60268
kvm_msi_set_irq 369720 19.0 58881
kvm_fast_mmio 370497 19.1 58817
kvm_hv_timer_state 10197 0.5 1715
kvm_msr 8770 0.5 1443
kvm_wait_lapic_expire 7018 0.4 1118
kvm_apic 2768 0.1 538
kvm_pv_tlb_flush 2028 0.1 360
kvm_vcpu_wakeup 1453 0.1 278
kvm_apic_ipi 1384 0.1 269
kvm_fpu 1148 0.1 164
kvm_pio 574 0.0 82
kvm_userspace_exit 574 0.0 82
kvm_halt_poll_ns 24 0.0 3
And this is the snapshot while doing the same thing with EVENT_IDX:
Event Total %Total CurAvg/s
kvm_entry 35506 26.0 3873
kvm_exit 35499 26.0 3873
kvm_hv_timer_state 14740 10.8 1672
kvm_apic_accept_irq 13017 9.5 1438
kvm_msr 12845 9.4 1421
kvm_wait_lapic_expire 10422 7.6 1118
kvm_apic 3788 2.8 502
kvm_pv_tlb_flush 2708 2.0 340
kvm_vcpu_wakeup 1992 1.5 258
kvm_apic_ipi 1894 1.4 251
kvm_fpu 1476 1.1 164
kvm_pio 738 0.5 82
kvm_userspace_exit 738 0.5 82
kvm_msi_set_irq 701 0.5 69
kvm_fast_mmio 238 0.2 4
kvm_halt_poll_ns 50 0.0 1
kvm_ple_window_update 28 0.0 0
kvm_page_fault 4 0.0 0
It can be clearly appreciated how the number of vm exits per second,
specially the ones related to notifications (kvm_fast_mmio and
kvm_msi_set_irq) is drastically lower.
Signed-off-by: Sergio Lopez <slp@redhat.com>
Add helpers to Vring and VhostUserSlaveReqHandler for EVENT_IDX, so
consumers of this crate can make use of this feature.
Signed-off-by: Sergio Lopez <slp@redhat.com>
VIRTIO_RING_F_EVENT_IDX is a virtio feature that allows to avoid
device <-> driver notifications under some circunstances, most
notably when actively polling the queue.
This commit implements support for in in the vm-virtio
crate. Consumers of this crate will also need to add support for it by
exposing the feature and calling using update_avail_event() and
get_used_event() accordingly.
Signed-off-by: Sergio Lopez <slp@redhat.com>
Relying on the latest vm-memory version, including the freshly
introduced structure GuestMemoryAtomic, this patch replaces every
occurrence of Arc<ArcSwap<GuestMemoryMmap> with
GuestMemoryAtomic<GuestMemoryMmap>.
The point is to rely on the common RCU-like implementation from
vm-memory so that we don't have to do it from Cloud-Hypervisor.
Fixes#735
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The integration test test_memory_mergeable_on has been fairly unstable
for quite some time now. Because it can take some time for the VM to be
spawned and to be able to perform a correct measure of the PSS, this
commit simply increases the time before such measure is done.
This should return more accurate PSS results, which should help
stabilize the test.
Fixes#781
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Now that vhost_user_fs rust daemon supports virtiofs's dax mode, this adds
the two dax tests accordingly.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
This adds the missing part of supporting virtiofs dax on the slave end,
that is, receiving a socket pair fd from the master end to set up a
communication channel for sending setupmapping & removemapping messages.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
This introduces setupmapping and removemapping methods to server.rs,
passthrough.rs and filesystem.rs in order to support virtiofs dax mode
inside guest.
Since we don't really want the server.rs to know that it is dealing with
vhost-user specifically, this is making it more generic by adding a new
trait which has three functions map()/unmap()/sync() corresponding to
fs_slave_{map, unmap, sync}, server.rs will take anything that implements
the trait.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
This introduces SlaveFsCacheReq which implements
VhostUserMasterReqHandler to handle map/unmap requests.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
The unit tests are run from cargo test through multiple threads of the
same process. For this reason, all these threads share their file
descriptors (because that's how this works on Linux), which means that
any of them can close a file descriptor opened from another thread.
In the context of create_listener() and accept_connection() tests, they
can run concurrently and this generates some failure when the file
descriptor create_listener() is binding to is being closed from the
accept_connection() test.
In order to avoid such race condition, this patch simply removes the
part of the unit test performing an explicit and unsafe file descriptor
closure.
Fixes#759
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
If no socket is supplied when enabling "vhost_user=true" on "--disk"
follow the "exe" path in the /proc entry for this process and launch the
network backend (via the vmm_path field.)
Signed-off-by: Rob Bradford <robert.bradford@intel.com>
The unit tests require some specific Linux capabilities and also to have
access to /dev/kvm device. This commit makes sure we enable only what's
necessary instead of blindly enable full priviliges with --privileged
option.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
We need the host IPC for sharing eventfds with KVM, and the host network
for VFIO.
We also enforce the no-seccomp setting on the container, to overcome any
potential filtering set by our container's Ubuntu base.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
When a virtio-fs device is created with a dedicated shared region, by
default the region should be mapped as PROT_NONE so that no pages can be
faulted in.
It's only when the guest performs the mount of the virtiofs filesystem
that we can expect the VMM, on behalf of the backend, to perform some
new mappings in the reserved shared window, using PROT_READ and/or
PROT_WRITE.
Fixes#763
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>