After transferring the back-end’s internal state during migration,
check whether the back-end was able to successfully fully process
the state.
The value returned indicates success or error;
0 is success, any non-zero value is an error.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
This features allows QEMU to be migrated. We need also to report
VHOST_F_LOG_ALL.
This protocol feature reports we can log the page update and
implement VHOST_USER_SET_LOG_BASE and VHOST_USER_SET_LOG_FD.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Sets logging shared memory space.
When the back-end has VHOST_USER_PROTOCOL_F_LOG_SHMFD protocol feature,
the log memory fd is provided in the ancillary data of
VHOST_USER_SET_LOG_BASE message, the size and offset of shared memory
area provided in the message.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
vu_dev will be needed to log page update.
Add the parameter to:
vring_used_write()
vu_queue_fill_by_index()
vu_queue_fill()
vring_used_idx_set()
vu_queue_flush()
The new parameter is unused for now.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
VHOST_USER_SET_LOG_FD is an optional message with an eventfd
in ancillary data, it may be used to inform the front-end that the
log has been modified.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
vhost-user protocol specification has been updated with
feature flags and commands we will need to implement migration.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
We don't modify the structure in some virtio functions.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
With glibc commit 25a5eb4010df ("string: strerror, strsignal cannot
use buffer after dlmopen (bug 32026)"), strerror() now needs, at least
on x86, the getrandom() and brk() system calls, in order to fill in
the locale-translated error message. But getrandom() and brk() are not
allowed by our seccomp profiles.
This became visible on Fedora Rawhide with the "podman login and
logout" Podman tests, defined at test/e2e/login_logout_test.go in the
Podman source tree, where pasta would terminate upon printing error
descriptions (at least the ones related to the SO_ERROR queue for
spliced connections).
Avoid dynamic memory allocation by calling strerrordesc_np() instead,
which is a GNU function returning a static, untranslated version of
the error description. If it's not available, keep calling strerror(),
which at that point should be simple enough as to be usable (at least,
that's currently the case for musl).
Reported-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/issues/24804
Analysed-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: Paul Holzinger <pholzing@redhat.com>
During testing it is sometimes useful to force traffic which would
normally be forwared by socket splicing through the tap interface.
In this commit, we add a command switch enabling such funtionality
for inbound local traffic.
For outbound local traffic this is much trickier, if even possible,
so leave that for a later commit.
Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We need to initialize vhost-user structures with --fd too.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Merge code from tap_backend_init(), tap_sock_tun_init() and
tap_listen_handler() to set epoll_ref entry and to add it
to epollfd.
No functionality change
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In udp_vu_sock_recv(), collect a segment with a size defined to
IP_MAX_MTU + ETH_HLEN + sizeof(struct virtio_net_hdr_mrg_rxbuf)
The original version double counted the IP header: IP_MAX_MTU includes
the IP header, and so did hdrlen.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In flow_sidx_hash() we verify that the flow we're hashing doesn't have an
unspecified endpoint address, or zero for either port. The hash table only
works if we're looking for exact matches of address and port, and this is
attempting to catch any cases where we might have left address or port
unpopulated or filled with a wildcard.
This doesn't really work though, because there are cases where unspecified
addresses or zero ports are correct:
* We already use unspecified addresses for our address in cases where we
don't know the specific local address for that side, and exclude the
obvious extra check on side->oaddr for that reason.
* Zero port numbers aren't strictly forbidden over the wire. We forbid
them for TCP & UDP because they can't safely be handled on the socket
side. However for ICMP a zero id, which goes in the port field is
valid.
* Possible future flow types (for example, for multicast protocols) might
legitimately have an unspecified address.
Although it makes them easier to miss, these sorts of sanity checks really
have to be done at the protocol / flow type layer, and we already do so.
Remove the checks in flow_sidx_hash() other than checking that the pif
is specified.
Reported-by: Stefan <steffhip@gmail.com>
Link: https://bugs.passt.top/show_bug.cgi?id=105
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In udp_flow_new() we reject a flow if the endpoint isn't unicast, or it has
a zero endpoint port. Those conditions aren't strictly illegal, but we
can't safely handle them at present:
* Multicast UDP endpoints are certainly possible, but our current flow
tracking only makes sense for simple unicast flows - we'll need
different handling if we want to handle multicast flows in future
* It's not entirely clear if port 0 is RFC-ishly correct, but for socket
interfaces port 0 sometimes has a special meaning such as "pick the port
for me, kernel". That makes flows on port 0 unsafe to forward in the
usual way.
For the same reason we also can't safely handle port 0 as our port. In
principle that's also true for our address, however in the case of flows
initiated from a socket, we may not know our address since the socket
could be bound to 0.0.0.0 or ::, so we can only verify that our address
is unicast for flows initiated from the tap side.
Refine the current check in udp_flow_new() to slightly more detailed checks
in udp_flow_from_sock() and udp_flow_from_tap() to make what is and isn't
handled clearer. This makes this checking more similar to what we do for
TCP connections.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In tcp_vu_data_from_sock() we compute IPv4 header checksum only
for the first and the last packets, and re-use the first packet checksum
for all the other packets as the content of the header doesn't change.
It's more accurate to check the dlen value to know if the checksum
should change as dlen is the only information that can change in the
loop.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
TARGET_ARCH is computed from '$(CC) -dumpmachine' using external
bash commands like echo, cut, tr and sed. This can be done using
make internal string functions.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Because the vhost-user <-> virtio-net path ignores checksums, we usually
don't calculate them when sending packets to the guest. So, we always
pass no_tcp_csum=true to tcp_fill_headers(). We do want accurate
checksums when capturing packets though, so the captures don't show bogus
values.
Currently we handle this by updating the checksum field immediately before
writing the packet to the capture file, using tcp_vu_update_check(). This
is unnecessary, though: in each case tcp_fill_headers() is called not very
long before, so we can alter its no_tcp_csum parameter pased on whether
we're generating captures or not.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We have different versions of this function for IPv4 and IPv6, but the
caller already requires some IP version specific code to get the right
header pointers. Instead, have a common function that fills either an
IPv4 or an IPv6 header based on which header pointer it is passed. This
allows us to remove a small amount of code duplication and make a few
slightly ugly conditionals.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The only reason we need separate functions for the IPv4 and IPv6 case is
to calculate the checksum of the IP pseudo-header, which is different for
the two cases. However, the caller already knows which path it's on and
can access the values needed for the pseudo-header partial sum more easily
than tcp_update_check_tcp[46]() can.
So, merge these functions into a single tcp_update_csum() function that
just takes the pseudo-header partial sum, calculated in the caller.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
At the moment these take separate pointers to the tap specific and IP
headers, but expect the TCP header and payload as a single tcp_payload_t.
As well as being slightly inconsistent, this involves some slightly iffy
pointer shenanigans when called on the flags path with a tcp_flags_t
instead of a tcp_payload_t.
More importantly, it's inconvenient for the upcoming vhost-user case, where
the TCP header and payload might not be contiguous. Furthermore, the
payload itself might not be contiguous.
So, pass the TCP header as its own pointer, and the TCP payload as an IO
vector.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently these expects both the TCP header and payload in a single IOV,
and goes to some trouble to locate the checksum field within it. In the
current caller we've already know where the TCP header is, so we might as
well just pass it in. This will need to work a bit differently for
vhost-user, but that code already needs to locate the TCP header for other
reasons, so again we can just pass it in.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We usually want to checksum only the tail part of a frame, excluding at
least some headers. csum_iov() does that for a frame represented as an
IO vector, not actually summing the entire IO vector. We now have struct
iov_tail to explicitly represent this construct, so replace csum_iov()
with csum_iov_tail() taking that representation rather than 3 parameters.
We propagate the same change to csum_udp4() and csum_udp6() which take
similar parameters. This slightly simplifies the code, and will allow some
further simplifications as struct iov_tail is more widely used.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In the vhost-user code we have a number of places where we need to locate
a particular header within the guest-supplied IO vector. We need to work
out which buffer the header is in, and verify that it's contiguous and
aligned as we need. At the moment this is open-coded, but introduce a
helper to make this more straightforward.
We add a new datatype 'struct iov_tail' representing an IO vector from
which we've logically consumed some number of headers. The IOV_PULL_HEADER
macro consumes a new header from the vector, returning a pointer and
updating the iov_tail.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
...to quickly suppress a false positive from Coverity, which assumes
that iov_size is 0 and 'dlen' might overflow as a result (with hdrlen
being 66). An ASSERT() in tcp_vu_sock_recv() already guarantees that
iov_size(iov, buf_cnt) here is anyway greater than 'hdrlen'.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
If the connection to the vhost-user front end is closed during transfers
virtio rings are deconfigured and not available anymore, but we can
try to access them to process queued data. This can trigger a SIGSEG as
we try to access unavailable memory.
To fix that check vq->vring.avail is sane before accessing the vring
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This function only has callers in tcp_buf.c. More importantly, it's
inherently tied to the "buf" path, because it uses internal knowledge of
how we lay out the various headers across our locally allocated buffers.
Therefore, move it to tcp_buf.c.
Slightly reformat the prototypes while we're at it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Run functional and performance tests for vhost-user mode as well. For
functional tests, we add passt_vu and passt_vu_in_ns as symbolic links
to their non-vhost-user counterparts, as no differences are intended
but we want to distinguish them in test logs.
For performance tests, instead, we add separate perf/passt_vu_tcp and
perf/passt_vu_udp files, as we need longer test duration, as well as
higher UDP sending bandwidths and larger TCP windows, to actually get
the highest throughput vhost-user mode offers.
For valgrind tests, vhost-user mode needs two extra system calls:
statx and readlink. Add them as EXTRA_SYSCALLS for the valgrind
target.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
add virtio and vhost-user functions to connect with QEMU.
$ ./passt --vhost-user
and
# qemu-system-x86_64 ... -m 4G \
-object memory-backend-memfd,id=memfd0,share=on,size=4G \
-numa node,memdev=memfd0 \
-chardev socket,id=chr0,path=/tmp/passt_1.socket \
-netdev vhost-user,id=netdev0,chardev=chr0 \
-device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \
...
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: as suggested by lvivier, include <netinet/if_ether.h>
before including <linux/if_ether.h> as C libraries such as musl
__UAPI_DEF_ETHHDR in <netinet/if_ether.h> if they already have
a definition of struct ethhdr]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Add virtio.c and virtio.h that define the functions needed
to manage virtqueues.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
To be able to manage buffers inside a shared memory provided
by a VM via a vhost-user interface, we cannot rely on the fact
that buffers are located in a pre-defined memory area and use
a base address and a 32bit offset to address them.
We need a 64bit address, so replace struct desc by struct iovec
and update range checking.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
It's widely considered a legacy option nowadays, and I've haven't seen
clients setting it since Windows 95, but it's convenient for a minimal
DHCP client not using raw IP sockets such as what I'm playing with for
muvm.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
I'm trying to speed up and simplify IP address acquisition in muvm.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
We want to add support for option 80 (Rapid Commit, RFC 4039), whose
length is 0.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
There are setups where no host interface is available or configured
at all, intentionally or not, temporarily or not, but users expect
(Podman) containers to run in any case as they did with slirp4netns,
and we're now getting reports that we broke such setups at a rather
alarming rate.
To this end, if we don't find any usable host interface, instead of
exiting:
- for IPv4, use 169.254.2.1 as guest/container address and 169.254.2.2
as default gateway
- for IPv6, don't assign any address (forcibly disable DHCPv6), and
use the *first* link-local address we observe to represent the
guest/container. Advertise fe80::1 as default gateway
- use 'tap0' as default interface name for pasta
Change ifi4 and ifi6 in struct ctx to int and accept a special -1
value meaning that no host interface was selected, but the IP family
is enabled. The fact that the kernel uses unsigned int values for
those is not an issue as 1. one can't create so many interfaces
anyway and 2. we otherwise handle those values transparently.
Fix a botched conditional in conf_print() to actually skip printing
DHCPv6 information if DHCPv6 is disabled (and skip printing NDP
information if NDP is disabled).
Link: https://github.com/containers/podman/issues/24614
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Since 9a0e544f05bf the NDP tests attempt to explicitly wait for DAD to
complete, rather than just having a hard coded sleep. However, the
conditions we use are a bit sloppy and allow for a number of possible cases
where it might not work correctly. Stefano seems to be hitting one of
these (though I'm not sure which) with some later patches.
- We wait for *lack* of a tentative address, so if the first check occurs
before we have even a tentative address it will bypass the delay
- It's not entirely clear if the permanent address will always appear
as soon as the tentative address disappears
- We weren't filtering on interface
- We were doing the filtering with ip-address options rather than in jq.
However in at least in some circumstances this seems to result in an
empty .addr_info field, rather than omitting it entirely, which could
cause us to get the wrong result
So, instead, explicitly wait for the address we need to be present: an
RA provided address on the external interface. While we're here we remove
the requirement that it have global scope: the "kernel_ra" check is already
sufficient to make sure this address comes from an NDP RA, not something
else. If it's not the global scope address we expect, better to check it
and fail, rather than keep waiting.
Fixes: 9a0e544f05bf ("test: Improve test for NDP assigned prefix")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This is very visible with muvm, but it also happens with QEMU: we're
sending the first unsolicited router advertisement milliseconds after
the guest connects.
That's usually pointless because, when the hypervisor connects, the
guest is typically not ready yet to process anything of that sort:
it's still booting. And if we happen to send it late enough (still
milliseconds), with muvm, while the message is discarded, it
sometimes (slightly) delays the response to the first solicited
router advertisement, which is the one we need to have coming fast.
Skip sending the unsolicited advertisement on the first timer run,
just calculate the next delay. Keep it simple by observing that we're
probably not trying to reach the 1970s with IPv6.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
By dropping the filter on prefix length, commit 910f4f910301
("test: Don't require 64-bit prefixes in perf tests") broke tests on
setups where two global unicast IPv6 addresses are available, which
is the typical case when the "host" is a VM running under passt with
addresses from SLAAC and DHCPv6, because two addresses will be
returned.
Pick the first one instead. We don't really care about the prefix
length, any of these addresses will work.
Fixes: 910f4f910301 ("test: Don't require 64-bit prefixes in perf tests")
Link: https://archives.passt.top/passt-dev/20241119214344.6b4a5b3a@elisabeth/
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Fixes: 90e83d50a9bd ("Don't take "our" MAC address from the host")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
It's not true that there's no mapping by default: there's no mapping
in the --map-guest-addr sense, by default, but in that case
the default --map-host-loopback behaviour prevails.
While at it, fix a typo.
Fixes: 57b7bd2a48a1 ("fwd, conf: Allow NAT of the guest's assigned address")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
RFC 9293, 3.8.4 says:
Implementers MAY include "keep-alives" in their TCP implementations
(MAY-5), although this practice is not universally accepted. Some
TCP implementations, however, have included a keep-alive mechanism.
To confirm that an idle connection is still active, these
implementations send a probe segment designed to elicit a response
from the TCP peer. Such a segment generally contains SEG.SEQ =
SND.NXT-1 and may or may not contain one garbage octet of data. If
keep-alives are included, the application MUST be able to turn them
on or off for each TCP connection (MUST-24), and they MUST default to
off (MUST-25).
but currently, tcp_data_from_tap() is not aware of this and will
schedule a fast re-transmit on the second keep-alive (because it's
also a duplicate ACK), ignoring the fact that the sequence number was
rewinded to SND.NXT-1.
ACK these keep-alive segments, reset the activity timeout, and ignore
them for the rest.
At some point, we could think of implementing an approximation of
keep-alive segments on outbound sockets, for example by setting
TCP_KEEPIDLE to 1, and a large TCP_KEEPINTVL, so that we send a single
keep-alive segment at approximately the same time, and never reset the
connection. That's beyond the scope of this fix, though.
Reported-by: Tim Besard <tim.besard@gmail.com>
Link: https://github.com/containers/podman/discussions/24572
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
We enter the timer handler with the ACK_TO_TAP_DUE flag, call
tcp_prepare_flags() with ACK_IF_NEEDED, and realise that we
acknowledged everything meanwhile, so we return early, but we also
need to reset that flag to avoid unnecessarily scheduling the timer
over and over again until more pending data appears.
I'm not sure if this fixes any real issue, but I've spotted this
in several logs reported by users, including one where we have some
unexpected bursts of high CPU load during TCP transfers at low rates,
from https://github.com/containers/podman/issues/23686.
Link: https://github.com/containers/podman/discussions/24572
Link: https://github.com/containers/podman/issues/23686
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
We recently added support for sending unsolicited NDP Router Advertisement
packets. While we (correctly) disable this if the --no-ra option is given
we incorrectly still send them if --no-ndp is set. Fix the oversight.
Fixes: 6e1e44293ef9 ("ndp: Send unsolicited Router Advertisements")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
ndp_timer() is called right away on the first epoll_wait() cycle,
when the communication channel to the guest isn't ready yet:
1.0038: NDP: sending unsolicited RA, next in 264s
1.0038: tap: failed to send 1 frames of 1
check that it's up before sending it. This effectively delays the
first gratuitous router advertisement, which is probably a good idea
given that we expect the guest to send a router solicitation right
away.
Fixes: 6e1e44293ef9 ("ndp: Send unsolicited Router Advertisements")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>