Sets logging shared memory space.
When the back-end has VHOST_USER_PROTOCOL_F_LOG_SHMFD protocol feature,
the log memory fd is provided in the ancillary data of
VHOST_USER_SET_LOG_BASE message, the size and offset of shared memory
area provided in the message.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
[sbrivio: Fix coding style in a bunch of places]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
vu_dev will be needed to log page update.
Add the parameter to:
vring_used_write()
vu_queue_fill_by_index()
vu_queue_fill()
vring_used_idx_set()
vu_queue_flush()
The new parameter is unused for now.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
VHOST_USER_SET_LOG_FD is an optional message with an eventfd
in ancillary data, it may be used to inform the front-end that the
log has been modified.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
[sbrivio: Fix comment to vu_set_log_fd_exec()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
vhost-user protocol specification has been updated with
feature flags and commands we will need to implement migration.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
[sbrivio: Fix comment to union vhost_user_payload]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
There are pretty much two cases of the (misnomer) STALLED: in one
case, we could send more data to the guest if it becomes available,
and in another case, we can't, because we filled the window.
If, in this second case, we keep EPOLLIN enabled, but never read from
the socket, we get short but CPU-annoying storms of EPOLLIN events,
upon which we reschedule the ACK timeout handler, never read from the
socket, go back to epoll_wait(), and so on:
timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0
epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1
timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0
epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1
timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0
epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1
also known as:
29.1517: Flow 2 (TCP connection): timer expires in 2.000s
29.1517: Flow 2 (TCP connection): timer expires in 2.000s
29.1517: Flow 2 (TCP connection): timer expires in 2.000s
which, for some reason, becomes very visible with muvm and aria2c
downloading from a server nearby in parallel chunks.
That's because EPOLLIN isn't cleared if we don't read from the socket,
and even with EPOLLET, epoll_wait() will repeatedly wake us up until
we actually read something.
In this case, we don't want to subscribe to EPOLLIN at all: all we're
waiting for is an ACK segment from the guest. Differentiate this case
with a new connection flag, ACK_FROM_TAP_BLOCKS, which doesn't just
indicate that we're waiting for an ACK from the guest
(ACK_FROM_TAP_DUE), but also that we're blocked waiting for it.
If this flag is set before we set STALLED, EPOLLIN will be masked
while we set EPOLLET because of STALLED. Whenever we clear STALLED,
we also clear this flag.
This is definitely not elegant, but it's a minimal fix.
We can probably simplify this at a later point by having a category
of connection flags directly corresponding to epoll flags, and
dropping STALLED altogether, or, perhaps, always using EPOLLET (but
we need a mechanism to re-check sockets for pending data if we can't
temporarily write to the guest).
I suspect that this might also be implied in
https://github.com/containers/podman/issues/23686, hence the Link:
tag. It doesn't necessarily mean I'm fixing it (I can't reproduce
that).
Link: https://github.com/containers/podman/issues/23686
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Before SO_PEEK_OFF support was introduced by commit e63d281871ef
("tcp: leverage support of SO_PEEK_OFF socket option when available"),
we would peek data from sockets using a "discard" buffer as first
iovec element, so that, unless we had no pending data at all, we would
always get a positive return code from recvmsg() (except for closing
connections or errors).
If we couldn't send more data to the guest, in the window, we would
set the STALLED flag (causing the epoll descriptor to switch to
edge-triggered mode), and return early from tcp_data_from_sock().
With SO_PEEK_OFF, we don't have a discard buffer, and if there's data
on the socket, but nothing beyond our current peeking offset, we'll
get EAGAIN instead of our current "discard" length. In that case, we
return even earlier, and we don't set EPOLLET on the socket as a
result.
As reported by Asahi Lina, this causes event loops where the kernel is
signalling socket readiness, because there's data we didn't dequeue
yet (waiting for the guest to acknowledge it), but we won't actually
peek anything new, and return early without setting EPOLLET.
This is the original report, mentioning the originally proposed fix:
--
When there is unacknowledged data in the inbound socket buffer, passt
leaves the socket in the epoll instance to accept new data from the
server. Since there is already data in the socket buffer, an epoll
without EPOLLET will repeatedly fire while no data is processed,
busy-looping the CPU:
epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4
recvmsg(25, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4
recvmsg(25, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
Add in the missing EPOLLET flag for this case. This brings CPU
usage down from around ~80% when downloading over TCP, to ~5% (use
case: passt as network transport for muvm, downloading Steam games).
--
we can't set EPOLLET unconditionally though, at least right now,
because we don't monitor the guest tap for EPOLLOUT in case we fail
to write on that side because we filled up that buffer (and not the
window of a TCP connection).
Instead, rely on the observation that, once a connection is
established, we only get EAGAIN on recvmsg() if we are attempting to
peek data from a socket with a non-zero peeking offset: we only peek
when there's pending data on a socket, and in that case, if we peek
without offset, we'll always see some data.
And if we peek data with a non-zero offset and get EAGAIN, that means
that we're either waiting for more data to arrive on the socket (which
would cause further wake-ups, even with EPOLLET), or we're waiting for
the guest to acknowledge some of it, which would anyway cause a
wake-up.
In that case, it's safe to set STALLED and, in turn, EPOLLET on the
socket, which fixes the EPOLLIN event loop.
While we're establishing a connection from the socket side, though,
we'll call, once, tcp_{buf,vu}_data_from_sock() to see if we got
any data while we were waiting for SYN, ACK from the guest. See the
comment at the end of tcp_conn_from_sock_finish().
And if there's no data queued on the socket as we check, we'll also
get EAGAIN, even if our peeking offset is zero. For this reason, we
need to additionally check that 'already_sent' is not zero, meaning,
explicitly, that our peeking offset is not zero.
Reported-by: Asahi Lina <lina@asahilina.net>
Fixes: e63d281871ef ("tcp: leverage support of SO_PEEK_OFF socket option when available")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
I inadvertently added that in an unrelated change, but it doesn't make
sense: STALLED means we have pending socket data that we can't write
to the guest, not the other way around.
Fixes: bb708111833e ("treewide: Packet abstraction with mandatory boundary checks")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Under some conditions, linux can provide several buffers
in the same element (multiple entries in the iovec array).
I didn't identify what changed between the kernel guest that
provides one buffer and the one that provides several
(doesn't seem to be a kernel change or a configuration change).
Fix the following assert:
ASSERTION FAILED in virtqueue_map_desc (virtio.c:402): num_sg < max_num_sg
What I can see is the buffer can be splitted in two iovecs:
- vnet header
- packet data
This change manages this special case but the real fix will be to allow
tap_add_packet() to manage iovec array.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Increasingly often, I'm getting occasional failures of the same type
as https://github.com/containers/podman/issues/24147. I guess it
mostly depends on the system load.
It will be a while until I'll actually run tests on a kernel
including my fix for it, kernel commit a502ea6fa94b ("udp: Deal with
race between UDP socket address change and rehash"), so add a horrible
workaround using taskset(1), for the moment.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
csum_unfolded() must call csum_avx2() with a 32byte aligned base address.
To be able to do that if the buffer is not correctly aligned,
it splits the buffers in 2 parts, the second part is 32byte aligned and
can be used with csum_avx2(), the first part is the remaining part, that
is not 32byte aligned and we use sum_16b() to compute the checksum.
A problem appears if the length of the first part is odd because
the checksum is using 16bit words to do the checksum.
If the length is odd, when the second part is computed, all words are
shifted by 1 byte, meaning weight of upper and lower byte is swapped.
For instance a 13 bytes buffer:
bytes:
aa AA bb BB cc CC dd DD ee EE ff FF gg
16bit words:
AAaa BBbb CCcc DDdd EEee FFff 00gg
If we don't split the sequence, the checksum is:
AAaa + BBbb + CCcc + DDdd + EEee + FFff + 00gg
If we split the sequence with an even length for the first part:
(AAaa + BBbb) + (CCcc + DDdd + EEee + FFff + 00gg)
But if the first part has an odd length:
(AAaa + BBbb + 00cc) + (ddCC + eeDD + ffEE + ggFF)
To avoid the problem, do not call csum_avx2() if the first part cannot
have an even length, and compute the checksum of all the buffer using
sum_16b().
This is slower but it can only happen if the buffer base address is odd,
and this can only happen if the binary is built using '-Os', and that
means we have chosen to prioritize size over speed.
Reported-by: Mike Jones <mike@mjones.io>
Link: https://bugs.passt.top/show_bug.cgi?id=108
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Added comment explaining why we check for pad & 1]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In commit 7ecf69329787 ("pasta, tcp: Don't set TCP_CORK on spliced
sockets") I just assumed that we wouldn't benefit from disabling
Nagle's algorithm once we drop TCP_CORK (and its 200ms fixed delay).
It turns out that with some patterns, such as a PostgreSQL server
in a container receiving parameterised, short queries, for which pasta
sees several short inbound messages (Parse, Bind, Describe, Execute
and Sync commands getting each one their own packet, 5 to 49 bytes TCP
payload each), we'll read them usually in two batches, and send them
in matching batches, for example:
9165.2467: pasta: epoll event on connected spliced TCP socket 117 (events: 0x00000001)
9165.2468: Flow 0 (TCP connection (spliced)): 76 from read-side call
9165.2468: Flow 0 (TCP connection (spliced)): 76 from write-side call (passed 524288)
9165.2469: pasta: epoll event on connected spliced TCP socket 117 (events: 0x00000001)
9165.2470: Flow 0 (TCP connection (spliced)): 15 from read-side call
9165.2470: Flow 0 (TCP connection (spliced)): 15 from write-side call (passed 524288)
9165.2944: pasta: epoll event on connected spliced TCP socket 118 (events: 0x00000001)
and the kernel delivers the first one, waits for acknowledgement from
the receiver, then delivers the second one. This adds very substantial
and unnecessary delay. It's usually a fixed ~40ms between the two
batches, which is clearly unacceptable for loopback connections.
In this example, the delay is shown by the timestamp of the response
from socket 118. The peer (server) doesn't actually take that long
(less than a millisecond), but it takes that long for the kernel to
deliver our request.
To avoid batching and delays, disable Nagle's algorithm by setting
TCP_NODELAY on both internal and external sockets: this way, we get
one inbound packet for each original message, we transfer them right
away, and the kernel delivers them to the process in the container as
they are, without delay.
We can do this safely as we don't care much about network utilisation
when there's in fact pretty much no network (loopback connections).
This is unfortunately not visible in the TCP request-response tests
from the test suite because, with smaller messages (we use one byte),
Nagle's algorithm doesn't even kick in. It's probably not trivial to
implement a universal test covering this case.
Fixes: 7ecf69329787 ("pasta, tcp: Don't set TCP_CORK on spliced sockets")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
On Alpine Linux 3.21, passt aborts right away as soon as QEMU connects
to it.
Most likely, this has always been the case with musl, because since
musl commit dc01e2cbfb29 ("add fallback emulation for accept4 on old
kernels"), accept4() without flags is implemented using accept().
However, I guess that nobody realised earlier because it's typically
pasta(1) being used on musl-based distributions, and the only place
where we call accept4() without flags is tap_listen_handler().
Add accept() to the list of allowed system calls regardless of the
presence of accept4().
Reported-by: NN708 <nn708@outlook.com>
Link: https://bugs.passt.top/show_bug.cgi?id=106
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
We don't modify the structure in some virtio functions.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
It was reported that SSDP notifications sent from a container (with
e.g. minidlna) stopped appearing on the network starting from commit
1db4f773e87f ("udp: Improve detail of UDP endpoint sanity checking").
As a minimal reproducer using minidlnad(8):
$ mkdir /tmp/minidlna
$ cat conf
media_dir=/tmp/minidlna
db_dir=/tmp/minidlna
$ ./pasta -d --config-net -- sh -c '/usr/sbin/minidlnad -p 31337 -S -f conf -P /dev/null & (sleep 1; killall minidlnad)'
[...]
1.0327: Flow 0 (NEW): FREE -> NEW
1.0327: Flow 0 (INI): NEW -> INI
1.0327: Flow 0 (INI): TAP [88.198.0.164]:54185 -> [239.255.255.250]:1900 => ?
1.0327: Flow 0 (INI): Invalid endpoint on UDP packet
1.0327: Flow 0 (FREE): INI -> FREE
1.0328: Flow 0 (FREE): TAP [88.198.0.164]:54185 -> [239.255.255.250]:1900 => ?
1.0328: Dropping datagram with no flow TAP 88.198.0.164:54185 -> 239.255.255.250:1900
This is an actual regression as there's no particular reason to block
outbound multicast UDP packets.
And even if we don't handle multicast groups in any particular way
(https://bugs.passt.top/show_bug.cgi?id=2, "Add IGMP/MLD proxy"),
there's no reason to block inbound multicast or broadcast packets
either, should they ever be somehow delivered to passt or pasta.
Let multicast and broadcast packets through, refusing only to
establish flows with unspecified endpoint, as those would actually
cause havoc in the flow table.
IP-wise, SSDP notifications look like this (after this patch), inside
and outside:
$ pasta -p /tmp/minidlna.pcap --config-net -- sh -c '/usr/sbin/minidlnad -p 31337 -S -f minidlna.conf -P /dev/null & (sleep 1; killall minidlnad)'
[...]
$ tshark -a packets:1 -r /tmp/minidlna.pcap ssdp
2 0.074808 88.198.0.164 ? 239.255.255.250 SSDP 200 NOTIFY * HTTP/1.1
# tshark -i ens3 -a packets:1 multicast 2>/dev/null
1 0.000000000 88.198.0.164 ? 239.255.255.250 SSDP 200 NOTIFY * HTTP/1.1
Link: https://github.com/containers/podman/issues/24871
Fixes: 1db4f773e87f ("udp: Improve detail of UDP endpoint sanity checking")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
I don't think it's necessarily productive to check all the possible
error conditions in the Makefile, but this one is annoying: issue
'make' without a C compiler, then install one, and build again.
Then run passt and it will mysteriously terminate on epoll_wait(),
because seccomp.h is good enough to build against, but the resulting
seccomp filter doesn't allow any system call. Not really fun to debug.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
With glibc commit 25a5eb4010df ("string: strerror, strsignal cannot
use buffer after dlmopen (bug 32026)"), strerror() now needs, at least
on x86, the getrandom() and brk() system calls, in order to fill in
the locale-translated error message. But getrandom() and brk() are not
allowed by our seccomp profiles.
This became visible on Fedora Rawhide with the "podman login and
logout" Podman tests, defined at test/e2e/login_logout_test.go in the
Podman source tree, where pasta would terminate upon printing error
descriptions (at least the ones related to the SO_ERROR queue for
spliced connections).
Avoid dynamic memory allocation by calling strerrordesc_np() instead,
which is a GNU function returning a static, untranslated version of
the error description. If it's not available, keep calling strerror(),
which at that point should be simple enough as to be usable (at least,
that's currently the case for musl).
Reported-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/issues/24804
Analysed-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: Paul Holzinger <pholzing@redhat.com>
During testing it is sometimes useful to force traffic which would
normally be forwared by socket splicing through the tap interface.
In this commit, we add a command switch enabling such funtionality
for inbound local traffic.
For outbound local traffic this is much trickier, if even possible,
so leave that for a later commit.
Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We need to initialize vhost-user structures with --fd too.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Merge code from tap_backend_init(), tap_sock_tun_init() and
tap_listen_handler() to set epoll_ref entry and to add it
to epollfd.
No functionality change
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In udp_vu_sock_recv(), collect a segment with a size defined to
IP_MAX_MTU + ETH_HLEN + sizeof(struct virtio_net_hdr_mrg_rxbuf)
The original version double counted the IP header: IP_MAX_MTU includes
the IP header, and so did hdrlen.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In flow_sidx_hash() we verify that the flow we're hashing doesn't have an
unspecified endpoint address, or zero for either port. The hash table only
works if we're looking for exact matches of address and port, and this is
attempting to catch any cases where we might have left address or port
unpopulated or filled with a wildcard.
This doesn't really work though, because there are cases where unspecified
addresses or zero ports are correct:
* We already use unspecified addresses for our address in cases where we
don't know the specific local address for that side, and exclude the
obvious extra check on side->oaddr for that reason.
* Zero port numbers aren't strictly forbidden over the wire. We forbid
them for TCP & UDP because they can't safely be handled on the socket
side. However for ICMP a zero id, which goes in the port field is
valid.
* Possible future flow types (for example, for multicast protocols) might
legitimately have an unspecified address.
Although it makes them easier to miss, these sorts of sanity checks really
have to be done at the protocol / flow type layer, and we already do so.
Remove the checks in flow_sidx_hash() other than checking that the pif
is specified.
Reported-by: Stefan <steffhip@gmail.com>
Link: https://bugs.passt.top/show_bug.cgi?id=105
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In udp_flow_new() we reject a flow if the endpoint isn't unicast, or it has
a zero endpoint port. Those conditions aren't strictly illegal, but we
can't safely handle them at present:
* Multicast UDP endpoints are certainly possible, but our current flow
tracking only makes sense for simple unicast flows - we'll need
different handling if we want to handle multicast flows in future
* It's not entirely clear if port 0 is RFC-ishly correct, but for socket
interfaces port 0 sometimes has a special meaning such as "pick the port
for me, kernel". That makes flows on port 0 unsafe to forward in the
usual way.
For the same reason we also can't safely handle port 0 as our port. In
principle that's also true for our address, however in the case of flows
initiated from a socket, we may not know our address since the socket
could be bound to 0.0.0.0 or ::, so we can only verify that our address
is unicast for flows initiated from the tap side.
Refine the current check in udp_flow_new() to slightly more detailed checks
in udp_flow_from_sock() and udp_flow_from_tap() to make what is and isn't
handled clearer. This makes this checking more similar to what we do for
TCP connections.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In tcp_vu_data_from_sock() we compute IPv4 header checksum only
for the first and the last packets, and re-use the first packet checksum
for all the other packets as the content of the header doesn't change.
It's more accurate to check the dlen value to know if the checksum
should change as dlen is the only information that can change in the
loop.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
TARGET_ARCH is computed from '$(CC) -dumpmachine' using external
bash commands like echo, cut, tr and sed. This can be done using
make internal string functions.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Because the vhost-user <-> virtio-net path ignores checksums, we usually
don't calculate them when sending packets to the guest. So, we always
pass no_tcp_csum=true to tcp_fill_headers(). We do want accurate
checksums when capturing packets though, so the captures don't show bogus
values.
Currently we handle this by updating the checksum field immediately before
writing the packet to the capture file, using tcp_vu_update_check(). This
is unnecessary, though: in each case tcp_fill_headers() is called not very
long before, so we can alter its no_tcp_csum parameter pased on whether
we're generating captures or not.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We have different versions of this function for IPv4 and IPv6, but the
caller already requires some IP version specific code to get the right
header pointers. Instead, have a common function that fills either an
IPv4 or an IPv6 header based on which header pointer it is passed. This
allows us to remove a small amount of code duplication and make a few
slightly ugly conditionals.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The only reason we need separate functions for the IPv4 and IPv6 case is
to calculate the checksum of the IP pseudo-header, which is different for
the two cases. However, the caller already knows which path it's on and
can access the values needed for the pseudo-header partial sum more easily
than tcp_update_check_tcp[46]() can.
So, merge these functions into a single tcp_update_csum() function that
just takes the pseudo-header partial sum, calculated in the caller.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
At the moment these take separate pointers to the tap specific and IP
headers, but expect the TCP header and payload as a single tcp_payload_t.
As well as being slightly inconsistent, this involves some slightly iffy
pointer shenanigans when called on the flags path with a tcp_flags_t
instead of a tcp_payload_t.
More importantly, it's inconvenient for the upcoming vhost-user case, where
the TCP header and payload might not be contiguous. Furthermore, the
payload itself might not be contiguous.
So, pass the TCP header as its own pointer, and the TCP payload as an IO
vector.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently these expects both the TCP header and payload in a single IOV,
and goes to some trouble to locate the checksum field within it. In the
current caller we've already know where the TCP header is, so we might as
well just pass it in. This will need to work a bit differently for
vhost-user, but that code already needs to locate the TCP header for other
reasons, so again we can just pass it in.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We usually want to checksum only the tail part of a frame, excluding at
least some headers. csum_iov() does that for a frame represented as an
IO vector, not actually summing the entire IO vector. We now have struct
iov_tail to explicitly represent this construct, so replace csum_iov()
with csum_iov_tail() taking that representation rather than 3 parameters.
We propagate the same change to csum_udp4() and csum_udp6() which take
similar parameters. This slightly simplifies the code, and will allow some
further simplifications as struct iov_tail is more widely used.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In the vhost-user code we have a number of places where we need to locate
a particular header within the guest-supplied IO vector. We need to work
out which buffer the header is in, and verify that it's contiguous and
aligned as we need. At the moment this is open-coded, but introduce a
helper to make this more straightforward.
We add a new datatype 'struct iov_tail' representing an IO vector from
which we've logically consumed some number of headers. The IOV_PULL_HEADER
macro consumes a new header from the vector, returning a pointer and
updating the iov_tail.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
...to quickly suppress a false positive from Coverity, which assumes
that iov_size is 0 and 'dlen' might overflow as a result (with hdrlen
being 66). An ASSERT() in tcp_vu_sock_recv() already guarantees that
iov_size(iov, buf_cnt) here is anyway greater than 'hdrlen'.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
If the connection to the vhost-user front end is closed during transfers
virtio rings are deconfigured and not available anymore, but we can
try to access them to process queued data. This can trigger a SIGSEG as
we try to access unavailable memory.
To fix that check vq->vring.avail is sane before accessing the vring
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This function only has callers in tcp_buf.c. More importantly, it's
inherently tied to the "buf" path, because it uses internal knowledge of
how we lay out the various headers across our locally allocated buffers.
Therefore, move it to tcp_buf.c.
Slightly reformat the prototypes while we're at it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Run functional and performance tests for vhost-user mode as well. For
functional tests, we add passt_vu and passt_vu_in_ns as symbolic links
to their non-vhost-user counterparts, as no differences are intended
but we want to distinguish them in test logs.
For performance tests, instead, we add separate perf/passt_vu_tcp and
perf/passt_vu_udp files, as we need longer test duration, as well as
higher UDP sending bandwidths and larger TCP windows, to actually get
the highest throughput vhost-user mode offers.
For valgrind tests, vhost-user mode needs two extra system calls:
statx and readlink. Add them as EXTRA_SYSCALLS for the valgrind
target.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
add virtio and vhost-user functions to connect with QEMU.
$ ./passt --vhost-user
and
# qemu-system-x86_64 ... -m 4G \
-object memory-backend-memfd,id=memfd0,share=on,size=4G \
-numa node,memdev=memfd0 \
-chardev socket,id=chr0,path=/tmp/passt_1.socket \
-netdev vhost-user,id=netdev0,chardev=chr0 \
-device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \
...
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: as suggested by lvivier, include <netinet/if_ether.h>
before including <linux/if_ether.h> as C libraries such as musl
__UAPI_DEF_ETHHDR in <netinet/if_ether.h> if they already have
a definition of struct ethhdr]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Add virtio.c and virtio.h that define the functions needed
to manage virtqueues.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
To be able to manage buffers inside a shared memory provided
by a VM via a vhost-user interface, we cannot rely on the fact
that buffers are located in a pre-defined memory area and use
a base address and a 32bit offset to address them.
We need a 64bit address, so replace struct desc by struct iovec
and update range checking.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
It's widely considered a legacy option nowadays, and I've haven't seen
clients setting it since Windows 95, but it's convenient for a minimal
DHCP client not using raw IP sockets such as what I'm playing with for
muvm.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
I'm trying to speed up and simplify IP address acquisition in muvm.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
We want to add support for option 80 (Rapid Commit, RFC 4039), whose
length is 0.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
There are setups where no host interface is available or configured
at all, intentionally or not, temporarily or not, but users expect
(Podman) containers to run in any case as they did with slirp4netns,
and we're now getting reports that we broke such setups at a rather
alarming rate.
To this end, if we don't find any usable host interface, instead of
exiting:
- for IPv4, use 169.254.2.1 as guest/container address and 169.254.2.2
as default gateway
- for IPv6, don't assign any address (forcibly disable DHCPv6), and
use the *first* link-local address we observe to represent the
guest/container. Advertise fe80::1 as default gateway
- use 'tap0' as default interface name for pasta
Change ifi4 and ifi6 in struct ctx to int and accept a special -1
value meaning that no host interface was selected, but the IP family
is enabled. The fact that the kernel uses unsigned int values for
those is not an issue as 1. one can't create so many interfaces
anyway and 2. we otherwise handle those values transparently.
Fix a botched conditional in conf_print() to actually skip printing
DHCPv6 information if DHCPv6 is disabled (and skip printing NDP
information if NDP is disabled).
Link: https://github.com/containers/podman/issues/24614
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>