1
0
mirror of https://passt.top/passt synced 2024-06-28 05:52:41 +00:00
Commit Graph

1029 Commits

Author SHA1 Message Date
Paul Holzinger
3487b5fc93 pasta: do not leak netlink sock into child
When spawning a child command with pasta command... pasta should not
leak fds that it opened. Only the fds that were already open should be
given to the child.

Run `pasta --config-net -- ls -l /proc/self/fd` from a terminal where
only stdin/out/err are open. The fd 3 was opend by ls to read the
/proc/self/fd dir. But fd 5 is the netlink socket that was opend in
pasta. To prevent such a leak we will open the socket with SOCK_CLOEXEC.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-02-12 23:42:34 +01:00
David Gibson
7a8ed9459d Make assertions actually useful
There are some places in passt/pasta which #include <assert.h> and make
various assertions.  If we hit these something has already gone wrong, but
they're there so that we a useful message instead of cryptic misbehaviour
if assumptions we thought were correct turn out not to be.

Except.. the glibc implementation of assert() uses syscalls that aren't in
our seccomp filter, so we'll get a SIGSYS before it actually prints the
message.  Work around this by adding our own ASSERT() implementation using
our existing err() function to log the message, and an abort().  The
abort() probably also won't work exactly right with seccomp, but once we've
printed the message, dying with a SIGSYS works just as well as dying with
a SIGABRT.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-02-12 23:42:24 +01:00
Stefano Brivio
cc6d8286d1 tcp: Reset ACK_FROM_TAP_DUE flag only as needed, update timer
David reports that TCP transfers might stall, especially with smaller
socket buffer sizes, because we reset the ACK_FROM_TAP_DUE flag, in
tcp_tap_handler(), whenever we receive an ACK segment, regardless of
its sequence number and the fact that we might still be waiting for
one. This way, we might fail to re-transmit frames on ACK timeouts.

We need, instead, to:

- indicate with the @retrans field only re-transmissions for the same
  data sequences. If we make progress, it should be reset, given that
  it's used to abort a connection when we exceed a given number of
  re-transmissions for the same data

- unset the ACK_FROM_TAP_DUE flag if and only if the acknowledged
  sequence is the same as the last one we sent, as suggested by David

- keep it set otherwise, if progress was done but not all the data we
  sent was acknowledged, and update the expiration of the ACK timeout

Add a new helper for these purposes, tcp_update_seqack_from_tap().

To extend the ACK timeout, the new helper sets the ACK_FROM_TAP_DUE
flag, even if it was already set, and conn_flag_do() triggers a timer
update. This part should be revisited at a later time, because,
strictly speaking, ACK_FROM_TAP_DUE isn't a flag anymore. One
possibility might be to introduce another connection attribute for
events affecting timer deadlines.

Reported-by: David Gibson <david@gibson.dropbear.id.au>
Link: https://bugs.passt.top/show_bug.cgi?id=41
Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Fixes: be5bbb9b06 ("tcp: Rework timers to use timerfd instead of periodic bitmap scan")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-02-12 22:26:55 +01:00
Stefano Brivio
ac153595c0 tap: Send frames after the first one in tap_send_frames_pasta()
...instead of repeatedly sending out the first one in iov.

Fixes: e21ee41ac3 ("tcp: Combine two parts of pasta tap send path together")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-02-12 14:34:38 +01:00
Stefano Brivio
d8921dafe5 pasta: Wait for tap to be set up before spawning command
Adapted from a patch by Paul Holzinger: when pasta spawns a command,
operating without a pre-existing user and network namespace, it needs
to wait for the tap device to be configured and its handler ready,
before the command is actually executed.

Otherwise, something like:
  pasta --config-net nslookup passt.top

usually fails as the nslookup command is issued before the network
interface is ready.

We can't adopt a simpler approach based on SIGSTOP and SIGCONT here:
the child runs in a separate PID namespace, so it can't send SIGSTOP
to itself as the kernel sees the child as init process and blocks
the delivery of the signal.

We could send SIGSTOP from the parent, but this wouldn't avoid the
possible condition where the child isn't ready to wait for it when
the parent sends it, also raised by Paul -- and SIGSTOP can't be
blocked, so it can never be pending.

Use SIGUSR1 instead: mask it before clone(), so that the child starts
with it blocked, and can safely wait for it. Once the parent is
ready, it sends SIGUSR1 to the child. If SIGUSR1 is sent before the
child is waiting for it, the kernel will queue it for us, because
it's blocked.

Reported-by: Paul Holzinger <pholzing@redhat.com>
Fixes: 1392bc5ca0 ("Allow pasta to take a command to execute")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-02-12 12:22:59 +01:00
David Gibson
54502cca7f udp: Use tap_send_frames()
To send frames on the tap interface, the UDP uses a fairly complicated two
level batching.  First multiple frames are gathered into a single "message"
for the qemu stream socket, then multiple messages are send with
sendmmsg().  We now have tap_send_frames() which already deals with sending
a number of frames, including batching and handling partial sends.  Use
that to considerably simplify things.

This does make a couple of behavioural changes:
  * We used to split messages to keep them under 32kiB (except when a
    single frame was longer than that).  The comments claim this is
    needed to stop qemu from closing the connection, but we don't have any
    equivalent logic for TCP.  I wasn't able to reproduce the problem with
    this series, although it was apparently easy to reproduce earlier.

    My suspicion is that there was never an inherent need to keep messages
    small, however with larger messages (and default kernel buffer sizes)
    the chances of needing more than one resend for partial send()s is
    greatly increased.  We used not to correctly handle that case of
    multiple resends, but now we do.

  * Previously when we got a partial send on UDP, we would resend the
    remainder of the entire "message", including multiple frames.  The
    common code now only resends the remainder of a single frame, simply
    dropping any frames which weren't even partially sent.  This is what
    TCP always did and is probably a better idea for UDP too.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-23 18:55:04 +01:00
David Gibson
2d553b587a tap: Improve handling of partial frame sends
In passt mode, when writing frames to the qemu socket, we might get a short
send.  If we ignored this and carried on, the qemu socket would get out of
sync, because the bytes we actually sent wouldn't correspond  to the length
header we already sent.  tap_send_frames_passt() handles that by doing a
a blocking send to complete the message, but it has a few flaws:
 * We only attempt to resend once: although it's unlikely in practice,
   nothing prevents the blocking send() from also being short
 * We print a debug error if send() returns non-zero.. but send() returns
   the number of bytes sent, so we actually want it to return the length
   of the remaining data.

Correct those flaws and also be a bit more thorough about reporting
problems here.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-23 18:55:01 +01:00
David Gibson
97d1ca2ed6 udp: Use abstracted tap header
Update the UDP code to use the tap layer abstractions for initializing and
updating the L2 and lower headers.  This will make adding other tap
backends in future easier.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-23 18:54:59 +01:00
David Gibson
0fb7b2b908 tap: Use different io vector bases depending on tap type
Currently tap_send_frames() expects the frames it is given to include the
vnet_len field, even in pasta mode which doesn't use it (although it need
not be initialized in that case).  To match, tap_iov_base() and
tap_iov_len() construct the frame in that way.

This will inconvenience future changes, so alter things to set the buffers
to include just the frame needed by the tap backend type.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-23 18:54:57 +01:00
David Gibson
2416310a17 tcp: Use abstracted tap header
Update the TCP code to use the tap layer abstractions for initializing and
updating the L2 and lower headers.  This will make adding other tap
backends in future easier.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-23 18:54:54 +01:00
David Gibson
4b3d38a069 tap: Add "tap headers" abstraction
Currently both the TCP and UDP code need to deal in various places with the
details of the L2 headers, and also the tap-specific "vnet_len" header.
This makes abstracting the tap interface to new backends (e.g. vhost-user
or tun) more difficult.

To improve this abstraction, create a new 'tap_hdr' structure which
represents both L2 (always Ethernet at the moment, but might be vary in
future) and any additional tap specific headers (such as the qemu socket's
vnet_len field).  Provide helper functions and macros to initialize, update
and use it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-23 18:54:52 +01:00
David Gibson
e4443ba9bd tcp: Consolidate calculation of total frame size
tcp_l2_buf_fill_headers() returns the size of the generated frame including
the ethernet header.  The caller then adds on the size of the vnet_len
field to get the total frame size to be passed to the tap device.

Outside the tap code, though, we never care about the ethernet header size
only the final total size we need to put into an iovec.  So, consolidate
the total frame size calculation within tcp_l2_buf_fill_headers().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-23 18:54:49 +01:00
David Gibson
dcea26076d tcp: Remove redundant and incorrect initialization from *_iov_init()
tcp_sock[46]_iov_init() initialize the length of each iovec buffer to
MSS_DEFAULT.  That will always be overwritten before use in
tcp_data_to_tap, so it's redundant.  It also wasn't correct, because it
didn't correctly account for the header lengths in all cases.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-23 18:54:47 +01:00
David Gibson
716a926ef4 util: Parameterize ethernet header initializer macro
We have separate IPv4 and IPv6 versions of a macro to construct an
initializer for ethernet headers.  However, now that we have htons_constant
it's easy to simply paramterize this with the ethernet protocol number.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-23 18:54:46 +01:00
David Gibson
67afaab411 tcp, udp: Use named field initializers in iov_init functions
Both the TCP and UDP iov_init functions have some large structure literals
defined in "field order" style.  These are pretty hard to read since it's
not obvious what value corresponds to what field.  Use named field style
initializers instead to make this clearer.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-23 18:54:44 +01:00
David Gibson
f5a950115b util: Introduce hton*_constant() in place of #ifdefs
We have several places where we have fairly ugly #ifdefs on __BYTE_ORDER
where we need network order values in a constant expression (so we can't
use htons() or htonl()).  We can do this more cleanly by using a single
__BYTE_ORDER ifdef to define htons_constant() and htonl_constant()
macros, then using those in all the other places.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-23 18:54:42 +01:00
David Gibson
6d011c1faa tap, tcp: Move tap send path to tap.c
The functions which do the final steps of sending TCP packets on through
the tap interface - tcp_l2_buf_flush*() - no longer have anything that's
actually specific to TCP in them, other than comments and names.  Move them
all to tap.c.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-23 18:54:40 +01:00
David Gibson
e21ee41ac3 tcp: Combine two parts of pasta tap send path together
tcp_l2_buf_flush() open codes the loop across each frame in a group, but
but calls tcp_l2_buf_write_one() to send each frame to the pasta tuntap
device.  Combine these two pasta-specific operations into
tcp_l2_buf_flush_pasta() which is a little cleaner and will enable further
cleanups.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-23 18:54:39 +01:00
David Gibson
a79e774770 tcp: Improve interface to tcp_l2_buf_flush()
Currently this takes a msghdr, but the only thing we actually care
about in there is the io vector.  Make it take an io vector directly.
We also have a weird side effect of zeroing @buf_used.  Just pass this
by value and zero it in the caller instead.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-23 18:54:36 +01:00
David Gibson
f1d2d35416 tcp: Don't compute total bytes in a message until we need it
tcp[46]_l2_buf_bytes keep track of the total number of bytes we have
queued to send to the tap interface.  tcp_l2_buf_flush_passt() uses this
to determine if sendmsg() has sent all the data we requested, or whether
we need to resend a trailing portion.

However, the logic for finding where we're up to in the case of a short
sendmsg() can equally well tell whether we've had one at all, without
knowing the total number in advance.  This does require an extra loop after
each sendmsg(), but it's doing simple arithmetic on values we've already
been accessing, and it leads to overall simpler code.

tcp[46]_l2_flags_buf_bytes were being calculated, but never used for
anything, so simply remove them.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-23 18:54:33 +01:00
David Gibson
99f0be3cad tcp: Combine two parts of passt tap send path together
tcp_l2_buf_flush() open codes the "primary" send of message to the passt
tap interface, but calls tcp_l2_buf_flush_part() to handle the case of a
short send.  Combine these two passt-specific operations into
tcp_l2_buf_flush_passt() which is a little cleaner and will enable furrther
cleanups.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-23 18:54:30 +01:00
David Gibson
d3089eb0ea pcap: Replace pcapm() with pcap_multiple()
pcapm() captures multiple frames from a msghdr, however the only thing it
cares about in the msghdr is the list of buffers, where it assumes there is
one frame to capture per buffer.  That's what we want for its single caller
but it's not the only obvious choice here (one frame per msghdr would
arguably make more sense in isolation).  In addition pcapm() has logic
that only makes sense in the context of the passt specific path its called
from: it skips the first 4 bytes of each buffer, because those have the
qemu vnet_len rather than the frame proper.

Make this clearer by replacing pcapm() with pcap_multiple() which more
explicitly takes one struct iovec per frame, and parameterizes how much of
each buffer to skip (i.e. the offset of the frame within the buffer).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-23 18:54:27 +01:00
David Gibson
cb3c1ce307 pcap: Introduce pcap_frame() helper
pcap(), pcapm() and pcapmm() duplicate some code, for the actual writing
to the capture file.  The purpose of pcapm() and pcapmm() not calling
pcap() seems to be to avoid repeatedly calling gettimeofday() and to avoid
printing errors for every packet in a batch if there's a problem.  We can
accomplish that while still sharing code by adding a new helper which
takes the packet timestamp as a parameter.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-23 18:48:51 +01:00
David Gibson
b93d025d50 udp: Don't use separate sockets to listen for spliced packets
Currently, when ports are forwarded inbound in pasta mode, we open two
sockets for incoming traffic: one listens on the public IP address and will
forward packets to the tuntap interface.  The other listens on localhost
and forwards via "splicing" (resending directly via sockets in the ns).

Now that we've improved the logic about whether we "splice" any individual
packet, we don't need this.  Instead we can have a single socket bound to
0.0.0.0 or ::, marked as able to splice and udp_sock_handler() will deal
with each packet as appropriate.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-13 01:07:12 +01:00
David Gibson
8d503e825f udp: Decide whether to "splice" per datagram rather than per socket
Currently we have special sockets for receiving datagrams from locahost
which can use the optimized "splice" path rather than going across the tap
interface.

We want to loosen this so that sockets can receive sockets that will be
forwarded by both the spliced and non-spliced paths.  To do this, we alter
the meaning of the @splice bit in the reference to mean that packets
receieved on this socket *can* be spliced, not that they *will* be spliced.
They'll only actually be spliced if they come from 127.0.0.1 or ::1.

We can't (for now) remove the splice bit entirely, unlike with TCP.  Our
gateway mapping means that if the ns initiates communication to the gw
address, we'll translate that to target 127.0.0.1 on the host side.  Reply
packets will therefore have source address 127.0.0.1 when received on the
host, but these need to go via the tap path where that will be translated
back to the gateway address.  We need the @splice bit to distinguish that
case from packets going from localhost to a port mapped explicitly with
-u which should be spliced.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-13 01:07:09 +01:00
David Gibson
8a10f23720 udp: Unify udp_sock_handler_splice() with udp_sock_handler()
These two functions now have a very similar structure, and their first
part (calling recvmmsg()) is functionally identical.  So, merge the two
functions into one.

This does have the side effect of meaning we no longer receive multiple
packets at once for splice (we already didn't for tap).  This does hurt
throughput for small spliced packets, but improves it for large spliced
packets and tap packets.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-13 01:07:06 +01:00
David Gibson
f1ed8dbfa7 udp: Pre-populate msg_names with local address
udp_splice_namebuf is now used only for spliced sending, and so it is
only ever populated with the localhost address, either IPv4 or IPv6.
So, replace the awkward initialization in udp_sock_handler_splice()
with statically initialized versions for IPv4 and IPv6.  We then just
need to update the port number in udp_sock_handler_splice().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-13 01:07:03 +01:00
David Gibson
c9e193b5ae udp: Don't handle tap receive batch size calculation within a #define
UDP_MAX_FRAMES gives the maximum number of datagrams we'll ever handle as a
batch for sizing our buffers and control structures.  The subtly different
UDP_TAP_FRAMES gives the maximum number of datagrams we'll actually try to
receive at once for tap packets in the current configuration.

This depends on the mode, meaning that the macro has a non-obvious
dependency on the usual 'c' context variable being available.  We only use
it in one place, so it makes more sense to open code this.  Add an
explanatory comment while we're there.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-13 01:07:01 +01:00
David Gibson
4eb54fd2e7 udp: Split receive from preparation and send in udp_sock_handler()
The receive part of udp_sock_handler() and udp_sock_handler_splice() is now
almost identical.  In preparation for merging that, split the receive part
of udp_sock_handler() from the part preparing and sending the frames for
sending on the tap interface.  The latter goes into a new udp_tap_send()
function.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-13 01:06:58 +01:00
David Gibson
09c00f1d2a udp: Split sending to passt tap interface into separate function
The last part of udp_sock_handler() does the actual sending of frames
to the tap interface.  For pasta that's just a call to
udp_tap_send_pasta() but for passt, it's moderately complex and open
coded.

For symmetry, move the passt send path into its own function,
udp_tap_send_passt().  This will make it easier to abstract the tap
interface in future (e.g. when we want to add vhost-user).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-13 01:06:55 +01:00
David Gibson
a6e919a951 udp: Move sending pasta tap frames to the end of udp_sock_handler()
udp_sock_handler() has a surprising difference in flow between pasta and
passt mode: For pasta we send each frame to the tap interface as we prepare
it.  For passt, though, we prepare all the frames, then send them with a
single sendmmsg().

Alter the pasta path to also prepare all the frames, then send them at the
end.  We already have a suitable data structure for the passt case.  This
will make it easier to abstract out the tap backend difference in future.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-13 01:06:51 +01:00
Stefano Brivio
c196953f1e test/perf/pasta_tcp: Add host to namespace cases for traffic via tap
Similarly to UDP cases, these were missing as it wasn't clear, when
the other tests were introduced, if using the global address of a
namespace, from the host, should have resulted in connections being
routed via the tap interface.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-01-05 15:08:11 +01:00
Stefano Brivio
37f82ccd9f tcp: Explicitly check option length field values in tcp_opt_get()
Reported by Coverity (CWE-606, Untrusted loop bound), and actually
harmless because we'll exit the option-scanning loop if the remaining
length is not enough for a new option, instead of reading past the
header.

In any case, it looks like a good idea to explicitly check for
reasonable values of option lengths.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-01-05 15:08:07 +01:00
Stefano Brivio
08c01f5b4e test/perf/pasta_udp: Add host to namespace cases for traffic via tap
These were missing as it wasn't clear, when the other tests were
introduced, if using the global address of a namespace, from the
host, should have resulted in traffic being routed via the tap
interface (as opposed to the loopback interface). We now clarified
that's actually the case.

Use same values and thresholds as the tests for loopback traffic, as
throughput figures currently indicate there isn't much difference.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-01-05 15:07:54 +01:00
David Gibson
310bdbdcf4 udp: Factor out control structure management from udp_sock_fill_data_v[46]
The main purpose of udp_sock_fill_data_v[46]() is to construct the IP, UDP
and other headers we'll need to forward data onto the tap interface.  In
addition they update the control structures (iovec and mmsghdr) we'll need
to send the messages, and in the case of pasta actually sends it.

This leads the control structure management and the send itself awkwardly
split between udp_sock_fill_data_v[46]() and their caller
udp_sock_handler().  In addition, this tail part of udp_sock_fill_datav[46]
is essentially common between the IPv4 and IPv6 versions, apart from which
control array we're working on.

Clean this up by reducing these functions to just construct the headers
and renaming them to udp_update_hdr[46]() accordingly.  The control
structure updates are now all in the caller, and common for IPv4 and IPv6.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-12-06 07:42:22 +01:00
David Gibson
4b2d227c86 udp: Preadjust udp[46]_l2_iov_tap[].iov_base for pasta mode
Currently, we always populate udp[46]_l2_iov_tap[].iov_base with the
very start of the header buffers, including space for the qemu vnet_len
tag suitable for passt mode.  That's ok because we don't actually use these
iovecs for pasta mode.

However, we do know the mode in udp_sock[46]_iov_init() so adjust these
to the beginning of the headers we'll actually need for the mode: including
the vnet_len tag for passt, but excluding it for pasta.

This allows a slightly nicer way to locate the right buffer to send in the
pasta case, and will allow some additional cleanups later.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-12-06 07:42:19 +01:00
David Gibson
d533807027 udp: Better factor IPv4 and IPv6 paths in udp_sock_handler()
Apart from which mh array they're operating on the recvmmsg() calls in
udp_sock_handler() are identical between the IPv4 and IPv6 paths, as are
some of the control structure updates.

By using some local variables to refer to the IP version specific control
arrays, make some more logic common between the IPv4 and IPv6 paths.  As
well as slightly reducing the code size, this makes it less likely that
we'll accidentally use the IPv4 arrays in the IPv6 path or vice versa as we
did in a recently fixed bug.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-12-06 07:42:16 +01:00
David Gibson
6af7ee74cf udp: Fix incorrect use of IPv6 mh buffers in IPv4 path
udp_sock_handler() incorrectly uses udp6_l2_mh_tap[] on the IPv4 path.  In
fact this is harmless because this assignment is redundant (the 0th entry
msg_hdr will always point to the 0th iov entry for both IPv4 and IPv6 and
won't change).

There is also an incorrect usage of udp6_l2_mh_tap[] in
udp_sock_fill_data_v4.  This one can cause real problems, because we'll
use stale iov_len values if we send multiple messages to the qemu socket.
Most of the time that will be relatively harmless - we're likely to either
drop UDP packets, or send duplicates.  However, if the stale iov_len we
use ends up referencing an uninitialized buffer we could desynchronize the
qemu stream socket.

Correct both these bugs.  The UDP6 path appears to be correct, but it does
have some comments that incorrectly reference the IPv4 versions, so fix
those as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-12-06 07:42:07 +01:00
David Gibson
34764ea4f3 udp: Correct splice forwarding when receiving from multiple sources
udp_sock_handler_splice() reads a whole batch of datagrams at once with
recvmmsg().  It then forwards them all via a single socket on the other
side, based on the source port.

However, it's entirely possible that the datagrams in the set have
different source ports, and thus ought to be forwarded via different
sockets on the destination side.  In fact this situation arises with the
iperf -P4 throughput tests in our own test suite.  AFAICT we only get away
with this because iperf3 is strictly one way and doesn't send reply packets
which would be misdirected because of the incorrect source ports.

Alter udp_sock_handler_splice() to split the packets it receives into
batches with the same source address and send each batch with a separate
sendmmsg().

For now we only look for already contiguous batches, which means that if
there are multiple active flows interleaved this is likely to degenerate
to batches of size 1.  For now this is the simplest way to correct the
behaviour and we can try to optimize later.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-12-06 07:42:03 +01:00
David Gibson
2dec914209 udp: Split send half of udp_sock_handler_splice() from the receive half
Move the part of udp_sock_handler_splice() concerned with sending out the
datagrams into a new udp_splice_sendfrom() helper.  This will make later
cleanups easier.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-12-06 07:42:00 +01:00
David Gibson
fc7f91e709 udp: Unify buffers for tap and splice paths
We maintain a set of buffers for UDP packets to be forwarded via the tap
interface in udp[46]_l2_buf.  We then have a separate set of buffers for
packets to be "spliced" in udp_splice_buf[].  However, we only use one of
these at a time, so we can share the buffer space.

For the receiving splice packets we can not only re-use the data buffers
but also the udp[46]_l2_iov_sock and udp[46]_l2_mh_sock control structures.

For sending the splice packets we keep the same data buffers, but we need
specific control structures.  We create udp[46]_iov_splice - we can't
reuse udp_l2_iov_sock[] because we need to write iov_len as we're writing
spliced packets, but the tap path expects iov_len to remain the same (it
only uses it for receive).  Likewise we create udp[46]_mh_splice with the
mmsghdr structures for sending spliced packets.  As well as needing to
reference different iovs, these need to all reference udp_splice_namebuf
instead of individual msg_name fields for each slot.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-12-06 07:41:56 +01:00
David Gibson
c52ca4aecf udp: Add helper to extract port from a sockaddr_in or sockaddr_in6
udp_sock_handler_splice() has a somewhat clunky if to extract the port from
a socket address which could be either IPv4 or IPv6.  Future changes are
going to make this even more clunky, so introduce a helper function to
do this extraction.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-12-06 07:41:53 +01:00
David Gibson
e7e2a321ba udp: Make UDP_SPLICE_FRAMES and UDP_TAP_FRAMES_MEM the same thing
These two constants have the same value, and there's not a lot of reason
they'd ever need to be different.  Future changes will further integrate
the spliced and "tap" paths so that these need to be the same.  So, merge
them into UDP_MAX_FRAMES.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-12-06 07:41:51 +01:00
David Gibson
5b0027f942 udp: Simplify udp_sock_handler_splice
Previous cleanups mean that we can now rework some complex ifs in
udp_sock_handler_splice() into a simpler set.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-12-06 07:41:48 +01:00
David Gibson
71d2595a8f udp: Update UDP "connection" timestamps in both directions
A UDP pseudo-connection between port A in the init namespace and port B in
the pasta guest namespace involves two sockets: udp_splice_init[v6][B]
and udp_splice_ns[v6][A].  The socket which originated this "connection"
will be permanent but the other one will be closed on a timeout.

When we get a packet from the originating socket, we update the timeout on
the other socket, but we don't do the same when we get a reply packet from
the other socket.  However any activity on the "connection" probably
indicates that it's still in use.  Without this we could incorrectly time
out a "connection" if it's using a protocol which involves a single
initiating packet, but which then gets continuing replies from the target.

Correct this by updating the timeout on both sockets for a packet in either
direction.  This also updates the timestamps for the permanent originating
sockets which is unnecessary, but harmless.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-12-06 07:41:45 +01:00
David Gibson
7610034fef udp: Don't explicitly track originating socket for spliced "connections"
When we look up udp_splice_to_ns[][].orig_sock in udp_sock_handler_splice()
we're finding the socket on which the originating packet for the
"connection" was received on.  However, we don't specifically need this
socket to be the originating one - we just need one that's bound to the
the source port of this reply packet in the init namespace.  We can look
this up in udp_splice_to_init[v6][src].target_sock, whose defining
characteristic is exactly that.  The same applies with init and ns swapped.

In practice, of course, the port we locate this way will always be the
originating port, since we couldn't have started this "connection" if it
wasn't.

Change this, and we no longer need the @orig_sock field at all. That
leaves just @target_sock which we rename to simply @sock.  The whole
udp_splice_flow structure now more represents a single bound port than
a "flow" per se, so rename and recomment it accordingly.  Likewise the
udp_splice_to_{ns,init} names are now misleading, since the ports in
those maps are used in both directions.  Rename them to
udp_splice_{ns,init} indicating the location where the described
socket is bound.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-12-06 07:41:42 +01:00
David Gibson
27bfebb061 udp: Re-use fixed bound sockets for packet forwarding when possible
When we look up udp_splice_to_ns[v6][src].target_sock in
udp_sock_handler_splice, all we really require of the socket is that it
be bound to port src in the pasta guest namespace.  Similarly for
udp_splice_to_init but bound in the init namespace.

Usually these sockets are created temporarily by udp_splice_connect() and
cleaned up by udp_timer().  However, depending on the -u and -U options its
possible we have a permanent socket bound to the relevant port created by
udp_sock_init().  If such a socket exists, we could use it instead of
creating a temporary one.  In fact we *must* use it, because we'll fail
trying to bind() a temporary one to the same port.

So allow this, store permanently bound sockets into udp_splice_to_{ns,init}
in udp_sock_init().  These won't get incorrectly removed by the timer
because we don't put a corresponding entry in the udp_act[] structure
which directs the timer what to clean up.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-12-06 07:41:40 +01:00
David Gibson
c277c6dd7d udp: Don't create double sockets for -U port
For each IP version udp_socket() has 3 possible calls to sock_l4().  One
is for the "non-spliced" bound socket in the init namespace, one for the
"spliced" bound socket in the init namespace and one for the "spliced"
bound socket in the pasta namespace.

However when this is called to create a socket in the pasta namspeace there
is a logic error which causes it to take the path for the init side spliced
socket as well as the ns socket.  This essentially tries to create two
identical sockets on the ns side.  Unsurprisingly the second bind() call
fails according to strace.

Correct this to only attempt to open one socket within the ns.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-12-06 07:41:35 +01:00
David Gibson
d9394eb9b7 udp: Split splice field in udp_epoll_ref into (mostly) independent bits
The @splice field in union udp_epoll_ref can have a number of values for
different types of "spliced" packet flows.  Split it into several single
bit fields with more or less independent meanings.  The new @splice field
is just a boolean indicating whether the socket is associated with a
spliced flow, making it identical to the @splice fiend in tcp_epoll_ref.

The new bit @orig, indicates whether this is a socket which can originate
new udp packet flows (created with -u or -U) or a socket created on the
fly to handle reply socket.  @ns indicates whether the socket lives in the
init namespace or the pasta namespace.

Making these bits more orthogonal to each other will simplify some future
cleanups.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-12-06 07:41:31 +01:00
David Gibson
8517239243 udp: Remove the @bound field from union udp_epoll_ref
We set this field, but nothing ever checked it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-12-06 07:41:28 +01:00