1
0
mirror of https://passt.top/passt synced 2024-12-22 13:45:32 +00:00
Commit Graph

1786 Commits

Author SHA1 Message Date
David Gibson
bfc294b90d util: Add helper to write() all of a buffer
write(2) might not write all the data it is given.  Add a write_all_buf()
helper to keep calling it until all the given data is written, or we get an
error.

Currently we use write_remainder() to do this operation in pcap_frame().
That's a little awkward since it requires constructing an iovec, and future
changes we want to make to write_remainder() will be easier in terms of
this single buffer helper.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-18 17:14:59 +02:00
David Gibson
bb41901c71 tcp: Make tcp_update_seqack_wnd()s force_seq parameter explicitly boolean
This parameter is already treated as a boolean internally.  Make it a
'bool' type for clarity.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-18 17:14:55 +02:00
David Gibson
265b2099c7 tcp: Simplify ifdef logic in tcp_update_seqack_wnd()
This function has a block conditional on !snd_wnd_cap shortly before an
snd_wnd_cap is statically false).

Therefore, simplify this down to a single conditional with an else branch.
While we're there, fix some improperly indented closing braces.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-18 17:14:50 +02:00
David Gibson
4aff6f9392 tcp: Clean up tcpi_snd_wnd probing
When available, we want to retrieve our socket peer's advertised window and
forward that to the guest.  That information has been available from the
kernel via the TCP_INFO getsockopt() since kernel commit 8f7baad7f035.

Currently our probing for this is a bit odd.  The HAS_SND_WND define
determines if our headers include the tcp_snd_wnd field, but that doesn't
necessarily mean the running kernel supports it.  Currently we start by
assuming it's _not_ available, but mark it as available if we ever see
a non-zero value in the field.  This is a bit hit and miss in two ways:
 * Zero is perfectly possible window the peer could report, so we can
   get false negatives
 * We're reading TCP_INFO into a local variable, which might not be zero
   initialised, so if the kernel _doesn't_ write it it could have non-zero
   garbage, giving us false positives.

We can use a more direct way of probing for this: getsockopt() reports the
length of the information retreived.  So, check whether that's long enough
to include the field.  This lets us probe the availability of the field
once and for all during initialisation.  That in turn allows ctx to become
a const pointer to tcp_prepare_flags() which cascades through many other
functions.

We also move the flag for the probe result from the ctx structure to a
global, to match peek_offset_cap.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-18 17:14:47 +02:00
David Gibson
7d8804beb8 tcp: Make some extra functions private
tcp_send_flag() and tcp_probe_peek_offset_cap() are not used outside tcp.c,
and have no prototype in a header.  Make them static.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-18 17:14:33 +02:00
David Gibson
5ff5d55291 tcp: Avoid overlapping memcpy() in DUP_ACK handling
When handling the DUP_ACK flag, we copy all the buffers making up the ack
frame.  However, all our frames share the same buffer for the Ethernet
header (tcp4_eth_src or tcp6_eth_src), so copying the TCP_IOV_ETH will
result in a (perfectly) overlapping memcpy().  This seems to have been
harmless so far, but overlapping ranges to memcpy() is undefined behaviour,
so we really should avoid it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-12 09:13:59 +02:00
David Gibson
1f414ed8f0 tcp: Remove redundant initialisation of iov[TCP_IOV_ETH].iov_base
This initialisation for IPv4 flags buffers is redundant with the very next
line which sets both iov_base and iov_len.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-12 09:13:46 +02:00
Stefano Brivio
6b38f07239 apparmor: Allow read access to /proc/sys/net/ipv4/ip_local_port_range
...for both passt and pasta: use passt's abstraction for this.

Fixes: eedc81b6ef ("fwd, conf: Probe host's ephemeral ports")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 15:34:06 +02:00
Stefano Brivio
116bc8266d selinux: Allow read access to /proc/sys/net/ipv4/ip_local_port_range
Since commit eedc81b6ef ("fwd, conf: Probe host's ephemeral ports"),
we might need to read from /proc/sys/net/ipv4/ip_local_port_range in
both passt and pasta.

While pasta was already allowed to open and write /proc/sys/net
entries, read access was missing in SELinux's type enforcement: add
that.

In passt, instead, this is the first time we need to access an entry
there: add everything we need.

Fixes: eedc81b6ef ("fwd, conf: Probe host's ephemeral ports")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 15:34:06 +02:00
David Gibson
a33ecafbd9 tap: Don't risk truncating frames on full buffer in tap_pasta_input()
tap_pasta_input() keeps reading frames from the tap device until the
buffer is full.  However, this has an ugly edge case, when we get close
to buffer full, we will provide just the remaining space as a read()
buffer.  If this is shorter than the next frame to read, the tap device
will truncate the frame and discard the remainder.

Adjust the code to make sure we always have room for a maximum size frame.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 13:56:46 +02:00
David Gibson
d2a1dc744b tap: Restructure in tap_pasta_input()
tap_pasta_input() has a rather confusing structure, using two gotos.
Remove these by restructuring the function to have the main loop condition
based on filling our buffer space, with errors or running out of data
treated as the exception, rather than the other way around.  This allows
us to handle the EINTR which triggered the 'restart' goto with a continue.

The outer 'redo' was triggered if we completely filled our buffer, to flush
it and do another pass.  This one is unnecessary since we don't (yet) use
EPOLLET on the tap device: if there's still more data we'll get another
event and re-enter the loop.

Along the way handle a couple of extra edge cases:
 - Check for EWOULDBLOCK as well as EAGAIN for the benefit of any future
   ports where those might not have the same value
 - Detect EOF on the tap device and exit in that case

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 13:56:43 +02:00
David Gibson
11e29054fe tap: Improve handling of EINTR in tap_passt_input()
When tap_passt_input() gets an error from recv() it (correctly) does not
print any error message for EINTR, EAGAIN or EWOULDBLOCK.  However in all
three cases it returns from the function.  That makes sense for EAGAIN and
EWOULDBLOCK, since we then want to wait for the next EPOLLIN event before
trying again.  For EINTR, however, it makes more sense to retry immediately
- as it stands we're likely to get a renewer EPOLLIN event immediately in
that case, since we're using level triggered signalling.

So, handle EINTR separately by immediately retrying until we succeed or
get a different type of error.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 13:56:41 +02:00
David Gibson
49fc4e0414 tap: Split out handling of EPOLLIN events
Currently, tap_handler_pas{st,ta}() check for EPOLLRDHUP, EPOLLHUP and
EPOLLERR events, then assume anything left is EPOLLIN.  We have some future
cases that may want to also handle EPOLLOUT, so in preparation explicitly
handle EPOLLIN, moving the logic to a subfunction.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 13:56:37 +02:00
Stefano Brivio
63513e54f3 util: Fix order of operands and carry of one second in timespec_diff_us()
If the nanoseconds of the minuend timestamp are less than the
nanoseconds of the subtrahend timestamp, we need to carry one second
in the subtraction.

I subtracted this second from the minuend, but didn't actually carry
it in the subtraction of nanoseconds, and logged timestamps would jump
back whenever we switched to the first branch of timespec_diff_us()
from the second one.

Most likely, the reason why I didn't carry the second is that I
instinctively thought that swapping the operands would have the same
effect. But it doesn't, in general: that only happens with arithmetic
in modulo powers of 2. Undo the swap as well.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-09-06 13:01:34 +02:00
David Gibson
748ef4cd6e cppcheck: Work around some cppcheck 2.15.0 redundantInitialization warnings
cppcheck-2.15.0 has apparently broadened when it throws a warning about
redundant initialization to include some cases where we have an initializer
for some fields, but then set other fields in the function body.

This is arguably a false positive: although we are technically overwriting
the zero-initialization the compiler supplies for fields not explicitly
initialized, this sort of construct makes sense when there are some fields
we know at the top of the function where the initializer is, but others
that require more complex calculation.

That said, in the two places this shows up, it's pretty easy to work
around.  The results are arguably slightly clearer than what we had, since
they move the parts of the initialization closer together.

So do that rather than having ugly suppressions or dealing with the
tedious process of reporting a cppcheck false positive.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 12:54:20 +02:00
Stefano Brivio
afedc2412e tcp: Use EPOLLET for any state of not established connections
Currently, for not established connections, we monitor sockets with
edge-triggered events (EPOLLET) if we are in the TAP_SYN_RCVD state
(outbound connection being established) but not in the
TAP_SYN_ACK_SENT case of it (socket is connected, and we sent SYN,ACK
to the container/guest).

While debugging https://bugs.passt.top/show_bug.cgi?id=94, I spotted
another possibility for a short EPOLLRDHUP storm (10 seconds), which
doesn't seem to happen in actual use cases, but I could reproduce it:
start a connection from a container, while dropping (using netfilter)
ACK segments coming out of the container itself.

On the server side, outside the container, accept the connection and
shutdown the writing side of it immediately.

At this point, we're in the TAP_SYN_ACK_SENT case (not just a mere
TAP_SYN_RCVD state), we get EPOLLRDHUP from the socket, but we don't
have any reasonable way to handle it other than waiting for the tap
side to complete the three-way handshake. So we'll just keep getting
this EPOLLRDHUP until the SYN_TIMEOUT kicks in.

Always enable EPOLLET when EPOLLRDHUP is the only epoll event we
subscribe to: in this case, getting multiple EPOLLRDHUP reports is
totally useless.

In the only remaining non-established state, SOCK_ACCEPTED, for
inbound connections, we're anyway discarding EPOLLRDHUP events until
we established the conection, because we don't know what to do with
them until we get an answer from the tap side, so it's safe to enable
EPOLLET also in that case.

Link: https://bugs.passt.top/show_bug.cgi?id=94
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 12:54:16 +02:00
David Gibson
aff5a49b0e udp: Handle more error conditions in udp_sock_errs()
udp_sock_errs() reads out everything in the socket error queue.  However
we've seen some cases[0] where an EPOLLERR event is active, but there isn't
anything in the queue.

One possibility is that the error is reported instead by the SO_ERROR
sockopt.  Check for that case and report it as best we can.  If we still
get an EPOLLERR without visible error, we have no way to clear the error
state, so treat it as an unrecoverable error.

[0] https://github.com/containers/podman/issues/23686#issuecomment-2324945010

Link: https://bugs.passt.top/show_bug.cgi?id=95
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 12:53:38 +02:00
David Gibson
bd99f02a64 udp: Treat errors getting errors as unrecoverable
We can get network errors, usually transient, reported via the socket error
queue.  However, at least theoretically, we could get errors trying to
read the queue itself.  Since we have no idea how to clear an error
condition in that case, treat it as unrecoverable.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 12:53:35 +02:00
David Gibson
bd092ca421 udp: Split socket error handling out from udp_sock_recv()
Currently udp_sock_recv() both attempts to clear socket errors and read
a batch of datagrams for forwarding.  That made sense initially, since
both listening and reply sockets need to do this.  However, we have certain
error cases which will add additional complexity to the error processing.
Furthermore, if we ever wanted to more thoroughly handle errors received
here - e.g. by synthesising ICMP messages on the tap device - it will
likely require different handling for the listening and reply socket cases.

So, split handling of error events into its own udp_sock_errs() function.
While we're there, allow it to report "unrecoverable errors".  We don't
have any of these so far, but some cases we're working on might require it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 12:53:33 +02:00
David Gibson
88bfa3801e flow: Helpers to log details of a flow
The details of a flow - endpoints, interfaces etc. - can be pretty
important for debugging.  We log this on flow state transitions, but it can
also be useful to log this when we report specific conditions.  Add some
helper functions and macros to make it easy to do that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 12:53:30 +02:00
David Gibson
1166401c2f udp: Allow UDP flows to be prematurely closed
Unlike TCP, UDP has no in-band signalling for the end of a flow.  So the
only way we remove flows is on a timer if they have no activity for 180s.

However, we've started to investigate some error conditions in which we
want to prematurely abort / abandon a UDP flow.  We can call
udp_flow_close(), which will make the flow inert (sockets closed, no epoll
events, can't be looked up in hash).  However it will still wait 3 minutes
to clear away the stale entry.

Clean this up by adding an explicit 'closed' flag which will cause a flow
to be more promptly cleaned up.  We also publish udp_flow_close() so it
can be called from other places to abort UDP flows().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 12:53:24 +02:00
David Gibson
7ad9f9bd2b flow: Fix incorrect hash probe in flowside_lookup()
Our flow hash table uses linear probing in which we step backwards through
clusters of adjacent hash entries when we have near collisions.  Usually
that's implemented by flow_hash_probe().  However, due to some details we
need a second implementation in flowside_lookup().  An embarrassing
oversight in rebasing from earlier versions has mean that version is
incorrect, trying to step forward through clusters rather than backward.

In situations with the right sorts of has near-collisions this can lead to
us not associating an ACK from the tap device with the right flow, leaving
it in a not-quite-established state.  If the remote peer does a shutdown()
at the right time, this can lead to a storm of EPOLLRDHUP events causing
high CPU load.

Fixes: acca4235c4 ("flow, tcp: Generalise TCP hash table to general flow hash table")
Link: https://bugs.passt.top/show_bug.cgi?id=94
Suggested-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-06 12:52:31 +02:00
Stefano Brivio
0ea60e5a77 log: Don't prefix log file messages with time and severity if they're continuations
In fecb1b65b1 ("log: Don't prefix message with timestamp on --debug
if it's a continuation"), I fixed this for --debug on standard error,
but not for log files: if messages are continuations, they shouldn't
be prefixed by timestamp and severity.

Otherwise, we'll print stuff like this:

  0.0028: ERROR:   Receive error on guest connection, reset0.0028:  ERROR:   : Bad file descriptor

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-09-06 12:52:21 +02:00
Michal Privoznik
38363964fc Makefile: Enable _FORTIFY_SOURCE iff needed
On some systems source fortification is enabled whenever code
optimization is enabled (e.g. with -O2). Since code fortification
is explicitly enabled too (with possibly different value than the
system wants, there are three levels [1]), distros are required
to patch our Makefile, e.g. [2].

Detect whether fortification is not already enabled and enable it
explicitly only if really needed.

1: https://www.gnu.org/software/libc/manual/html_node/Source-Fortification.html
2: edfeb8763a

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-29 22:26:21 +02:00
David Gibson
eedc81b6ef fwd, conf: Probe host's ephemeral ports
When we forward "all" ports (-t all or -u all), or use an exclude-only
range, we don't actually forward *all* ports - that wouln't leave local
ports to use for outgoing connections.  Rather we forward all non-ephemeral
ports - those that won't be used for outgoing connections or datagrams.

Currently we assume the range of ephemeral ports is that recommended by
RFC 6335, 49152-65535.  However, that's not the range used by default on
Linux, 32768-60999 but configurable with the net.ipv4.ip_local_port_range
sysctl.

We can't really know what range the guest will consider ephemeral, but if
it differs too much from the host it's likely to cause problems we can't
avoid anyway.  So, using the host's ephemeral range is a better guess than
using the RFC 6335 range.

Therefore, add logic to probe the host's ephemeral range, falling back to
the RFC 6335 range if that fails.  This has the bonus advantage of
reducing the number of ports bound by -t all -u all on most Linux machines
thereby reducing kernel memory usage.  Specifically this reduces kernel
memory usage with -t all -u all from ~380MiB to ~289MiB.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-29 22:26:08 +02:00
David Gibson
4a41dc58d6 conf, fwd: Don't attempt to forward port 0
When using -t all, -u all or exclude-only ranges, we'll attempt to forward
all non-ephemeral port numbers, including port 0.  However, this won't work
as intended: bind() treats a zero port not as literal port 0, but as
"pick a port for me".  Because of the special meaning of port 0, we mostly
outright exclude it in our handling.

Do the same for setting up forwards, not attempting to forward for port 0.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-29 22:26:05 +02:00
David Gibson
1daf6f4615 conf, fwd: Make ephemeral port logic more flexible
"Ephemeral" ports are those which the kernel may allocate as local
port numbers for outgoing connections or datagrams.  Because of that,
they're generally not good choices for listening servers to bind to.

Thefore when using -t all, -u all or exclude-only ranges, we map only
non-ephemeral ports.  Our logic for this is a bit rigid though: we
assume the ephemeral ports are always a fixed range at the top of the
port number space.  We also assume PORT_EPHEMERAL_MIN is a multiple of
8, or we won't set the forward bitmap correctly.

Make the logic in conf.c more flexible, using a helper moved into
fwd.[ch], although we don't change which ports we consider ephemeral
(yet).

The new handling is undoubtedly more computationally expensive, but
since it's a once-off operation at start off, I don't think it really
matters.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-29 22:25:51 +02:00
Stefano Brivio
712ca32353 seccomp.sh: Try to account for terminal width while formatting list of system calls
Avoid excess lines on wide terminals, but make sure we don't fail if
we can't fetch the number of columns for any reason, as it's not a
fundamental feature and we don't want to break anything with it.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-27 14:30:17 +02:00
David Gibson
e0be6bc2f4 udp: Use dual stack sockets for port forwarding when possible
Platforms like Linux allow IPv6 sockets to listen for IPv4 connections as
well as native IPv6 connections.  By doing this we halve the number of
listening sockets we need (assuming passt/pasta is listening on the same
ports for IPv4 and IPv6).  When forwarding many ports (e.g. -u all) this
can significantly reduce the amount of kernel memory that passt consumes.

We've used such dual stack sockets for TCP since 8e914238b "tcp: Use dual
stack sockets for port forwarding when possible".  Add similar support for
UDP "listening" sockets.  Since UDP sockets don't use as much kernel memory
as TCP sockets this isn't as big a saving, but it's still significant.
When forwarding all TCP and UDP ports for both IPv4 & IPv6 (-t all -u all),
this reduces kernel memory usage from ~522 MiB to ~380MiB (kernel version
6.10.6 on Fedora 40, x86_64).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-27 09:04:41 +02:00
David Gibson
c78b194001 udp: Remove unnnecessary local from udp_sock_init()
The 's' variable is always redundant with either 'r4' or 'r6', so remove
it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-27 09:04:38 +02:00
David Gibson
620e19a1b4 udp: Merge udp[46]_mh_recv arrays
We've already gotten rid of most of the IPv4/IPv6 specific data structures
in udp.c by merging them with each other.  One significant one remains:
udp[46]_mh_recv.  This was a bit awkward to remove because of a subtle
interaction.  We initialise the msg_namelen fields to represent the total
size we have for a socket address, but when we receive into the arrays
those are modified to the actual length of the sockaddr we received.

That meant that naively merging the arrays meant that if we received IPv4
datagrams, then IPv6 datagrams, the addresses for the latter would be
truncated.  In this patch address that by resetting the received
msg_namelen as soon as we've found a flow for the datagram.  Finding the
flow is the only thing that might use the actual sockaddr length, although
we in fact don't need it for the time being.

This also removes the last use of the 'v6' field from udp_listen_epoll_ref,
so remove that as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-27 09:04:25 +02:00
Stefano Brivio
418feb37ec test: Look for possible sshd-session paths (if it's there at all) in mbuto's profile
Some distributions already have OpenSSH 9.8, which introduces split
sshd/sshd-session binaries, and there we need to copy the binary from
the host, which can be /usr/libexec/openssh/sshd-session (Fedora
Rawhide), /usr/lib/ssh/sshd-session (Arch Linux),
/usr/lib/openssh/sshd-session (Debian), and possibly other paths.

Add at least those three, and, if we don't find sshd-session, assume
we don't need it: it could very well be an older version of OpenSSH,
as reported by David for Fedora 40, or perhaps another daemon (would
Dropbear even work? I'm not sure).

Reported-by: David Gibson <david@gibson.dropbear.id.au>
Fixes: d6817b3930 ("test/passt.mbuto: Install sshd-session OpenSSH's split process")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-27 09:03:47 +02:00
Stefano Brivio
1d6142f362 README: pasta is indeed a supported back-end for rootless Docker
...https://github.com/moby/moby/issues/48257 just reminded me.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-21 12:05:26 +02:00
Stefano Brivio
f00ebda369 util: Don't stop on unrelated values when looking for --fd in close_open_files()
Seen with krun: we get a file descriptor via --fd, but we close it and
happily use the same number for TCP files.

The issue is that if we also get other options before --fd, with
arguments, getopt_long() stops parsing them because it sees them as
non-option values.

Use the - modifier at the beginning of optstring (before :, which is
needed to avoid printing errors) instead of +, which means we'll
continue parsing after finding unrelated option values, but
getopt_long() won't reorder them anyway: they'll be passed with option
value '1', which we can ignore.

By the way, we also need to add : after F in the optstring, so that
we're able to parse the option when given as short name as well.

Now that we change the parsing mode between close_open_files() and
conf(), we need to reset optind to 0, not to 1, whenever we call
getopt_long() again in conf(), so that the internal initialisation
of getopt_long() evaluating GNU extensions is re-triggered.

Link: https://github.com/slp/krun/issues/17#issuecomment-2294943828
Fixes: baccfb95ce ("conf: Stop parsing options at first non-option argument")
Fixes: 09603cab28 ("passt, util: Close any open file that the parent might have leaked")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-21 12:04:53 +02:00
Stefano Brivio
05453ea590 test: Update list of dependencies in README.md
Mostly packages we now need to run Podman-based tests.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-21 12:04:30 +02:00
Stefano Brivio
1a66806c18 tcp, udp: Allow timerfd_gettime64() and recvmmsg_time64() on arm (armhf)
These system calls are needed after the conversion of time_t to 64-bit
types on 32-bit architectures.

Tested by running some transfer tests with passt and pasta on Debian
Bookworm (glibc 2.36) and Trixie (glibc 2.39), running on armv6l.

Suggested-by: Faidon Liambotis <paravoid@debian.org>
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078981
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-21 12:04:17 +02:00
Stefano Brivio
6e9ecf5741 util: Provide own version of close_range(), and no-op fallback
musl, as of 1.2.5, and glibc < 2.34 don't ship a (trivial)
close_range() implementation. This will probably be added to musl
soon, by the way:
  https://www.openwall.com/lists/musl/2024/08/01/9

Add a weakly-aliased implementation, if it's supported by the kernel.
If it's not supported (< 5.9), use a no-op fallback. Looping over 2^31
file descriptors calling close() on them is probably not a good idea.

Reported-by: lemmi <lemmi@nerd2nerd.org>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-21 12:03:48 +02:00
Stefano Brivio
7291b70ba7 udp_flow: Add missing unistd.h include for close()
For some reason, this is reported only with musl, and older glibc
versions (2.31, at least).

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-21 12:03:34 +02:00
Stefano Brivio
396307541e test: Duplicate existing recvfrom() valgrind suppression for recv()
Some architectures, including i686, actually have a recv() system
call, not just a recvfrom(), and we need to cover the recv() with
MSG_TRUNC into a NULL buffer for them as well.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-21 12:03:23 +02:00
Stefano Brivio
d6817b3930 test/passt.mbuto: Install sshd-session OpenSSH's split process
OpenSSH now ships a per-session binary, sshd-session, with sshd
acting as mere listener. It's typically not found in $PATH, so specify
the whole path at which it's commonly installed in $PROGS.

Link: https://www.openssh.com/releasenotes.html#9.8p1
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-21 12:03:03 +02:00
Stefano Brivio
34be8eeb38 test/passt.mbuto: Run sshd from vsock proxy with absolute path
...OpenSSH >= 9.8 otherwise complains that:

  sshd requires execution with an absolute path

Link: https://bugs.gentoo.org/936041
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078429
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-21 12:02:37 +02:00
Stefano Brivio
aded2b671c test/lib/setup: Transform i686 kernel architecture name into QEMU name (i386)
It's qemu-system-i386, but uname -m reports i686. I didn't test i486
and i586.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-21 12:01:48 +02:00
Stefano Brivio
2aea1da143 treewide: Allow additional system calls for i386/i686
I haven't tested i386 for a long time (after playing with some
openSUSE i586 image a couple of years ago). It turns out that a number
of system calls we actually need were denied by the seccomp filter,
and not even basic functionality works.

Add some system calls that glibc started using with the 64-bit time
("t64") transition, see also:

  https://wiki.debian.org/ReleaseGoals/64bit-time

that is: clock_gettime64, timerfd_gettime64, fcntl64, and
recvmmsg_time64.

Add further system calls that are needed regardless of time_t width,
that is, mmap2 (valgrind profile only), _llseek and sigreturn (common
outside x86_64), and socketcall (same as s390x).

I validated this against an almost full run of the test suite, with
just a few selected tests skipped. Fixes needed to run most tests on
i386/i686, and other assorted fixes for tests, are included in
upcoming patches.

Reported-by: Uroš Knupleš <uros@knuples.net>
Analysed-by: Faidon Liambotis <paravoid@debian.org>
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078981
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-21 12:00:43 +02:00
David Gibson
57b7bd2a48 fwd, conf: Allow NAT of the guest's assigned address
The guest is usually assigned one of the host's IP addresses.  That means
it can't access the host itself via its usual address.  The
--map-host-loopback option (enabled by default with the gateway address)
allows the guest to contact the host.  However, connections forwarded this
way appear on the host to have originated from the loopback interface,
which isn't always desirable.

Add a new --map-guest-addr option, which acts similarly but forwarded
connections will go to the host's external address, instead of loopback.

If '-a' is used, so the guest's address is not the same as the host's, this
will instead forward to whatever host-visible site is shadowed by the
guest's assigned address.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:40 +02:00
David Gibson
8436c0d61b fwd: Distinguish translatable from untranslatable addresses on inbound
fwd_nat_from_host() needs to adjust the source address for new flows coming
from an address which is not accessible to the guest.  Currently we always
use our_tap_addr or our_tap_ll.  However in cases where the address is
accessible to the guest via translation (i.e. via --map-host-loopback) then
it makes more sense to use that translation, rather than the fallback
mapping of our_tap_*.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:37 +02:00
David Gibson
e813a4df7d conf: Allow address remapped to host to be configured
Because the host and guest share the same IP address with passt/pasta, it's
not possible for the guest to directly address the host.  Therefore we
allow packets from the guest going to a special "NAT to host" address to be
redirected to the host, appearing there as though they have both source and
destination address of loopback.

Currently that special address is always the address of the default
gateway (or none).  That can be a problem if we want that gateway to be
addressable by the guest.  Therefore, allow the special "NAT to host"
address to be overridden on the command line with a new --map-host-loopback
option.

In order to exercise and test it, update the passt_in_ns and perf
tests to use this option and give different mapping addresses for the
two layers of the environment.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:35 +02:00
David Gibson
dbaaebbe00 test: Reconfigure IPv6 address after changing MTU
In the TCP throughput tests, we adjust the guest's MTU in order to test
various packet sizes.  Some of those are below 1280 which causes IPv6 to
be deconfigured on the guest interface.  When we increase it above 1280
again, IPv6 is re-enabled and we get an address in the right prefix with
NDP, but we don't get exactly the expected address back - that's only
communicated with --config-net or DHCPv6.

With changes to how we handle NAT this can cause some of the IPv6 tests to
fail, because they don't use the address that passt/pasta expects, and the
guest doesn't initiate any traffic which allows us to learn what the new
address is.

Work around this by re-invoking dhclient -6 between adjusting the MTU and
running IPv6 test cases.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:33 +02:00
David Gibson
935bd81936 conf, fwd: Split notion of gateway/router from guest-visible host address
The @gw fields in the ip4_ctx and ip6_ctx give the (host's) default
gateway.  We use this for two quite distinct things: advertising the
gateway that the guest should use (via DHCP, NDP and/or --config-net)
and for a limited form of NAT.  So that the guest can access services
on the host, we map the gateway address within the guest to the
loopback address on the host.

Using the gateway address for this isn't necessarily the best choice
for this purpose, certainly not for all circumstances.  So, start off
by splitting the notion of these into two different values: @guest_gw
which is the gateway address the guest should use and @nat_host_loopback,
which is the guest visible address to remap to the host's loopback.

Usually nat_host_loopback will have the same value as guest_gw.  However
when --no-map-gw is specified we leave them unspecified instead.  This
means when we use nat_host_loopback, we don't need to separately check
c->no_map_gw to see if it's relevant.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:31 +02:00
David Gibson
90e83d50a9 Don't take "our" MAC address from the host
When sending frames to the guest over the tap link, we need a source MAC
address.  Currently we take that from the MAC address of the main interface
on the host, but that doesn't actually make much sense:
 * We can't preserve the real MAC address of packets from anywhere
   external so there's no transparency case here
 * In fact, it's confusingly different from how we handle IP addresses:
   whereas we give the guest the same IP as the host, we're making the
   host's MAC the one MAC that the guest *can't* use for itself.
 * We already need a fallback case if the host doesn't have an Ethernet
   like MAC (e.g. if it's connected via a point to point interface, such
   as a wireguard VPN).

Change to just just use an arbitrary fixed MAC address - I've picked
9a:55:9a:55:9a:55.  It's simpler and has the small advantage of making
the fact that passt/pasta is in use typically obvious from guest side
packet dumps.  This can still, of course, be overridden with the -M option.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:28 +02:00
David Gibson
356de97e43 fwd: Split notion of "our tap address" from gateway for IPv4
ip4.gw conflates 3 conceptually different things, which (for now) have the
same value:
  1. The router/gateway address as seen by the guest
  2. An address to NAT to the host with --no-map-gw isn't specified
  3. An address to use as source when nothing else makes sense

Case 3 occurs in two situations:

a) for our DHCP responses - since they come from passt internally there's
   no naturally meaningful address for them to come from
b) for forwarded connections coming from an address that isn't guest
   accessible (localhost or the guest's own address).

(b) occurs even with --no-map-gw, and the expected behaviour of forwarding
local connections requires it.

For IPv6 role (3) is now taken by ip6.our_tap_ll (which usually has the
same value as ip6.gw).  For future flexibility we may want to make this
"address of last resort" different from the gateway address, so split them
logically for IPv4 as well.

Specifically, add a new ip4.our_tap_addr field for the address with this
role, and initialise it to ip4.gw for now.  Unlike IPv6 where we can always
get a link-local address, we might not be able to get a (non 0.0.0.0)
address here (e.g. if the host is disconnected or only has a point to point
link with no gateway address).  In that case we have to disable forwarding
of inbound connections with guest-inaccessible source addresses.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:26 +02:00