1
0
mirror of https://passt.top/passt synced 2025-01-18 10:25:17 +00:00

1453 Commits

Author SHA1 Message Date
David Gibson
ee677e0a42 tcp_splice: Improve logic deciding when to splice
This makes several tweaks to improve the logic which decides whether
we're able to use the splice method for a new connection.

 * Rather than only calling tcp_splice_conn_from_sock() in pasta mode, we
   check for pasta mode within it, better localising the checks.
 * Previously if we got a connection from a non-loopback address we'd
   always fall back to the "tap" path, even if the  connection was on a
   socket in the namespace.  If we did get a non-loopback address on a
   namespace socket, something has gone wrong and the "tap" path certainly
   won't be able to handle it.  Report the error and close, rather than
   passing it along to tap.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:48:13 +01:00
David Gibson
4c2d923b12 tcp_splice: Improve error reporting on connect path
This makes a number of changes to improve error reporting while
connecting a new spliced socket:
* We use flow_err() and similar functions so all messages include info
   on which specific flow was affected
 * We use strerror() to interpret raw error values
 * We now report errors on connection (at "trace" level, since this would
   allow spamming the logs)
 * We also look up and report some details on EPOLLERR events, which can
   include connection errors, since we use a non-blocking connect().  Again
   we use "trace" level since this can spam the logs.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:48:11 +01:00
David Gibson
f0e2a6b8c9 tcp_splice: Make tcp_splice_connect() create its own sockets
Currently creating the connected socket for a splice is split between
tcp_splice_conn_from_sock(), which opens the socket, and
tcp_splice_connect() which connects it.  Alter tcp_splice_connect() to
open its own socket based on an address family and pif we pass it.

This does require a second conditional on pif, but makes for a more
logical split of functionality: tcp_splice_conn_from_sock() picks the
target, tcp_splice_connect() creates the connection.  While we're
there improve reporting of errors

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:48:09 +01:00
David Gibson
f4e5d73684 tcp_splice: Merge tcp_splice_new() into its caller
The only caller of tcp_splice_new() is tcp_splice_conn_from_sock().
Both are quite short, and the division of responsibilities between the
two isn't particularly obvious.  Simplify by merging the former into
the latter.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:48:06 +01:00
David Gibson
04d3d02603 tcp_splice: More specific variable names in new splice path
In tcp_splice_conn_from_sock(), the 'port' variable stores the source
port of the connection on the originating side.  In tcp_splice_new(),
called directly from it, the 'port' parameter gives the _destination_
port of the originating connection and is then updated to the
destination port of the connection on the other side.

Similarly, in tcp_splice_conn_from_sock(), 's' is the fd of the
accetped socket (on side 0), whereas in tcp_splice_new(), 's' is the
fd of the connecting socket (side 1).

I, for one, find having the same variable name with different meanings
in such close proximity in the flow of control pretty confusing.
Alter the names for greater specificity and clarity.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:48:03 +01:00
David Gibson
0f938c3b9a flow: Clarify flow entry life cycle, introduce uniform logging
Our allocation scheme for flow entries means there are some
non-obvious constraints on when what things can be done with an entry.
Add a big doc comment explaining the life cycle.

In addition, make a FLOW_START() macro to mark one of the important
transitions.  This encourages correct usage, by making it natural to
only access the flow type specific structure after calling it.  It
also logs that a new flow has been created, which is useful for
debugging.

We also add logging when a flow's lifecycle ends.  This doesn't need a
new helper, because it can only happen either from flow_alloc_cancel()
or from the flow deferred handler.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:48:01 +01:00
David Gibson
d0550f97cd tcp_splice: Don't use flow_trace() before setting flow type
In tcp_splice_conn_from_sock() we can call flow_trace() if there's an
error setting TCP_QUICKACK.  However, we do so before we've set the
flow type in the flow entry.  That means that flow_trace() will print
nonsense when it tries to print the flow type.

There's no reason the setsockopt() has to happen before initialising
the flow entry, so just move it after.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:47:58 +01:00
David Gibson
80f9b61b50 tcp_splice: Simplify clean up logic
Currently tcp_splice_flow_defer() contains specific logic to determine
if we're far enough initialised that we need to close pipes and/or
sockets.  This is potentially fragile if we change something about the
order in which we do things.  We can simplify this by initialising the
pipe and socket fields to -1 very early, then close()ing them if and
only if they're non-negative.

This lets us remove a special case cleanup if our connect() fails.
This will already trigger a CLOSING event, and the socket fd in
question is populated in the connection structure.  Thus we can let
the new cleanup logic handle it rather than requiring an explicit
close().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:47:48 +01:00
David Gibson
76c7e1dca3 flow: Add helper to determine a flow's protocol
Each flow already has a type field.  This implies the protocol the
flow represents, but also has more information: we have two ways to
represent TCP flows, "tap" and "spliced".  In order to generalise some
of the flow mechanics, we'll need to determine a flow's protocol in
terms of the IP (L4) protocol number.

Introduce a constant table and helper macro to derive this from the flow
type.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:47:45 +01:00
David Gibson
bb9bf0bb8f tcp, udp: Don't precompute port remappings in epoll references
The epoll references for both TCP listening sockets and UDP sockets
includes a port number.  This gives the destination port that traffic
to that socket will be sent to on the other side.  That will usually
be the same as the socket's bound port, but might not if the -t, -u,
-T or -U options are given with different original and forwarded port
numbers.

As we move towards a more flexible forwarding model for passt, it's
going to become possible for that destination port to vary depending
on more things (for example the source or destination address).  So,
it will no longer make sense to have a fixed value for a listening
socket.

Change to simpler semantics where this field in the reference gives
the bound port of the socket.  We apply the translations to the
correct destination port later on, when we're actually forwarding.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:47:40 +01:00
David Gibson
e196eada6f util: Allow IN4_IS_* macros to operate on untyped addresses
The IN4_IS_*() macros expect a pointer to a struct in_addr.  That
makes sense, but sometimes we have an IPv4 address as a void * pointer
or union type which makes these less convenient.  Additionally, this
doesn't match the behaviour of the standard library's IN6_IS_*()
macros on which they're modelled, nor our own IN4_ARE_ADDR_EQUAL().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:47:35 +01:00
David Gibson
f6e6e8ad40 inany: Introduce union sockaddr_inany
There are a number of places where we want to handle either a
sockaddr_in or a sockaddr_in6.  In some of those we use a void *,
which works ok and matches some standard library interfaces, but
doesn't give a signature level hint that we're dealing with only
sockaddr_in or sockaddr_in6, not (say) sockaddr_un or another type of
socket address.  Other places we use a sockaddr_storage, which also
works, but has the same problem in addition to allocating more on the
stack than we need to.

Introduce union sockaddr_inany to explictly handle this case: it has
variants for sockaddr_in and sockaddr_in6.  Use it in a number of
places where it's easy to do so.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:47:31 +01:00
David Gibson
0cf6b2d89d inany: Provide more conveniently typed constants for special addresses
Our inany_addr type is used in some places to represent either IPv4 or
IPv6 addresses, and we plan to use it more widely.  We don't yet
provide constants of this type for special addresses (loopback and
"any").  Add some of these, both the IPv4 and IPv6 variants of those
addresses, but typed as union inany_addr.

To avoid actually adding more things to .data we can use some macros and
casting to overlay the IPv6 versions of these with the standard library's
in6addr_loopback and in6addr_any.  For the IPv4 versions we need to create
new constant globals.

For complicated historical reasons, the standard library doesn't
provide constants for IPv4 loopback and any addresses as struct
in_addr.  It just has macros of type in_addr_t == uint32_t, which has
some gotchas w.r.t. endianness.  We can use some more macros to
address this lack, using macros to effectively create these IPv4
constants as pieces of the inany constants above.

We use this last to avoid some awkward temporary variables just used
to get an address of an IPv4 loopback address.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:47:28 +01:00
David Gibson
330b5db77d inany: Add inany_ntop() helper
Add this helper to format an inany into either IPv4 or IPv6 text
format as appropriate.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:47:25 +01:00
David Gibson
d31277e292 inany: Helper to test for various address types
Add helpers to determine if an inany is loopback, unspecified or
multicast, regardless of whether it's a "true" IPv6 address or an IPv4
address represented as v4-mapped.

Use the loopback helper to simplify tcp_splice_conn_from_sock() slightly.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:47:21 +01:00
David Gibson
9a3fb5eb68 tap: Use write_remainder() in tap_send_frames_passt()
When we determine we have sent a partial frame in tap_send_frames_passt(),
we call tap_send_remainder() to send the remainder of it.  The logic in
that function is very similar to that in the more general write_remainder()
except that it uses send() instead of write()/writev().  But we are dealing
specifically with the qemu socket here, which is a connected stream socket.
In that case write()s do the same thing as send() with the options we were
using, so we can just reuse write_remainder().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 06:35:03 +01:00
David Gibson
dda7945ca9 pcap: Handle short writes in pcap_frame()
Currently pcap_frame() assumes that if write() doesn't return an error, it
has written everything we want.  That's not necessarily true, because it
could return a short write.  That's not likely to happen on a regular file,
but there's not a lot of reason not to be robust here; it's conceivable we
might want to direct the pcap fd at a named pipe or similar.

So, make pcap_frame() handle short frames by using the write_remainder()
helper.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Formatting fix, and avoid gcc warning in pcap_frame()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 06:35:01 +01:00
David Gibson
8bdb0883b4 util: Add write_remainder() helper
We have several places where we want to write(2) a buffer or buffers and we
handle short write()s by retrying until everything is successfully written.
Add a helper for this in util.c.

This version has some differences from the typical write_all() function.
First, take an IO vector rather than a single buffer, because that will be
useful for some of our cases.  Second, allow it to take an parameter to
skip the first n bytes of the given buffers.  This will be useful for some
of the cases we want, and also falls out quite naturally from the
implementation.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Minor formatting fixes in write_remainder()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 06:25:17 +01:00
David Gibson
24410b37a4 pcap: Update pcap_frame() to take an iovec and offset
Update the low-level helper pcap_frame() to take a struct iovec and
offset within it, rather than an explicit pointer and length for the
frame.  This moves the handling of an offset (to skip vnet_len) from
pcap_multiple() to pcap_frame().

This doesn't accomplish a great deal immediately, but will make
subsequent changes easier.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 06:24:10 +01:00
David Gibson
64b63d9e3e iov: Add helper to find skip over first n bytes of an io vector
Several of the IOV functions in iov.c, and also tap_send_frames_passt()
needs to determine which buffer element a byte offset into an IO vector
lies in.  Split this out into a helper function iov_skip_bytes().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 06:24:07 +01:00
Laurent Vivier
2a6f8bcca7 iov: add some functions to manage iovec
Introduce functions to copy to/from a buffer from/to an iovec array,
to compute data length in in bytes of an iovec and to copy memory from
an iovec to another.

iov_from_buf(), iov_to_buf(), iov_size(), iov_copy().

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Message-ID: <20240217150725.661467-2-lvivier@redhat.com>
[dwg: Small changes to suppress cppcheck warnings]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 06:23:49 +01:00
David Gibson
90f1d3b354 udp: Remove unnecessary test for unspecified addr_out
If the configured output address is unspecified, we don't set the bind
address to it when creating a new socket in udp_tap_handler().  That sounds
sensible, but what we're leaving the bind address as is, exactly, the
unspecified address, so this test makes no difference.  Remove it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 06:23:47 +01:00
David Gibson
745fa38169 udp: Fix incorrect usage of IPv6 state in IPv4 path
When forwarding IPv4 packets in udp_tap_handler(), we incorrectly use an
IPv6 address test on our IPv4 address (which could cause an out of bounds
access), and possibly set our bind interface to the IPv6 interface based on
it.  Adjust to correctly look at the IPv4 address and IPv4 interface.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 05:41:03 +01:00
David Gibson
deea5a8437 udp: Small streamline to udp_update_hdr4()
Streamline the logic here slightly, by introducing a 'src' temporary for
brevity.  We also transform the logic for setting/clearing PORT_LOOPBACK.
This makes udp_update_hdr4() more closely match the corresponding logic
from udp_update_udp6().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 05:41:01 +01:00
David Gibson
bc2d0d381c udp: Set pif in epoll reference for ephemeral host sockets
The udp_epoll_ref contains a field for the pif to which the socket belongs.
We fill this in for permanent sockets created with udp_sock_init() and for
spliced sockets, however, we omit it for ephemeral sockets created for
tap originated flows.

This is a bug, although we currently get away with it, because we don't
consult that field for such flows.  Correctly fill it in.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 05:40:59 +01:00
David Gibson
720d777a69 udp: Don't attempt to translate a 0.0.0.0 source address
If an incoming packet has a source address of 0.0.0.0 we translate that to
the gateway address.  This doesn't really make sense, because we have no
way to do a reverse translation for reply packets.

Certain UDP protocols do use an unspecified source address in some
circumstances (e.g. DHCP).  These generally either require no reply, a
multicast reply, or provide a suitable reply address by other means.

In none of those cases does translating it in passt/pasta make sense.  The
best we can really do here is just leave it as is.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 05:40:46 +01:00
Stefano Brivio
bee61dd7d0 conf: If no interface with a default route was found, say it
...instead of implying that by stating that there's no routable
interface for a given IP version. There might be interfaces with
non-default routes.

Suggested-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Paul Holzinger <pholzing@redhat.com>
2024-02-28 18:57:49 +01:00
Stefano Brivio
925af4ef82 Makefile: check for cppcheck's --check-level option in cppcheck target
Don't run cppcheck to find out if the --check-level=exhaustive option
is available, unless we're actually going to run cppcheck later.

To avoid this, move this check under the cppcheck target, and
implement it in shell script instead of using Makefile directives,
because we can't easily implement conditionals in recipes.

Reported-by: Rahil Bhimjiani <me@rahil.website>
Link: https://bugs.gentoo.org/920795
Fixes: 8640d62af719 ("cppcheck: Use "exhaustive" level checking when available")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-02-28 18:57:30 +01:00
Paul Holzinger
15001b39ef conf: set the log level much earlier
--quiet is supposed to silence the "No routable interface" message but
it does not work because the log level was set long after conf_ip4/6()
was called which means it uses the default level which logs everything.

To address this move the log level logic directly after the option
parsing in conf().

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-27 14:08:33 +01:00
Paul Holzinger
b08716551a passt: make --quiet set the log level to warning
Based on the man page and help output --quiet hides informational
messages. This means that warnings should still be logged. This was
discussed in[1].

[1] https://archives.passt.top/passt-dev/20240216114304.7234a83f@elisabeth/T/#m42652824644973674e84baf9e0bf1d0e88104450

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-27 14:08:00 +01:00
David Gibson
e5e6f29459 tcp: Don't store errnos in socket pool
If tcp_sock_refill_pool() gets an error opening new sockets, it stores the
negative errno of that error in the socket pool.  This isn't especially
useful:
  * It's inconsistent with the initial state of the pool (all -1)
  * It's inconsistent with the state of an entry that was valid and was
    then consumed (also -1)
  * By the time we did anything with this error code, it's now far removed
    from the situation in which the error occurred, making it difficult to
    report usefully

We now have error reporting closer to when failures happen on the refill
paths, so just leave a pool slot we can't fill as -1.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-27 12:53:30 +01:00
David Gibson
fe27ebce5c tcp, tcp_splice: Helpers for getting sockets from the pools
We maintain pools of ready-to-connect sockets in both the original and
(for pasta) guest namespace to reduce latency when starting new TCP
connections.  If we exhaust those pools we have to take a higher
latency path to get a new socket.

Currently we open-code that fallback in the places we need it.  To improve
clarity encapsulate that into helper functions.  While we're at it, give
those helpers clearer error reporting.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-27 12:52:46 +01:00
David Gibson
fbe81decbd tcp, tcp_splice: Issue warnings if unable to refill socket pool
Currently if tcp_sock_refill_pool() is unable to fill all the slots in the
pool, it will silently exit.  This might lead to a later attempt to get
fds from the pool to fail at which point it will be harder to tell what
originally went wrong.

Instead add warnings if we're unable to refill any of the socket pools when
requested.  We have tcp_sock_refill_pool() return an error and report it
in the callers, because those callers have more context allowing for a
more useful message.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-27 12:52:44 +01:00
David Gibson
554b3aefe7 tcp: Stop on first error when refilling socket pools
Currently if we get an error opening a new socket while refilling a socket
pool, we carry on to the next slot and try again.  This isn't very useful,
since by far the most likely cause of an error is some sort of resource
exhaustion.  Trying again will probably just hit the same error, and maybe
even make things worse.

So, instead stop on the first error while refilling the pool, making do
with however many sockets we managed to open before the error.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-27 12:52:17 +01:00
David Gibson
af303fdbff tcp: Don't stop refilling socket pool if we find a filled entry
Currently tcp_sock_refill_pool() stops as soon as it finds an entry in the
pool with a valid fd.  This appears to makes sense: we always use fds from
the front of the pool, so if we find a filled one, the rest of the pool
should be filled as well.

However, that's not quite correct.  If a previous refill hit errors trying
to open new sockets, it could leave gaps between blocks of valid fds. We're
going to add some changes that could make that more likely.

So, for robustness, instead skip over the filled entry but still try to
refill the rest of the array.  We expect simply iterating over the pool to
be of small cost compared to even a single system call, so this shouldn't
have much impact.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-27 12:52:14 +01:00
David Gibson
4e08d9b9c6 treewide: Use sa_family_t for address family variables
Sometimes we use sa_family_t for variables and parameters containing a
socket address family, other times we use a plain int.  Since sa_family_t
is what's actually used in struct sockaddr and friends, standardise on
that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-27 12:52:02 +01:00
David Gibson
1e6f92b995 udp: Fix 16-bit overflow in udp_invert_portmap()
The code in udp_invert_portmap() is written based on an incorrect
understanding of C's (arcane) integer promotion rules.  We calculate
'(in_port_t)i + delta' expecting the result to be of type in_port_t (16
bits).  However "small integer types" (those narrower than 'int') are
always promoted to int for expressions, meaning this calculation can
overrun the rdelta[] array.

Fix this, and use a new intermediate for the index, to make it very clear
what it's type is.  We also change i to unsigned, to avoid any possible
confusion from mixing signed and unsigned types.

Link: https://bugs.passt.top/show_bug.cgi?id=80
Reported-by: Laurent Jacquot <jk@lutty.net>
Suggested-by: Laurent Jacquot <jk@lutty.net>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024_02_20.1e6f92b
2024-02-20 08:14:28 +01:00
David Gibson
8954c4a91b udp: Assertion in udp_invert_portmap() can be calculated at compile time
All the values in this ASSERT() are known at compile time, so this can be
converted to a static_assert().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-20 08:14:08 +01:00
Stefano Brivio
ff22a78d7b pasta: Don't try to watch namespaces in procfs with inotify, use timer instead
We watch network namespace entries to detect when we should quit
(unless --no-netns-quit is passed), and these might stored in a tmpfs
typically mounted at /run/user/UID or /var/run/user/UID, or found in
procfs at /proc/PID/ns/.

Currently, we try to use inotify for any possible location of those
entries, but inotify, of course, doesn't work on pseudo-filesystems
(see inotify(7)).

The man page reflects this: the description of --no-netns-quit
implies that we won't quit anyway if the namespace is not "bound to
the filesystem".

Well, we won't quit, but, since commit 9e0dbc894813 ("More
deterministic detection of whether argument is a PID, PATH or NAME"),
we try. And, indeed, this is harmless, as the caveat from that
commit message states.

Now, it turns out that Buildah, a tool to create container images,
sharing its codebase with Podman, passes a procfs entry to pasta, and
expects pasta to exit once the network namespace is not needed
anymore, that is, once the original container process, also spawned
by Buildah, terminates.

Get this to work by using the timer fallback mechanism if the
namespace name is passed as a path belonging to a pseudo-filesystem.
This is expected to be procfs, but I covered sysfs and devpts
pseudo-filesystems as well, because nothing actually prevents
creating this kind of directory structure and links there.

Note that fstatfs(), according to some versions of man pages, was
apparently "deprecated" by the LSB. My reasoning for using it is
essentially this:
  https://lore.kernel.org/linux-man/f54kudgblgk643u32tb6at4cd3kkzha6hslahv24szs4raroaz@ogivjbfdaqtb/t/#u

...that is, there was no such thing as an LSB deprecation, and
anyway there's no other way to get the filesystem type.

Also note that, while it might sound more obvious to detect the
filesystem type using fstatfs() on the file descriptor itself
(c->pasta_netns_fd), the reported filesystem type for it is nsfs, no
matter what path was given to pasta. If we use the parent directory,
we'll typically have either tmpfs or procfs reported.

If the target namespace is given as a PID, or as a PID-based procfs
entry, we don't risk races if this PID is recycled: our handle on
/proc/PID/ns will always refer to the original namespace associated
with that PID, and we don't re-open this entry from procfs to check
it.

There's, however, a remaining race possibility if the parent process
is not the one associated to the network namespace we operate on: in
that case, the parent might pass a procfs entry associated to a PID
that was recycled by the time we parse it. This can't happen if the
namespace PID matches the one of the parent, because we detach from
the controlling terminal after parsing the namespace reference.

To avoid this type of race, if desired, we could add the option for
the parent to pass a PID file descriptor, that the parent obtained
via pidfd_open(). This is beyond the scope of this change.

Update the man page to reflect that, even if the target network
namespace is passed as a procfs path or a PID, we'll now quit when
the procfs entry is gone.

Reported-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/pull/21563#issuecomment-1948200214
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024_02_19.ff22a78
2024-02-19 19:58:50 +01:00
Stefano Brivio
08344dacb1 selinux: Allow pasta to remount procfs
Partially equivalent to commit abf5ef6c22d2 ("apparmor: Allow pasta
to remount /proc, access entries under its own copy"): we should
allow pasta to remount /proc. It still works otherwise, but further
UID remapping in nested user namespaces (e.g. pasta in pasta) won't.

Reported-by: Laurent Jacquot <jk@lutty.net>
Link: https://bugs.passt.top/show_bug.cgi?id=79#c3
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024_02_16.08344da
2024-02-16 09:43:12 +01:00
Stefano Brivio
338b6321ac conf: No routable interface for IPv4 or IPv6 is informational, not a warning
...Podman users might get confused by the fact that if we can't
find a default route for a given IP version, we'll report that as a
warning message and possibly just before actual error messages.

However, a lack of routable interface for IPv4 or IPv6 can be a
normal circumstance: don't warn about it, just state that as
informational message, if those are displayed (they're not in
non-error paths in Podman, for example).

While at it, make it clear that we're disabling IPv4 or IPv6 if
there's no routable interface for the corresponding IP version.

Reported-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/pull/21563#issuecomment-1937024642
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-16 08:47:14 +01:00
Stefano Brivio
8f3f8e190c pasta: Add fallback timer mechanism to check if namespace is gone
We don't know how frequently this happens, but hitting
fs.inotify.max_user_watches or similar sysctl limits is definitely
not out of question, and Paul mentioned that, for example, Podman's
CI environments hit similar issues in the past.

Introduce a fallback mechanism based on a timer file descriptor: we
grab the directory handle at startup, and we can then use openat(),
triggered periodically, to check if the (network) namespace directory
still exists. If openat() fails at some point, exit.

Link: https://github.com/containers/podman/pull/21563#issuecomment-1943505707
Reported-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-16 08:47:14 +01:00
Stefano Brivio
f57a2fb4d5 conf, passt.1: Exit if we can't bind a forwarded port, except for -[tu] all
...or similar, that is, if only excluded ranges are given (implying
we'll forward any other available port). In that case, we'll usually
forward large sets of ports, and it might be inconvenient for the
user to skip excluding single ports that are already taken.

The existing behaviour, that is, exiting only if we fail to bind all
the ports for one given forwarding option, turns out to be
problematic for several aspects raised by Paul:

- Podman merges ranges anyway, so we might fail to bind all the ports
  from a specific range given by the user, but we'll not fail anyway
  because Podman merges it with another one where we succeed to bind
  at least one port. At the same time, there should be no semantic
  difference between multiple ranges given by a single option and
  multiple ranges given as multiple options: it's unexpected and
  not documented

- the user might actually rely on a given port to be forwarded to a
  given container or a virtual machine, and if connections are
  forwarded to an unrelated process, this might raise security
  concerns

- given that we can try and fail to bind multiple ports before
  exiting (in case we can't bind any), we don't have a specific error
  code we can return to the user, so we don't give the user helpful
  indication as to why we couldn't bind ports.

Exit as soon as we fail to create or bind a socket for a given
forwarded port, and report the actual error.

Keep the current behaviour, however, in case the user wants to
forward all the (available) ports for a given protocol, or all the
ports with excluded ranges only. There, it's more reasonable that
the user is expecting partial failures, and it's probably convenient
that we continue with the ports we could forward.

Update the manual page to reflect the new behaviour, and the old
behaviour too in the cases where we keep it.

Suggested-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/pull/21563#issuecomment-1937024642
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Tested-by: Paul Holzinger <pholzing@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-02-16 08:47:14 +01:00
David Gibson
927cb84fff udp: udp_sock_init_ns() partially duplicats udp_port_rebind_outbound()
Usually automatically forwarded UDP outbound ports are set up by
udp_port_rebind_outbound() called from udp_timer().  However, the very
first time they're created and bound is by udp_sock_init_ns() called from
udp_init().  udp_sock_init_ns() is essentially an unnecessary cut down
version of udp_port_rebind_outbound(), so we can jusat remove it.

Doing so does require moving udp_init() below udp_port_rebind_outbound()'s
definition.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-14 03:24:23 +01:00
David Gibson
96ad5c5acd udp: Don't prematurely (and incorrectly) set up automatic inbound forwards
For automated inbound port forwarding in pasta mode we scan bound ports
within the guest namespace via /proc and bind matching ports on the host to
listen for packets.  For UDP this is usually handled by udp_timer() which
calls port_fwd_scan_udp() followed by udp_port_rebind().  However there's
one initial scan before the the UDP timer is started: we call
port_fwd_scan_udp() from port_fwd_init(), and actually bind the resulting
ports in udp_sock_init_init() called from udp_init().

Unfortunately, the version in udp_sock_init_init() isn't correct.  It
unconditionally opens a new socket for every forwarded port, even if a
socket has already been explicit created with the -u option.  If the
explicitly forwarded ports have particular configuration (such as a
specific bound address address, or one implied by the -o option) those will
not be replicated in the new socket.  We essentially leak the original
correctly configured socket, replacing it with one which might not be
right.

We could make udp_sock_init_init() use udp_port_rebind() to get that right,
but there's actually no point doing so:
 * The initial bind was introduced by ccf6d2a7b48d ("udp: Actually bind
   detected namespace ports in init namespace") at which time we didn't
   periodically scan for bound UDP ports.  Periodic scanning was introduced
   in 457ff122e ("udp,pasta: Periodically scan for ports to automatically
   forward") making the bind from udp_init() redundant.
 * At the time of udp_init(), programs in the guest namespace are likely
   not to have started yet (unless attaching a pre-existing namespace) so
   there's likely not anything to scan for anyway.

So, simply remove the initial, broken socket create/bind, allowing
automatic port forwards to be created the first time udp_timer() runs.

Reported-by: Laurent Jacquot <jk@lutty.net>
Suggested-by: Laurent Jacquot <jk@lutty.net>
Link: https://bugs.passt.top/show_bug.cgi?id=79
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-14 03:24:01 +01:00
David Gibson
9f57983886 netlink: Use const rtnh pointer
6c7623d07 ("netlink: Add support to fetch default gateway from multipath
routes") inadvertently introduced a new cppcheck warning for a variable
which could be a const pointer but isn't.  This occurs with
cppcheck-2.13.0-1.fc39.x86_64 in Fedora 39 at least.

Fixes: 6c7623d07bbd ("netlink: Add support to fetch default gateway from multipath routes")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-14 01:10:47 +01:00
Stefano Brivio
7ee4e17267 log: setlogmask(0) can actually result in a system call, don't use it
Before commit 32d07f5e59f2 ("passt, pasta: Completely avoid dynamic
memory allocation"), we didn't store the current log mask in a
variable, and we fetched it using setlogmask(0) wherever needed.

But after that commit, we can use our log_mask copy instead. And we
should: with recent glibc versions, setlogmask(0) actually results in
a system call, which causes a substantial overhead with high transfer
rates: we use setlogmask(0) even to decide we don't want to print
debug messages.

Now that we rely on log_mask in early stages, before setlogmask() is
called, we need to initialise that variable to the special LOG_EMERG
mask value right away: define LOG_EARLY to make this clearer, and,
while at it, group conditions in vlogmsg() into something more terse.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-02-14 01:10:11 +01:00
David Gibson
78901eeae3 tcp: Fix subtle bug in fast re-transmit path
When a duplicate ack from the tap side triggers a fast re-transmit, we set
both conn->seq_ack_from_tap and conn->seq_to_tap to the sequence number of
the duplicate ack.  Setting seq_to_tap is correct: this is what triggers
the retransmit from this point onwards.  Setting seq_ack_from_tap is
not correct, though.

In most cases setting seq_ack_from_tap will be redundant but harmless:
it will have already been updated to the same value by
tcp_update_seqack_from_tap() a few lines above.  However that call can
be skipped if tcp_sock_consume() fails, which is rare but possible.  In
that case this update will cause problems.

We use seq_ack_from_tap to track two logically distinct things: how much of
the stream has been acked by the guest, and how much of the stream from the
socket has been read and discarded (as opposed to MSG_PEEKed).  We attempt
to keep those values the same, because we discard data exactly when it is
acked by the guest.  However tcp_sock_consume() failing means we weren't
able to disard the acked data.  To handle that case, we skip the usual
update of seq_ack_from_tap, effectively ignoring the ack assuming we'll get
one which supersedes it soon enough.  Setting seq_ack_from_tap in the
fast retransmit path, however, means we now really will have the
read/discard point in the stream out of sync with seq_ack_from_tap.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-11 18:30:01 +01:00
Stefano Brivio
6c7623d07b netlink: Add support to fetch default gateway from multipath routes
If the default route for a given IP version is a multipath one,
instead of refusing to start because there's no RTA_GATEWAY attribute
in the set returned by the kernel, we can just pick one of the paths.

To make this somewhat less arbitrary, pick the path with the highest
weight, if weights differ.

Reported-by: Ed Santiago <santiago@redhat.com>
Link: https://github.com/containers/podman/issues/20927
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-02-09 13:24:33 +01:00
David Gibson
322660b0b9 icmp: Dedicated functions for starting and closing ping sequences
ICMP sockets are cleaned up on a timeout implemented in icmp_timer_one(),
and the logic to do that cleanup is open coded in that function.  Similarly
new sockets are opened when we discover we don't have an existing one in
icmp_tap_handler(), and again the logic is open-coded.

That's not the worst thing, but it's a bit cleaner to have dedicated
functions for the creation and destruction of ping sockets.  This will also
make things a bit easier for future changes we have in mind.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:36:56 +01:00