1
0
mirror of https://passt.top/passt synced 2025-01-27 14:45:17 +00:00

1044 Commits

Author SHA1 Message Date
David Gibson
1a3ade9037 nstool: Enter holder's cwd when changing mount ns with nstool exec
If we enter a mount namespace with nstool exec our working directory will
be changed to / in the new mount ns.  This is surprising if we haven't
actually altered any mounts yet in the new ns.  Instead, change the working
directory to match that of the holder process in this situation.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-04-08 01:12:12 +02:00
David Gibson
98031bee73 nstool: Advertise the holder's cwd (in its mountns) across the socket
This is possible useful in nstool info and has further uses for nstool
exec.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-04-08 01:12:10 +02:00
David Gibson
469b69aaa1 test: Use "nstool exec" to slightly simplify tests
Using this, rather than using "nstool info" to get the pid then manually
connecting with nsenter makes things a little simpler.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-04-08 01:12:08 +02:00
David Gibson
3372cd0902 test: Initialise ${TRACE} properly
Unlike ${DEBUG} we don't initialize ${TRACE} to 0 if not set, which cases
failures when testing it later.  That failure acts as though it is false,
however it emits spurious errors in script.log, which can make it harder to
spot real errors.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-04-08 01:12:05 +02:00
David Gibson
329149d51a nstool: Add --keep-caps option to nstool exec
This allows you to run commands within a user namespace with the
privilege that comes from owning that userns.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-04-08 01:12:03 +02:00
David Gibson
0b66944648 nstool: Add nstool exec command to execute commands in an nstool namespace
This combines nstool info -pw <sock> with nsenter with various options for
a more convenient and less verbose of entering existing nstool managed
namespaces.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-04-08 01:12:01 +02:00
David Gibson
3bcbca5db8 nstool: Helpers to iterate through namespace types
Will make things a bit less verbose in future.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-04-08 01:11:59 +02:00
David Gibson
f6a9ea3af5 nstool: Add magic number to advertized information
So that we'll probably give a better error if you point it at something
that's not an nstool hold control socket.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-04-08 01:11:57 +02:00
David Gibson
4311066bdb nstool: Detect what namespaces target is in
Give nstool the ability to detect what namespaces the target process is in,
relative to where it's called.  That is, those namespace types for which
the target is not in the same namespace as the caller.  For now, just
print this information with "info", which can be useful for debugging.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-04-08 01:11:55 +02:00
David Gibson
fd4a752e92 nstool: Replace "pid" subcommand with "info" subcommand
The new subcommand gives more information about the holder process and its
namespace, and may be further extended in future.  Add some options which
give the old behaviour for existing scripts.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-04-08 01:11:52 +02:00
David Gibson
a4b017d91c nstool: Split some command line parsing and socket setup to subcommands
This will make it easier to differentiate the options to those commands
further in future.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-04-08 01:11:50 +02:00
David Gibson
42fb218347 nstool: Move description of its operation modes from comment to usage
Easier to see it there.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-04-08 01:11:48 +02:00
David Gibson
2884ccd2e7 nstool: Reverse parameters to nstool
Having the "subcommand" first is more conventional and will make it more
natural for future extensions I have planned.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-04-08 01:11:43 +02:00
David Gibson
4914fce77b nstool: Rename nsholder to nstool
In preparation for extending what it does.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-04-08 01:11:41 +02:00
David Gibson
55bbe3dbcb test: Remove race between commands run in the same context
context_run() has a race condition if two commands are run in close
proximity (generally involving at least one in the background).  Because we
always use the same name for the temporary fifo files, if another command
is issued while the fifos for the first still exist, mkfifo will fail,
typically causing the entire test script to jam.

Create unique names for the temporary fifos to avoid this problem.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-04-08 01:11:36 +02:00
Stefano Brivio
ca2749e1bd passt: Relicense to GPL 2.0, or any later version
In practical terms, passt doesn't benefit from the additional
protection offered by the AGPL over the GPL, because it's not
suitable to be executed over a computer network.

Further, restricting the distribution under the version 3 of the GPL
wouldn't provide any practical advantage either, as long as the passt
codebase is concerned, and might cause unnecessary compatibility
dilemmas.

Change licensing terms to the GNU General Public License Version 2,
or any later version, with written permission from all current and
past contributors, namely: myself, David Gibson, Laine Stump, Andrea
Bolognani, Paul Holzinger, Richard W.M. Jones, Chris Kuhn, Florian
Weimer, Giuseppe Scrivano, Stefan Hajnoczi, and Vasiliy Ulyanov.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-04-06 18:00:33 +02:00
Stefano Brivio
b10b983fbd fedora: Adjust path for SELinux policy and interface file to latest guidelines
Forget about:
  https://fedoraproject.org/wiki/SELinux_Policy_Modules_Packaging_Draft

and:
  https://fedoraproject.org/wiki/PackagingDrafts/SELinux_Independent_Policy

The guidelines to follow are:
  https://fedoraproject.org/wiki/SELinux/IndependentPolicy

Start from fixing the most pressing issue, that is, a path conflict
with policy-selinux-devel about passt.if, and, while at it, adjust
the installation paths for policy files too.

Reported-by: Xose Vazquez Perez <xose.vazquez@gmail.com>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2182476
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023_03_29.b10b983
2023-03-29 22:11:07 +02:00
Stefano Brivio
387f4aca74 fedora: Don't install useless SELinux interface file for pasta
That was meant to be an example, and I just dropped it in the
previous commit -- passt.if should be more than enough as a possible
example.

Reported-by: Carl G. <carlg@fedoraproject.org>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2182145
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-03-29 13:48:12 +02:00
Stefano Brivio
dafd92d555 selinux: Drop useless interface file for pasta
This was meant to be an example, but I managed to add syntax errors
to it. Drop it altogether.

Reported-by: Carl G. <carlg@fedoraproject.org>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2182145
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-03-29 13:48:12 +02:00
Stefano Brivio
98a9a7d9e5 conf: Allow binding to ports on an interface without a specific address
Somebody might want to bind listening sockets to a specific
interface, but not a specific address, and there isn't really a
reason to prevent that. For example:

  -t %eth0/2022

Alternatively, we support options such as -t 0.0.0.0%eth0/2022 and
-t ::%eth0/2022, but not together, for the same port.

Enable this kind of syntax and add examples to the man page.

Reported-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/issues/14425#issuecomment-1485192195
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-03-29 13:48:12 +02:00
Stefano Brivio
33d88f79d9 tcp: Clear ACK_FROM_TAP_DUE also on unchanged ACK sequence from peer
Since commit cc6d8286d104 ("tcp: Reset ACK_FROM_TAP_DUE flag only as
needed, update timer"), we don't clear ACK_FROM_TAP_DUE whenever we
process an ACK segment, but, more correctly, only if we're really not
waiting for a further ACK segment, that is, only if the acknowledged
sequence matches what we sent.

In the new function implementing this, tcp_update_seqack_from_tap(),
we also reset the retransmission counter and store the updated ACK
sequence. Both should be done iff forward progress is acknowledged,
implied by the fact that the new ACK sequence is greater than the
one we previously stored.

At that point, it looked natural to also include the statements that
clear and set the ACK_FROM_TAP_DUE flag inside the same conditional
block: if we're not making forward progress, the need for an ACK, or
lack thereof, should remain unchanged.

There might be cases where this isn't true, though: without the
previous commit 4e73e9bd655c ("tcp: Don't special case the handling
of the ack of a syn"), this would happen if a tap-side client
initiated a connection, and the server didn't send any data.

At that point we would never, in the established state of the
connection, call tcp_update_seqack_from_tap() with reported forward
progress.

That issue itself is fixed by the previous commit, now, but clearing
ACK_FROM_TAP_DUE only on ACK sequence progress doesn't really follow
any logic.

Clear the ACK_FROM_TAP_DUE flag regardless of reported forward
progress. If we clear it when it's already unset, conn_flag() will do
nothing with it.

This doesn't fix any known functional issue, rather a conceptual one.

Fixes: cc6d8286d104 ("tcp: Reset ACK_FROM_TAP_DUE flag only as needed, update timer")
Reported-by: David Gibson <david@gibson.dropbear.id.au>
Analysed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-03-29 13:47:17 +02:00
David Gibson
4e73e9bd65 tcp: Don't special case the handling of the ack of a syn
TCP treats the SYN packets as though they occupied 1 byte in the logical
data stream described by the sequence numbers.  That is, the very first ACK
(or SYN-ACK) each side sends should acknowledge a sequence number one
greater than the initial sequence number given in the SYN or SYN-ACK it's
responding to.

In passt we were tracking that by advancing conn->seq_to_tap by one when
we send a SYN or SYN-ACK (in tcp_send_flag()).  However, we also
initialized conn->seq_ack_from_tap, representing the acks we've already
seen from the tap side, to ISN+1, meaning we treated it has having
acknowledged the SYN before it actually did.

There were apparently reasons for this in earlier versions, but it causes
problems now.  Because of this when we actually did receive the initial ACK
or SYN-ACK, we wouldn't see the acknoweldged serial number as advancing,
and so wouldn't clear the ACK_FROM_TAP_DUE flag.

In most cases we'd get away because subsequent packets would clear the
flag.  However if one (or both) sides didn't send any data, the other side
would (correctly) keep sending ISN+1 as the acknowledged sequence number,
meaning we would never clear the ACK_FROM_TAP_DUE flag.  That would mean
we'd treat the connection as if we needed to retransmit (although we had
0 bytes to retransmit), and eventaully (after around 30s) reset the
connection due to too many retransmits.  Specifically this could cause the
iperf3 throughput tests in the testsuite to fail if set for a long enough
test period.

Correct this by initializing conn->seq_ack_from_tap to the ISN and only
advancing it when we actually get the first ACK (or SYN-ACK).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-03-29 13:47:07 +02:00
David Gibson
085672f77c tcp: Clarify allowed state for tcp_data_from_tap()
Comments suggest that this should only be called for an ESTABLISHED
connection.  However, it's non-trivial to ascertain that from the actual
control flow in the caller.  Add an ASSERT() to make it very clear that
this is only called in ESTABLISHED state.

In fact, there were some circumstances where it could be called on a CLOSED
connection.  In a sense that is "established", but with that assert this
does require specific (trivial) handling to avoid a spurious abort().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-03-29 13:46:58 +02:00
Stefano Brivio
1ee2f7cada tcp: Don't reset ACK_TO_TAP_DUE on any ACK, reschedule timer as needed
This is mostly symmetric with commit cc6d8286d104 ("tcp: Reset
ACK_FROM_TAP_DUE flag only as needed, update timer"): we shouldn't
reset the ACK_TO_TAP_DUE flag on any inbound ACK segment, but only
once we acknowledge everything we received from the guest or the
container.

If we don't, a client might unnecessarily hold off further data,
especially during slow start, and in general we won't converge to the
usable bandwidth.

This is very visible especially with traffic tests on links with
non-negligible latency, such as in the reported issue. There, a
public iperf3 server sometimes aborts the test due do what appears to
be a low iperf3's --rcv-timeout (probably less than a second). Even
if this doesn't happen, the throughput will converge to a fraction of
the usable bandwidth.

Clear ACK_TO_TAP_DUE if we acknowledged everything, set it if we
didn't, and reschedule the timer in case the flag is still set as the
timer expires.

While at it, decrease the ACK timer interval to 10ms.

A 50ms interval is short enough for any bandwidth-delay product I had
in mind (local connections, or non-local connections with limited
bandwidth), but here I am, testing 1gbps transfers to a peer with
100ms RTT.

Indeed, we could eventually make the timer interval dependent on the
current window and estimated bandwidth-delay product, but at least
for the moment being, 10ms should be long enough to avoid any
measurable syscall overhead, yet usable for any real-world
application.

Reported-by: Lukas Mrtvy <lukas.mrtvy@gmail.com>
Link: https://bugs.passt.top/show_bug.cgi?id=44
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023_03_21.1ee2f7c
2023-03-21 23:14:58 +01:00
Stefano Brivio
9ffccf7acc tcp: When a connection flag it set, don't negate it for debug print
Fix a copy and paste typo I added in commit 5474bc5485d8 ("tcp,
tcp_splice: Get rid of false positive CWE-394 Coverity warning from
fls()") and --debug altogether.

Fixes: 5474bc5485d8 ("tcp, tcp_splice: Get rid of false positive CWE-394 Coverity warning from fls()")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-03-21 19:39:55 +01:00
David Gibson
89d1494974 Fix false positive if cppcheck doesn't give a false positive
da46fdac "tcp: Suppress knownConditionTrueFalse cppcheck false positive"
introduced a suppression to work around a cppcheck bug causing a false
positive warning.  However, the suppression will itself cause a spurious
unmatchedSuppression warning if used with a version of cppcheck from before
the bug was introduced.  That includes the packaged version of cppcheck in
Fedora.

Suppress the unmatchedSuppression as well.

Fixes: da46fdac3605 ("tcp: Suppress knownConditionTrueFalse cppcheck false positive")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-03-21 16:38:44 +01:00
David Gibson
34ade90957 Work around weird false positives with cppcheck-2.9.1
Commit 89e38f55 "treewide: Fix header includes to build with musl" added
extra #includes to work with musl.  Unfortunately with the cppcheck version
I'm using (cppcheck-2.9-1.fc37.x86_64 in Fedora 37) this causes weird false
positives: specifically cppcheck seems to hit a #error in <bits/unistd.h>
complaining about including it directly instead of via <unistd.h> (which is
not something we're doing).

I have no idea why that would be happening; but I'm guessing it has to be
a bug in the cpp implementation in that cppcheck version.  In any case,
it's possible to work around this by moving the include of <unistd.h>
before the include of <signal.h>.  So, do that.

Fixes: 89e38f55405d ("treewide: Fix header includes to build with musl")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-03-21 16:38:06 +01:00
Stefano Brivio
ccf6d2a7b4 udp: Actually bind detected namespace ports in init namespace
When I reworked udp_init() to move most of the port binding logic
to conf_ports, I accidentally dropped this bit of automatic port
detection (and binding) at start-up.

On -U auto, in pasta mode, udp_sock_init_ns() binds ports in the
namespace that correspond to ports bound in the init namespace,
but on -u auto, nothing actually happens after port detection.

Add udp_sock_init_init() to deal with this, and while at it fix
the comment to udp_sock_init_ns(): the latter takes care of
outbound "connections".

This is currently not covered by tests, and the UDP port needs to
be already bound in the namespace when pasta starts (periodic
detection for UDP is a missing feature at the moment). It can be
checked like this:

  $ unshare -rUn
  # echo $$
  590092
  # socat -u UDP-LISTEN:5555 STDOUT

  $ pasta -q -u auto 590092
  $ echo "test" | socat -u STDIN UDP:localhost:5555

Reported-by: Paul Holzinger <pholzing@redhat.com>
Fixes: 3c6ae625101a ("conf, tcp, udp: Allow address specification for forwarded ports")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-03-21 16:19:04 +01:00
Paul Holzinger
418f75ac37 pasta: fix tcp port forwarding in auto mode
The logic in tcp_timer() was inverted. fwd_out should expose the host
ports in the ns. Therfore it must read the ports on the host and then
bind them in the netns. The same for fwd_in which checks ports in the
ns and then exposes them on the host.

Note that this only fixes tcp ports, udp does not seems to work at all
right now with the auto mode.

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Fixes: 1128fa03fe73 ("Improve types and names for port forwarding configuration")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-03-21 14:18:54 +01:00
Stefano Brivio
dd23496619 fedora: Refresh SELinux labels in scriptlets, require -selinux package
Instead of:
  https://fedoraproject.org/wiki/SELinux_Policy_Modules_Packaging_Draft

follow this:
  https://fedoraproject.org/wiki/PackagingDrafts/SELinux_Independent_Policy

which seems to make more sense and fixes the issue that, on a fresh
install, without a reboot, the file contexts for the binaries are not
actually updated.

In detail:

- labels are refreshed using the selinux_relabel_pre and
  selinux_relabel_post on install, upgrade, and uninstall

- use the selinux_modules_install and selinux_modules_uninstall
  macros, instead of calling 'semodule' directly (no functional
  changes in our case)

- require the -selinux package on SELinux-enabled environments and if
  the current system policy is "targeted"

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023_03_17.dd23496
2023-03-17 08:26:07 +01:00
Stefano Brivio
87a655045b Makefile: Enable external override for TARGET
A cross-architecture build might pass a target-specific CC on 'make',
and not on 'make install', and this is what happens in Debian
cross-qa tests.

Given that we select binaries to be installed depending on the target
architecture, this means we would build AVX2 binaries in any case on
a x86_64 build machine.

By overriding TARGET in package build rules, we can tell the Makefile
about the target architecture, also for the 'install' (Makefile)
target.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-03-17 08:26:07 +01:00
Stefano Brivio
7727804658 passt.1: Fix description of --mtu option
By default, 65520 bytes are advertised, and zero disables DHCP and
NDP options.

Fixes: ec2b58ea4dc4 ("conf, dhcp, ndp: Fix message about default MTU, make NDP consistent")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-03-17 08:26:07 +01:00
Stefano Brivio
4e6178fd46 log: Avoid time_t/__syscall_slong_t format mismatch with long int on X32 ABI
On X32 (ILP32 using AMD64 system call ABI) and glibc,
struct timespec::tv_nsec is __syscall_slong_t and not a long int, see
also https://sourceware.org/bugzilla/show_bug.cgi?id=16437 and
timespec(3type). Fine, we could cast that down to long and be done
with it.

But it turns out that also time_t (not guaranteed to be equivalent to
any type) is a long long int, and there we can't downcast.

To keep it simple, cast both to long long int, and change formats to
%lli, to avoid format warnings from gcc.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-03-17 08:25:56 +01:00
Stefano Brivio
70c0765b49 fedora: Install SELinux interface files to shared include directory
Link: https://github.com/fedora-selinux/selinux-policy/pull/1613
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023_03_10.70c0765
2023-03-10 20:01:41 +01:00
Stefano Brivio
93105ea066 contrib/selinux: Split interfaces into smaller bits
...to fit accepted Fedora practices.

Link: https://github.com/fedora-selinux/selinux-policy/pull/1613
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-03-10 20:01:41 +01:00
Stefano Brivio
dcdc50fc22 contrib/selinux: Drop unused passt_read_data() interface
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-03-10 20:01:41 +01:00
Stefano Brivio
9f35cf0b11 contrib/selinux: Drop "example" from headers: this is the actual policy
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-03-10 20:01:41 +01:00
Stefano Brivio
7c7625ddff README: Update Features section, plus minor improvements
...it's been a while.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023_03_09.7c7625d
2023-03-09 03:44:21 +01:00
Stefano Brivio
294d6dc4c6 contrib: Drop libvirt out-of-tree patch, integration mostly works in 9.1.0
...and in any case, this patch doesn't offer any advantage over the
current upstream integration.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-03-09 03:44:21 +01:00
Stefano Brivio
42fb62516d contrib: Drop QEMU out-of-tree patches
Native support was introduced with commit 13c6be96618c, QEMU 7.2.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-03-09 03:44:21 +01:00
Stefano Brivio
f3cd0f9e45 contrib: Drop Podman out-of-tree patch, integration is upstream now
See https://github.com/containers/podman/pull/16141, shipped in
Podman 4.4.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-03-09 03:44:21 +01:00
Stefano Brivio
d7272f1df8 tcp: Clamp MSS value when queueing data to tap, also for pasta
Tom reports that a pattern of repated ~1 MiB chunks downloads over
NNTP over TLS, on Podman 4.4 using pasta as network back-end, results
in pasta taking one full CPU thread after a while, and the download
never succeeds.

On that setup, we end up re-sending the same frame over and over,
with a consistent 65 534 bytes size, and never get an
acknowledgement from the tap-side client. This only happens for the
default MTU value (65 520 bytes) or for values that are slightly
smaller than that (down to 64 499 bytes).

We hit this condition because the MSS value we use in
tcp_data_from_sock(), only in pasta mode, is simply clamped to
USHRT_MAX, and not to the actual size of the buffers we pre-cooked
for sending, which is a bit less than that.

It looks like we got away with it until commit 0fb7b2b9080a ("tap:
Use different io vector bases depending on tap type") fixed the
setting of iov_len.

Luckily, since it's pasta, we're queueing up to two frames at a time,
so the worst that can happen is a badly segmented TCP stream: we
always have some space at the tail of the buffer.

Clamp the MSS value to the appropriate maximum given by struct
tcp{4,6}_buf_data_t, no matter if we're running in pasta or passt
mode.

While at it, fix the comments to those structs to reflect the current
struct size. This is not really relevant for any further calculation
or consideration, but it's convenient to know while debugging this
kind of issues.

Thanks to Tom for reporting the issue in a very detailed way and for
providing a test setup.

Reported-by: Tom Mombourquette <tom@devnode.com>
Link: https://github.com/containers/podman/issues/17703
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-03-09 03:44:21 +01:00
Stefano Brivio
bb2b67cb35 conf: Terminate on EMFILE or ENFILE on sockets for port mapping
In general, we don't terminate or report failures if we fail to bind
to some ports out of a given port range specifier, to allow users to
conveniently specify big port ranges (or "all") without having to
care about ports that might already be in use.

However, running out of the open file descriptors quota is a
different story: we can't do what the user requested in a very
substantial way.

For example, if the user specifies '-t all' and we can only bind
1024 sockets, the behaviour is rather unexpected.

Fail whenever socket creation returns -ENFILE or -EMFILE.

Link: https://bugs.passt.top/show_bug.cgi?id=27
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-03-09 03:44:21 +01:00
Stefano Brivio
5aea2f88ab tcp, udp: Fix partial success return codes in {tcp,udp}_sock_init()
The comments say we should return 0 on partial success, and an error
code on complete failure. Rationale: if the user configures a port
forwarding, and we succeed to bind that port for IPv4 or IPv6 only,
that might actually be what the user intended.

Adjust the two functions to reflect the comments.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-03-09 03:44:21 +01:00
Stefano Brivio
73992c42ce tcp, udp, util: Pass socket creation errors all the way up
...starting from sock_l4(), pass negative error (errno) codes instead
of -1. They will only be used in two commits from now, no functional
changes intended here.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-03-09 03:44:21 +01:00
Stefano Brivio
50687616e4 util: Carry own definition of __bswap_constant{16,32}
musl doesn't define those, use our own definition there. This is a
trivial implementation, similar to what's shipped by e.g. uClibc,
glibc, libiio.

Reported-by: Chris Kuhn <kuhnchris+github@kuhnchris.eu>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-03-09 03:44:21 +01:00
Chris Kuhn
89e38f5540 treewide: Fix header includes to build with musl
Roughly inspired from a patch by Chris Kuhn: fix up includes so that
we can build against musl: glibc is more lenient as headers generally
include a larger amount of other headers.

Compared to the original patch, I only included what was needed
directly in C files, instead of adding blanket includes in local
header files. It's a bit more involved, but more consistent with the
current (not ideal) situation.

Reported-by: Chris Kuhn <kuhnchris+github@kuhnchris.eu>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-03-09 03:44:21 +01:00
Chris Kuhn
5c58feab7b conf, passt: Rename stderr to force_stderr
While building against musl, gcc informs us that 'stderr' is a
protected keyword. This probably comes from a #define stderr (stderr)
in musl's stdio.h, to avoid a clash with extern FILE *const stderr,
but I didn't really track it down. Just rename it to force_stderr, it
makes more sense.

[sbrivio: Added commit message]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-03-09 03:44:21 +01:00
Stefano Brivio
fde8004ab0 netlink: Use 8 KiB * netlink message header size as response buffer
...instead of BUFSIZ. On musl, BUFSIZ is 1024, so we'll typically
truncate the response to the request we send in nl_link(). It's
usually 8192 or more with glibc.

There doesn't seem to be any macro defining the rtnetlink maximum
message size, and iproute2 just hardcodes 1024 * 1024 for the receive
buffer, but the example in netlink(7) makes somewhat sense, looking
at the kernel implementation.

It's not very clean, but we're very unlikely to hit that limit,
and if we do, we'll find out painlessly, because NLA_OK() will tell
us right away.

Reported-by: Chris Kuhn <kuhnchris+passt@kuhnchris.eu>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-03-09 03:44:21 +01:00
Stefano Brivio
a9c59dd91b conf, icmp, tcp, udp: Add options to bind to outbound address and interface
I didn't notice earlier: libslirp (and slirp4netns) supports binding
outbound sockets to specific IPv4 and IPv6 addresses, to force the
source addresse selection. If we want to claim feature parity, we
should implement that as well.

Further, Podman supports specifying outbound interfaces as well, but
this is simply done by resolving the primary address for an interface
when the network back-end is started. However, since kernel version
5.7, commit c427bfec18f2 ("net: core: enable SO_BINDTODEVICE for
non-root users"), we can actually bind to a specific interface name,
which doesn't need to be validated in advance.

Implement -o / --outbound ADDR to bind to IPv4 and IPv6 addresses,
and --outbound-if4 and --outbound-if6 to bind IPv4 and IPv6 sockets
to given interfaces.

Given that it probably makes little sense to select addresses and
routes from interfaces different than the ones given for outbound
sockets, also assign those as "template" interfaces, by default,
unless explicitly overridden by '-i'.

For ICMP and UDP, we call sock_l4() to open outbound sockets, as we
already needed to bind to given ports or echo identifiers, and we
can bind() a socket only once: there, pass address (if any) and
interface (if any) for the existing bind() and setsockopt() calls.

For TCP, in general, we wouldn't otherwise bind sockets. Add a
specific helper to do that.

For UDP outbound sockets, we need to know if the final destination
of the socket is a loopback address, before we decide whether it
makes sense to bind the socket at all: move the block mangling the
address destination before the creation of the socket in the IPv4
path. This was already the case for the IPv6 path.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-03-09 03:43:59 +01:00