1
0
mirror of https://passt.top/passt synced 2024-06-29 22:42:46 +00:00
Commit Graph

64 Commits

Author SHA1 Message Date
David Gibson
eed17a47fe Handle userns isolation and dropping root at the same time
passt/pasta can interact with user namespaces in a number of ways:
   1) With --netns-only we'll remain in our original user namespace
   2) With --userns or a PID option to pasta we'll join either the given
      user namespace or that of the PID
   3) When pasta spawns a shell or command we'll start a new user namespace
      for the command and then join it
   4) With passt we'll create a new user namespace when we sandbox()
      ourself

However (3) and (4) turn out to have essentially the same effect.  In both
cases we create one new user namespace.  The spawned command starts there,
and passt/pasta itself will live there from sandbox() onwards.

Because of this, we can simplify user namespace handling by moving the
userns handling earlier, to the same point we drop root in the original
namespace.  Extend the drop_user() function to isolate_user() which does
both.

After switching UID and GID in the original userns, isolate_user() will
either join or create the userns we require.  When we spawn a command with
pasta_start_ns()/pasta_setup_ns() we no longer need to create a userns,
because we're already made one.  sandbox() likewise no longer needs to
create (or join) an userns because we're already in the one we need.

We no longer need c->pasta_userns_fd, since the fd is only used locally
in isolate_user().  Likewise we can replace c->netns_only with a local
in conf(), since it's not used outside there.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2022-09-13 05:31:51 +02:00
David Gibson
d72a1e7bb9 Move self-isolation code into a separate file
passt/pasta contains a number of routines designed to isolate passt from
the rest of the system for security.  These are spread through util.c and
passt.c.  Move them together into a new isolation.c file.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2022-09-13 05:31:51 +02:00
David Gibson
80d7012b09 Consolidate determination of UID/GID to run as
Currently the logic to work out what UID and GID we will run as is spread
across conf().  If --runas is specified it's handled in conf_runas(),
otherwise it's handled by check_root(), which depends on initialization of
the uid and gid variables by either conf() itself or conf_runas().

Make this clearer by putting all the UID and GID logic into a single
conf_ugid() function.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2022-09-13 05:31:51 +02:00
David Gibson
10c6347747 Split checking for root from dropping root privilege
check_root() both checks to see if we are root (in the init namespace),
and if we are drops to an unprivileged user.  To make future cleanups
simpler, split the checking for root (now in check_root()) from the actual
dropping of privilege (now in drop_root()).

Note that this does slightly alter semantics.  Previously we would only
setuid() if we were originally root (in the init namespace).  Now we will
always setuid() and setgid(), though it won't actually change anything if
we weren't privileged to begin with.  This also means that we will now
always attempt to switch to the user specified with --runas, even if we
aren't (init namespace) root to begin with.  Obviously this will fail with
an error if we weren't privileged to start with.  --help and the man page
are updated accordingly.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2022-09-13 05:31:51 +02:00
David Gibson
7330ae3abf Don't store UID & GID persistently in the context structure
c->uid and c->gid are first set in conf(), and last used in check_root()
itself called from conf().  Therefore these don't need to be fields in the
long lived context structure and can instead be locals in conf().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2022-09-13 05:31:51 +02:00
Stefano Brivio
9672ab8dd0 util: Drop any supplementary group before dropping privileges
Commit a951e0b9ef ("conf: Add --runas option, changing to given UID
and GID if started as root") dropped the call to initgroups() that
used to add supplementary groups corresponding to the user we'll
eventually run as -- we don't need those.

However, if the original user belongs to supplementary groups
(usually not the case, if started as root), we don't drop those,
now, and rpmlint says:

  passt.x86_64: E: missing-call-to-setgroups-before-setuid /usr/bin/passt
  passt.x86_64: E: missing-call-to-setgroups-before-setuid /usr/bin/passt.avx2

Add a call to setgroups() with an empty set, to drop any
supplementary group we might currently have, before changing GID
and UID.

Reported-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-08-30 19:15:44 +02:00
David Gibson
16f5586bb8 Make substructures for IPv4 and IPv6 specific context information
The context structure contains a batch of fields specific to IPv4 and to
IPv6 connectivity.  Split those out into a sub-structure.

This allows the conf_ip4() and conf_ip6() functions, which take the
entire context but touch very little of it, to be given more specific
parameters, making it clearer what it affects without stepping through the
code.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2022-07-30 22:14:07 +02:00
David Gibson
4b2e018d70 Allow different external interfaces for IPv4 and IPv6 connectivity
It's quite plausible for a host to have both IPv4 and IPv6 connectivity,
but only via different interfaces.  For example, this will happen in the
case that IPv6 connectivity is via a tunnel (e.g. 6in4 or 6rd).  It would
also happen in the case that IPv4 access is via a tunnel on an otherwise
IPv6 only local network, which is a setup that might become more common in
the post IPv4 address exhaustion world.

In turns out there's no real need for passt/pasta to get its IPv4 and IPv6
connectivity via the same interface, so we can handle this situation fairly
easily.  Change the core to allow eparate external interfaces for IPv4 and
IPv6.  We don't actually set these separately for now.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2022-07-30 21:50:41 +02:00
Stefano Brivio
f3198c4a06 util: Fix debug print on failed SO_REUSEADDR setting in sock_l4()
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-07-14 01:36:05 +02:00
David Gibson
cbac0245c8 Remove unused line_read()
The old, ugly implementation of line_read() is no longer used.  Remove it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2022-07-06 08:10:55 +02:00
David Gibson
cf83df4574 Use new lineread implementation for procfs_scan_listen()
Use the new more solid implementation of line by line reading for
procfs_scan_listen().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2022-07-06 08:10:55 +02:00
Stefano Brivio
a951e0b9ef conf: Add --runas option, changing to given UID and GID if started as root
On some systems, user and group "nobody" might not be available. The
new --runas option allows to override the default "nobody" choice if
started as root.

Now that we allow this, drop the initgroups() call that was used to
add any additional groups for the given user, as that might now
grant unnecessarily broad permissions. For instance, several
distributions have a "kvm" group to allow regular user access to
/dev/kvm, and we don't need that in passt or pasta.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-05-19 16:27:20 +02:00
Stefano Brivio
3c6ae62510 conf, tcp, udp: Allow address specification for forwarded ports
This feature is available in slirp4netns but was missing in passt and
pasta.

Given that we don't do dynamic memory allocation, we need to bind
sockets while parsing port configuration. This means we need to
process all other options first, as they might affect addressing and
IP version support. It also implies a minor rework of how TCP and UDP
implementations bind sockets.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-05-01 07:19:05 +02:00
Stefano Brivio
22ed4467a4 treewide: Unchecked return value from library, CWE-252
All instances were harmless, but it might be useful to have some
debug messages here and there. Reported by Coverity.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-04-07 11:44:35 +02:00
Stefano Brivio
62c3edd957 treewide: Fix android-cloexec-* clang-tidy warnings, re-enable checks
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-03-29 15:35:38 +02:00
Stefano Brivio
48582bf47f treewide: Mark constant references as const
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-03-29 15:35:38 +02:00
Stefano Brivio
bb70811183 treewide: Packet abstraction with mandatory boundary checks
Implement a packet abstraction providing boundary and size checks
based on packet descriptors: packets stored in a buffer can be queued
into a pool (without storage of its own), and data can be retrieved
referring to an index in the pool, specifying offset and length.

Checks ensure data is not read outside the boundaries of buffer and
descriptors, and that packets added to a pool are within the buffer
range with valid offset and indices.

This implies a wider rework: usage of the "queueing" part of the
abstraction mostly affects tap_handler_{passt,pasta}() functions and
their callees, while the "fetching" part affects all the guest or tap
facing implementations: TCP, UDP, ICMP, ARP, NDP, DHCP and DHCPv6
handlers.

Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-03-29 15:35:38 +02:00
Stefano Brivio
3e4c2d1098 util: Fix function declaration style of write_pidfile()
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-03-29 15:35:38 +02:00
Stefano Brivio
3eb19cfd8a tcp, udp, util: Enforce 24-bit limit on socket numbers
This should never happen, but there are no formal guarantees: ensure
socket numbers are below SOCKET_MAX.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-03-29 15:35:38 +02:00
Stefano Brivio
e5eefe7743 tcp: Refactor to use events instead of states, split out spliced implementation
Using events and flags instead of states makes the implementation
much more straightforward: actions are mostly centered on events
that occurred on the connection rather than states.

An example is given by the ESTABLISHED_SOCK_FIN_SENT and
FIN_WAIT_1_SOCK_FIN abominations: we don't actually care about
which side started closing the connection to handle closing of
connection halves.

Split out the spliced implementation, as it has very little in
common with the "regular" TCP path.

Refactor things here and there to improve clarity. Add helpers
to trace where resets and flag settings come from.

No functional changes intended.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-03-28 17:11:40 +02:00
Stefano Brivio
d2e40bb8d9 conf, util, tap: Implement --trace option for extra verbose logging
--debug can be a bit too noisy, especially as single packets or
socket messages are logged: implement a new option, --trace,
implying --debug, that enables all debug messages.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-03-25 13:21:13 +01:00
Stefano Brivio
6d661dc5b2 seccomp: Adjust list of allowed syscalls for armv6l, armv7l
It looks like glibc commonly implements clock_gettime(2) with
clock_gettime64(), and uses recv() instead of recvfrom(), send()
instead of sendto(), and sigreturn() instead of rt_sigreturn() on
armv6l and armv7l.

Adjust the list of system calls for armv6l and armv7l accordingly.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-02-26 23:39:19 +01:00
Stefano Brivio
0515adceaa passt, pasta: Namespace-based sandboxing, defer seccomp policy application
To reach (at least) a conceptually equivalent security level as
implemented by --enable-sandbox in slirp4netns, we need to create a
new mount namespace and pivot_root() into a new (empty) mountpoint, so
that passt and pasta can't access any filesystem resource after
initialisation.

While at it, also detach IPC, PID (only for passt, to prevent
vulnerabilities based on the knowledge of a target PID), and UTS
namespaces.

With this approach, if we apply the seccomp filters right after the
configuration step, the number of allowed syscalls grows further. To
prevent this, defer the application of seccomp policies after the
initialisation phase, before the main loop, that's where we expect bad
things to happen, potentially. This way, we get back to 22 allowed
syscalls for passt and 34 for pasta, on x86_64.

While at it, move #syscalls notes to specific code paths wherever it
conceptually makes sense.

We have to open all the file handles we'll ever need before
sandboxing:

- the packet capture file can only be opened once, drop instance
  numbers from the default path and use the (pre-sandbox) PID instead

- /proc/net/tcp{,v6} and /proc/net/udp{,v6}, for automatic detection
  of bound ports in pasta mode, are now opened only once, before
  sandboxing, and their handles are stored in the execution context

- the UNIX domain socket for passt is also bound only once, before
  sandboxing: to reject clients after the first one, instead of
  closing the listening socket, keep it open, accept and immediately
  discard new connection if we already have a valid one

Clarify the (unchanged) behaviour for --netns-only in the man page.

To actually make passt and pasta processes run in a separate PID
namespace, we need to unshare(CLONE_NEWPID) before forking to
background (if configured to do so). Introduce a small daemon()
implementation, __daemon(), that additionally saves the PID file
before forking. While running in foreground, the process itself can't
move to a new PID namespace (a process can't change the notion of its
own PID): mention that in the man page.

For some reason, fork() in a detached PID namespace causes SIGTERM
and SIGQUIT to be ignored, even if the handler is still reported as
SIG_DFL: add a signal handler that just exits.

We can now drop most of the pasta_child_handler() implementation,
that took care of terminating all processes running in the same
namespace, if pasta started a shell: the shell itself is now the
init process in that namespace, and all children will terminate
once the init process exits.

Issuing 'echo $$' in a detached PID namespace won't return the
actual namespace PID as seen from the init namespace: adapt
demo and test setup scripts to reflect that.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-02-21 13:41:13 +01:00
Stefano Brivio
80283e6aea util: Avoid return of possibly truncated unsigned long in bitmap_isset()
Oops. If *word & BITMAP_BIT(bit) is bigger than an int (which is the
case for half of the possible bits of a bitmap on 64-bit archs), we'll
return that as an int, that is, zero, even if the bit at hand is set.

Just return zero or one there, no callers are interested in the actual
bitmap as return value.

Issue found as pasta wouldn't automatically detect some bound ports.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-02-01 02:36:16 +01:00
Stefano Brivio
292c185553 passt: Address new clang-tidy warnings from LLVM 13.0.1
clang-tidy from LLVM 13.0.1 reports some new warnings from these
checkers:

- altera-unroll-loops, altera-id-dependent-backward-branch: ignore
  for the moment being, add a TODO item

- bugprone-easily-swappable-parameters: ignore, nothing to do about
  those

- readability-function-cognitive-complexity: ignore for the moment
  being, add a TODO item

- altera-struct-pack-align: ignore, alignment is forced in protocol
  headers

- concurrency-mt-unsafe: ignore for the moment being, add a TODO
  item

Fix bugprone-implicit-widening-of-multiplication-result warnings,
though, that's doable and they seem to make sense.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-01-30 02:59:12 +01:00
Stefano Brivio
caa22aa644 tcp, udp, util: Fixes for bitmap handling on big-endian, casts
Bitmap manipulating functions would otherwise refer to inconsistent
sets of bits on big-endian architectures. While at it, fix up a
couple of casts.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-01-26 16:30:59 +01:00
Stefano Brivio
4c7304db85 conf, pasta: Explicitly pass CLONE_{NEWUSER,NEWNET} to setns()
Only allow the intended types of namespaces to be joined via setns()
as a defensive measure.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-01-26 16:30:59 +01:00
Stefano Brivio
b93c2c1713 passt: Drop <linux/ipv6.h> include, carry own ipv6hdr and opt_hdr definitions
This is the only remaining Linux-specific include -- drop it to avoid
clang-tidy warnings and to make code more portable.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-01-26 07:57:09 +01:00
Stefano Brivio
627e18fa8a passt: Add cppcheck target, test, and address resulting warnings
...mostly false positives, but a number of very relevant ones too,
in tcp_get_sndbuf(), tcp_conn_from_tap(), and siphash PREAMBLE().

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-21 09:41:13 +02:00
Stefano Brivio
dd942eaa48 passt: Fix build with gcc 7, use std=c99, enable some more Clang checkers
Unions and structs, you all have names now.

Take the chance to enable bugprone-reserved-identifier,
cert-dcl37-c, and cert-dcl51-cpp checkers in clang-tidy.

Provide a ffsl() weak declaration using gcc built-in.

Start reordering includes, but that's not enough for the
llvm-include-order checker yet.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-21 04:26:08 +02:00
Stefano Brivio
a20626fb35 util: Go to next non-empty line, skip newlines in line_read()
Otherwise, we'll stop returning lines at the first empty line
in a file -- this is not expected in case of e.g. /etc/resolv.conf.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-20 11:39:08 +02:00
Stefano Brivio
12cfa6444c passt: Add clang-tidy Makefile target and test, take care of warnings
Most are just about style and form, but a few were actually
serious mistakes (NDP-related).

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-20 08:34:22 +02:00
Stefano Brivio
2c7d1ce088 passt: Static builds: don't redefine __vsyslog(), skip getpwnam() and initgroups()
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-16 16:53:40 +02:00
Stefano Brivio
1fd0c9b0e1 util, pasta: Don't read() and lseek() every single line in read_line()
...periodically checking bound ports becomes quite expensive
otherwise.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-16 00:49:33 +02:00
Stefano Brivio
a56721b61c util: Don't duplicate debug messages, they're already on stderr
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-15 20:46:17 +02:00
Stefano Brivio
32d07f5e59 passt, pasta: Completely avoid dynamic memory allocation
Replace libc functions that might dynamically allocate memory with own
implementations or wrappers.

Drop brk(2) from list of allowed syscalls in seccomp profile.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:16:03 +02:00
Stefano Brivio
66d5930ec7 passt, pasta: Add seccomp support
List of allowed syscalls comes from comments in the form:
	#syscalls <list>

for syscalls needed both in passt and pasta mode, and:
	#syscalls:pasta <list>
	#syscalls:passt <list>

for syscalls specifically needed in pasta or passt mode only.

seccomp.sh builds a list of BPF statements from those comments,
prefixed by a binary search tree to keep lookup fast.

While at it, clean up a bit the Makefile using wildcards.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:15:46 +02:00
Stefano Brivio
44ca4bcf3e util: Fix comment to bitmap_clear()
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:15:12 +02:00
Stefano Brivio
675174d4ba conf, tap: Split netlink and pasta functions, allow interface configuration
Move netlink routines to their own file, and use netlink to configure
or fetch all the information we need, except for the TUNSETIFF ioctl.

Move pasta-specific functions to their own file as well, add
parameters and calls to configure the tap interface in the namespace.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-14 13:15:12 +02:00
Giuseppe Scrivano
9a175cc2ce pasta: Allow specifying paths and names of namespaces
Based on a patch from Giuseppe Scrivano, this adds the ability to:

- specify paths and names of target namespaces to join, instead of
  a PID, also for user namespaces, with --userns

- request to join or create a network namespace only, without
  entering or creating a user namespace, with --netns-only

- specify the base directory for netns mountpoints, with --nsrun-dir

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
[sbrivio: reworked logic to actually join the given namespaces when
 they're not created, implemented --netns-only and --nsrun-dir,
 updated pasta demo script and man page]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-07 04:05:15 +02:00
Stefano Brivio
d4d61480b6 tcp, tap: Turn tcp_probe_mem() into sock_probe_mem(), use for AF_UNIX socket too
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-05 20:02:03 +02:00
Stefano Brivio
9657b6ed05 conf, tcp: Periodic detection of bound ports for pasta port forwarding
Detecting bound ports at start-up time isn't terribly useful: do this
periodically instead, if configured.

This is only implemented for TCP at the moment, UDP is somewhat more
complicated: leave a TODO there.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-09-27 11:23:44 +02:00
Stefano Brivio
e69e13671d util: Fix parsing of next option in ipv6_l4hdr()
We need to update next header and header length as soon as we meet
a new option header.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-09-27 01:28:09 +02:00
Stefano Brivio
1e49d194d0 passt, pasta: Introduce command-line options and port re-mapping
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-09-01 17:00:27 +02:00
Stefano Brivio
ce24fe0b3f util: Don't close ping sockets if bind() fails
...they're still usable, thanks to the workaround implemented in
icmp_tap_handler().

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-08-04 01:44:58 +02:00
Stefano Brivio
a340e5336d util: Fix millisecond logging timestamp calculation
Four sub-second digits means 0.1ms units: divide nanoseconds by
10^5, not 10^6.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-08-04 01:39:00 +02:00
Stefano Brivio
86b273150a tcp, udp: Allow binding ports in init namespace to both tap and loopback
Traffic with loopback source address will be forwarded to the direct
loopback connection in the namespace, and the tap interface is used
for the rest.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-07-26 14:10:29 +02:00
Stefano Brivio
17765f8de0 checksum: Introduce AVX2 implementation, unify helpers
Provide an AVX2-based function using compiler intrinsics for
TCP/IP-style checksums. The load/unpack/add idea and implementation
is largely based on code from BESS (the Berkeley Extensible Software
Switch) licensed as 3-Clause BSD, with a number of modifications to
further decrease pipeline stalls and to minimise cache pollution.

This speeds up considerably data paths from sockets to tap
interfaces, decreasing overhead for checksum computation, with
16-64KiB packet buffers, from approximately 11% to 7%. The rest is
just syscalls at this point.

While at it, provide convenience targets in the Makefile for avx2,
avx2_debug, and debug targets -- these simply add target-specific
CFLAGS to the build.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-07-26 07:18:50 +02:00
Stefano Brivio
64a0ba3b27 udp: Introduce recvmmsg()/sendmmsg(), zero-copy path from socket
Packets are received directly onto pre-cooked, static buffers
for IPv4 (with partial checksum pre-calculation) and IPv6 frames,
with pre-filled Ethernet addresses and, partially, IP headers,
and sent out from the same buffers with sendmmsg(), for both
passt and pasta (non-local traffic only) modes.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-07-21 12:01:04 +02:00
Stefano Brivio
33482d5bf2 passt: Add PASTA mode, major rework
PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host
connectivity to an otherwise disconnected, unprivileged network
and user namespace, similarly to slirp4netns. Given that the
implementation is largely overlapping with PASST, no separate binary
is built: 'pasta' (and 'passt4netns' for clarity) both link to
'passt', and the mode of operation is selected depending on how the
binary is invoked. Usage example:

	$ unshare -rUn
	# echo $$
	1871759

	$ ./pasta 1871759	# From another terminal

	# udhcpc -i pasta0 2>/dev/null
	# ping -c1 pasta.pizza
	PING pasta.pizza (64.190.62.111) 56(84) bytes of data.
	64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms

	--- pasta.pizza ping statistics ---
	1 packets transmitted, 1 received, 0% packet loss, time 0ms
	rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms
	# ping -c1 spaghetti.pizza
	PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes
	64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms

	--- spaghetti.pizza ping statistics ---
	1 packets transmitted, 1 received, 0% packet loss, time 0ms
	rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms

This entails a major rework, especially with regard to the storage of
tracked connections and to the semantics of epoll(7) references.

Indexing TCP and UDP bindings merely by socket proved to be
inflexible and unsuitable to handle different connection flows: pasta
also provides Layer-2 to Layer-2 socket mapping between init and a
separate namespace for local connections, using a pair of splice()
system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local
bindings. For instance, building on the previous example:

	# ip link set dev lo up
	# iperf3 -s

	$ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4
	[SUM]   0.00-10.00  sec  52.3 GBytes  44.9 Gbits/sec  283             sender
	[SUM]   0.00-10.43  sec  52.3 GBytes  43.1 Gbits/sec                  receiver

	iperf Done.

epoll(7) references now include a generic part in order to
demultiplex data to the relevant protocol handler, using 24
bits for the socket number, and an opaque portion reserved for
usage by the single protocol handlers, in order to track sockets
back to corresponding connections and bindings.

A number of fixes pertaining to TCP state machine and congestion
window handling are also included here.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-07-17 11:04:22 +02:00