1
0
mirror of https://passt.top/passt synced 2025-01-06 21:05:21 +00:00
Commit Graph

167 Commits

Author SHA1 Message Date
David Gibson
6e1e44293e ndp: Send unsolicited Router Advertisements
Currently, our NDP implementation only sends Router Advertisements (RA)
when it receives a Router Solicitation (RS) from the guest.  However,
RFC 4861 requires that we periodically send unsolicited RAs.

Linux as a guest also requires this: it will send an RS when a link first
comes up, but the route it gets from this will have a finite lifetime (we
set this to 65535s, the maximum allowed, around 18 hours).  When that
expires the guest will not send a new RS, but instead expects the route to
have been renewed (if still valid) by an unsolicited RA.

Implement sending unsolicited RAs on a partially randomised timer, as
required by RFC 4861.  The RFC also specifies that solicited RAs should
also be delayed, or even omitted, if the next unsolicited RA is soon
enough.  For now we don't do that, always sending an immediate RA in
response to an RS.  We can get away with this because in our use cases
we expect to just have passt itself and the guest on the link, rather than
a large broadcast domain.

Link: https://github.com/kubevirt/kubevirt/issues/13191
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-14 19:00:40 +01:00
David Gibson
b39760cc7d passt: Seed libc's pseudo random number generator
We have an upcoming case where we need pseudo-random numbers to scatter
timings, but we don't need cryptographically strong random numbers.  libc's
built in random() is fine for this purpose, but we should seed it.  Extend
secret_init() - the only current user of random numbers - to do this as
well as generating the SipHash secret.  Using /dev/random for a PRNG seed
is probably overkill, but it's simple and we only do it once, so we might
as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-14 19:00:38 +01:00
David Gibson
71d5deed5e util: Add general low-level random bytes helper
Currently secret_init() open codes getting good quality random bytes from
the OS, either via getrandom(2) or reading /dev/random.  We're going to
add at least one more place that needs random data in future, so make a
general helper for getting random bytes.  While we're there, fix a number
of minor bugs:
 - getrandom() can theoretically return a "short read", so handle that case
 - getrandom() as well as read can return a transient EINTR
 - We would attempt to read data from /dev/random if we failed to open it
   (open() returns -1), but not if we opened it as fd 0 (unlikely, but ok)
 - More specific error reporting

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-14 19:00:36 +01:00
Stefano Brivio
71869e2912 passt: Use NOLINT clang-tidy block instead of NOLINTNEXTLINE
For some reason, this is only reported by clang-tidy 19.1.2 on
Alpine:

/home/sbrivio/passt/passt.c:314:53: error: conditional operator with identical true and false expressions [bugprone-branch-clone,-warnings-as-errors]
  314 |         nfds = epoll_wait(c.epollfd, events, EPOLL_EVENTS, TIMER_INTERVAL);
      |                                                            ^

We do have a suppression, but not on the line preceding it, because
we also need a cppcheck suppression there. Use NOLINTBEGIN/NOLINTEND
for the clang-tidy suppression.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-08 08:24:52 +01:00
Stefano Brivio
099ace64ce treewide: Address cert-err33-c clang-tidy warnings for clock and timer functions
For clock_gettime(), we shouldn't ignore errors if they happen at
initialisation phase, because something is seriously wrong and it's
not helpful if we proceed as if nothing happened.

As we're up and running, though, it's probably better to report the
error and use a stale value than to terminate altogether. Make sure
we use a zero value if we don't have a stale one somewhere.

For timerfd_gettime() and timerfd_settime() failures, just report an
error, there isn't much else we can do.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-30 12:37:31 +01:00
Stefano Brivio
2aea1da143 treewide: Allow additional system calls for i386/i686
I haven't tested i386 for a long time (after playing with some
openSUSE i586 image a couple of years ago). It turns out that a number
of system calls we actually need were denied by the seccomp filter,
and not even basic functionality works.

Add some system calls that glibc started using with the 64-bit time
("t64") transition, see also:

  https://wiki.debian.org/ReleaseGoals/64bit-time

that is: clock_gettime64, timerfd_gettime64, fcntl64, and
recvmmsg_time64.

Add further system calls that are needed regardless of time_t width,
that is, mmap2 (valgrind profile only), _llseek and sigreturn (common
outside x86_64), and socketcall (same as s390x).

I validated this against an almost full run of the test suite, with
just a few selected tests skipped. Fixes needed to run most tests on
i386/i686, and other assorted fixes for tests, are included in
upcoming patches.

Reported-by: Uroš Knupleš <uros@knuples.net>
Analysed-by: Faidon Liambotis <paravoid@debian.org>
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078981
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-21 12:00:43 +02:00
David Gibson
905ecd2b0b treewide: Rename MAC address fields for clarity
c->mac isn't a great name, because it doesn't say whose mac address it is
and it's not necessarily obvious in all the contexts we use it.  Since this
is specifically the address that we (passt/pasta) use on the tap interface,
rename it to "our_tap_mac".  Rename the "mac_guest" field to "guest_mac"
to be grammatically consistent.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 11:59:54 +02:00
Stefano Brivio
09603cab28 passt, util: Close any open file that the parent might have leaked
If a parent accidentally or due to implementation reasons leaks any
open file, we don't want to have access to them, except for the file
passed via --fd, if any.

This is the case for Podman when Podman's parent leaks files into
Podman: it's not practical for Podman to close unrelated files before
starting pasta, as reported by Paul.

Use close_range(2) to close all open files except for standard streams
and the one from --fd.

Given that parts of conf() depend on other files to be already opened,
such as the epoll file descriptor, we can't easily defer this to a
more convenient point, where --fd was already parsed. Introduce a
minimal, duplicate version of --fd parsing to keep this simple.

As we need to check that the passed --fd option doesn't exceed
INT_MAX, because we'll parse it with strtol() but file descriptor
indices are signed ints (regardless of the arguments close_range()
take), extend the existing check in the actual --fd parsing in conf(),
also rejecting file descriptors numbers that match standard streams,
while at it.

Suggested-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Paul Holzinger <pholzing@redhat.com>
2024-08-08 21:31:25 +02:00
Stefano Brivio
ee36266a55 log, passt: Keep printing to stderr when passt is running in foreground
There are two cases where we want to stop printing to stderr: if it's
closed, and if pasta spawned a shell (and --debug wasn't given).

But if passt is running in foreground, we currently stop to report any
message, even error messages, once we're ready, as reported by
Laurent, because we set the log_runtime flag, which we use to indicate
we're ready, regardless of whether we're running in foreground or not.

Turn that flag (back) to log_stderr, and set it only when we really
want to stop printing to stderr.

Reported-by: Laurent Vivier <lvivier@redhat.com>
Fixes: afd9cdc9bb ("log, passt: Always print to stderr before initialisation is complete")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-06 15:03:48 +02:00
Stefano Brivio
77c092ee5e log: Fetch log times with CLOCK_MONOTONIC, not CLOCK_REALTIME
We report relative timestamps in logs, so we want to avoid jumps in
the system time.

Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-26 13:47:36 +02:00
Stefano Brivio
e5c37ba0f4 log: Initialise timestamp for relative log time also if we use a log file
...not just for debug messages. Otherwise, timestamps in the log file
are consistent but the starting point is not zero.

Do this right away as we enter main(), so that the resulting
timestamps are as closely as possible relative to when we start.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-26 13:47:34 +02:00
David Gibson
882599e180 udp: Rename UDP listening sockets
EPOLL_TYPE_UDP is now only used for "listening" sockets; long lived
sockets which can initiate new flows.  Rename to EPOLL_TYPE_UDP_LISTEN
and associated functions to match.  Along with that, remove the .orig
field from union udp_listen_epoll_ref, since it is now always true.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:34:01 +02:00
David Gibson
e0647ad80c udp: Handle "spliced" datagrams with per-flow sockets
When forwarding a datagram to a socket, we need to find a socket with a
suitable local address to send it.  Currently we keep track of such sockets
in an array indexed by local port, but this can't properly handle cases
where we have multiple local addresses in active use.

For "spliced" (socket to socket) cases, improve this by instead opening
a socket specifically for the target side of the flow.  We connect() as
well as bind()ing that socket, so that it will only receive the flow's
reply packets, not anything else.  We direct datagrams sent via that socket
using the addresses from the flow table, effectively replacing bespoke
addressing logic with the unified logic in fwd.c

When we create the flow, we also take a duplicate of the originating
socket, and use that to deliver reply datagrams back to the origin, again
using addresses from the flow table entry.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:42 +02:00
Stefano Brivio
e7323e515a conf, passt: Don't call __openlog() if a log file is used
If a log file is configured, we would otherwise open a connection to
the system logger (if any), print any message that we might have
before we initialise the log file, and then keep that connection
around for no particular reason.

Call __openlog() as an alternative to the log file setup, instead.

This way, we might skip printing some messages during the
initialisation phase, but they're probably not really valuable to
have in a system log, and we're going to print them to standard
error anyway.

Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-21 15:32:46 +02:00
Stefano Brivio
dba7f0f5ce treewide: Replace strerror() calls
Now that we have logging functions embedding perror() functionality,
we can make _some_ calls more terse by using them. In many places,
the strerror() calls are still more convenient because, for example,
they are used in flow debugging functions, or because the return code
variable of interest is not 'errno'.

While at it, convert a few error messages from a scant perror style
to proper failure descriptions.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-21 15:32:44 +02:00
Stefano Brivio
92a22fef93 treewide: Replace perror() calls with calls to logging functions
perror() prints directly to standard error, but in many cases standard
error might be already closed, or we might want to skip logging, based
on configuration. Our logging functions provide all that.

While at it, make errors more descriptive, replacing some of the
existing basic perror-style messages.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-21 15:32:43 +02:00
Stefano Brivio
afd9cdc9bb log, passt: Always print to stderr before initialisation is complete
After commit 15001b39ef ("conf: set the log level much earlier"), we
had a phase during initialisation when messages wouldn't be printed to
standard error anymore.

Commit f67238aa86 ("passt, log: Call __openlog() earlier, log to
stderr until we detach") fixed that, but only for the case where no
log files are given.

If a log file is configured, vlogmsg() will not call passt_vsyslog(),
but during initialisation, LOG_PERROR is set, so to avoid duplicated
prints (which would result from passt_vsyslog() printing to stderr),
we don't call fprintf() from vlogmsg() either.

This is getting a bit too complicated. Instead of abusing LOG_PERROR,
define an internal logging flag that clearly represents that we're not
done with the initialisation phase yet.

If this flag is not set, make sure we always print to stderr, if the
log mask matches.

Reported-by: Yalan Zhang <yalzhang@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-21 15:32:34 +02:00
Stefano Brivio
bca0fefa32 conf, passt: Make --stderr do nothing, and deprecate it
The original behaviour of printing messages to standard error by
default when running from a non-interactive terminal was introduced
because the first KubeVirt integration draft used to start passt in
foreground and get messages via standard error.

For development purposes, the system logger was more convenient at
that point, and passt was running from interactive terminals only if
not started by the KubeVirt integration.

This behaviour was introduced by 84a62b79a2 ("passt: Also log to
stderr, don't fork to background if not interactive").

Later, I added command-line options in 1e49d194d0 ("passt, pasta:
Introduce command-line options and port re-mapping") and accidentally
reversed this condition, which wasn't a problem as --stderr could
force printing to standard error anyway (and it was used by KubeVirt).

Nowadays, the KubeVirt integration uses a log file (requested via
libvirt configuration), and the same applies for Podman if one
actually needs to look at runtime logs. There are no use cases left,
as far as I know, where passt runs in foreground in non-interactive
terminals.

Seize the chance to reintroduce some sanity here. If we fork to
background, standard error is closed, so --stderr is useless in that
case.

If we run in foreground, there's no harm in printing messages to
standard error, and that accidentally became the default behaviour
anyway, so --stderr is not needed in that case.

It would be needed for non-interactive terminals, but there are no
use cases, and if there were, let's log to standard error anyway:
the user can always redirect standard error to /dev/null if needed.

Before we're up and running, we need to print to standard error anyway
if something happens, otherwise we can't report failure to start in
any kind of usage, stand-alone or in integrations.

So, make --stderr do nothing, and deprecate it.

While at it, drop a left-over comment about --foreground being the
default only for interactive terminals, because it's not the case
anymore.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-21 15:32:28 +02:00
Stefano Brivio
b74801645c conf, passt: Don't try to log to stderr after we close it
If we don't run in foreground, we close standard error as we
daemonise, so it makes no sense to check if the controlling terminal
is an interactive terminal or if --force-stderr was given, to decide
if we want to log to standard error.

Make --force-stderr depend on --foreground.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-21 15:32:15 +02:00
Laurent Vivier
0c335d751a vhost-user: compare mode MODE_PASTA and not MODE_PASST
As we are going to introduce the MODE_VU that will act like
the mode MODE_PASST, compare to MODE_PASTA rather than to add
a comparison to MODE_VU when we check for MODE_PASST.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-13 15:45:38 +02:00
Laurent Vivier
377b666dc9 udp: rename udp_sock_handler() to udp_buf_sock_handler()
We are going to introduce a variant of the function to use
vhost-user buffers rather than passt internal buffers.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-13 15:45:34 +02:00
David Gibson
7094b91d10 Remove pointless macro parameters in CALL_PROTO_HANDLER
The 'c' parameter is always passed exactly 'c'.  The 'now' parameter is
always passed exactly 'now'.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-07 20:44:44 +02:00
Stefano Brivio
c9b2413465 conf, passt, tap: Open socket and PID files before switching UID/GID
Otherwise, if the user runs us as root, and gives us paths that are
only accessible by root, we'll fail to open them, which might in turn
encourage users to change permissions or ownerships: definitely a bad
idea in terms of security.

Reported-by: Minxi Hou <mhou@redhat.com>
Reported-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Richard W.M. Jones <rjones@redhat.com>
2024-05-23 16:43:26 +02:00
Stefano Brivio
ba23b05545 passt, util: Move opening of PID file to its own function
We won't call it from main() any longer: move it.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
2024-05-23 16:43:13 +02:00
Stefano Brivio
57d8aa8ffe util: Rename write_pidfile() to pidfile_write()
As I'm adding pidfile_open() in the next patch. The function comment
didn't match, by the way.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
2024-05-23 16:43:05 +02:00
Stefano Brivio
fcfb592adc passt, tap: Don't use -1 as uninitialised value for fd_tap_listen
This is a remnant from the time we kept access to the original
filesystem and we could reinitialise the listening AF_UNIX socket.

Since commit 0515adceaa ("passt, pasta: Namespace-based sandboxing,
defer seccomp policy application"), however, we can't re-bind the
listening socket once we're up and running.

Drop the -1 initalisation and the corresponding check.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-05-23 16:42:27 +02:00
lemmi
26c71db332 passt.c: explicitly include libgen.h for basename
fixes implicit declaration warning on musl

Signed-off-by: lemmi <lemmi@nerd2nerd.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-13 23:00:52 +02:00
Stefano Brivio
5d5208b67d treewide: Compilers' name for armv6l and armv7l is "arm"
When I switched from 'uname -m' to 'gcc -dumpmachine' to fetch the
architecture name for, among others, seccomp.sh, I didn't realise
that "armv6l" and "armv7l" are just Linux kernel names -- compilers
just call that "arm".

Fix the "syscalls" annotation we use to define seccomp profiles
accordingly, otherwise pasta will be terminated on sigreturn() on
armv6l and armv7l.

Fixes: 213c397492 ("passt, pasta: Run-time selection of AVX2 build")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-11 17:34:04 +02:00
Stefano Brivio
f67238aa86 passt, log: Call __openlog() earlier, log to stderr until we detach
Paul reports that, with commit 15001b39ef ("conf: set the log level
much earlier"), early messages aren't reported to standard error
anymore.

The reason is that, once the log mask is changed from LOG_EARLY, we
don't force logging to stderr, and this mechanism was abused to have
early errors on stderr. Now that we drop LOG_EARLY earlier on, this
doesn't work anymore.

Call __openlog() as soon as we know the mode we're running as, using
LOG_PERROR. Then, once we detach, if we're not running from an
interactive terminal and logging to standard error is not forced,
drop LOG_PERROR from the options.

While at it, check if the standard error descriptor refers to a
terminal, instead of checking standard output: if the user redirects
standard output to /dev/null, they might still want to see messages
from standard error.

Further, make sure we don't print messages to standard error reporting
that we couldn't log to the system logger, if we didn't open a
connection yet. That's expected.

Reported-by: Paul Holzinger <pholzing@redhat.com>
Fixes: 15001b39ef ("conf: set the log level much earlier")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-03-14 08:19:36 +01:00
David Gibson
4779dfe12f icmp: Use 'flowside' epoll references for ping sockets
Currently ping sockets use a custom epoll reference type which includes
the ICMP id.  However, now that we have entries in the flow table for
ping flows, finding that is sufficient to get everything else we want,
including the id.  Therefore remove the icmp_epoll_ref type and use the
general 'flowside' field for ping sockets.

Having done this we no longer need separate EPOLL_TYPE_ICMP and
EPOLL_TYPE_ICMPV6 reference types, because we can easily determine
which case we have from the flow type. Merge both types into
EPOLL_TYPE_PING.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-12 01:49:05 +01:00
David Gibson
3af5e9fdba icmp: Store ping socket information in flow table
Currently icmp_id_map[][] stores information about ping sockets in a
bespoke structure.  Move the same information into new types of flow
in the flow table.  To match that change, replace the existing ICMP
timer with a flow-based timer for expiring ping sockets.  This has the
advantage that we only need to scan the active flows, not all possible
ids.

We convert icmp_id_map[][] to point to the flow table entries, rather
than containing its own information.  We do still use that array for
locating the right ping flows, rather than using a "flow native" form
of lookup for the time being.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Update id_sock description in comment to icmp_ping_new()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-12 01:34:45 +01:00
Paul Holzinger
15001b39ef conf: set the log level much earlier
--quiet is supposed to silence the "No routable interface" message but
it does not work because the log level was set long after conf_ip4/6()
was called which means it uses the default level which logs everything.

To address this move the log level logic directly after the option
parsing in conf().

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-27 14:08:33 +01:00
Paul Holzinger
b08716551a passt: make --quiet set the log level to warning
Based on the man page and help output --quiet hides informational
messages. This means that warnings should still be logged. This was
discussed in[1].

[1] https://archives.passt.top/passt-dev/20240216114304.7234a83f@elisabeth/T/#m42652824644973674e84baf9e0bf1d0e88104450

Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-27 14:08:00 +01:00
Stefano Brivio
8f3f8e190c pasta: Add fallback timer mechanism to check if namespace is gone
We don't know how frequently this happens, but hitting
fs.inotify.max_user_watches or similar sysctl limits is definitely
not out of question, and Paul mentioned that, for example, Podman's
CI environments hit similar issues in the past.

Introduce a fallback mechanism based on a timer file descriptor: we
grab the directory handle at startup, and we can then use openat(),
triggered periodically, to check if the (network) namespace directory
still exists. If openat() fails at some point, exit.

Link: https://github.com/containers/podman/pull/21563#issuecomment-1943505707
Reported-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-16 08:47:14 +01:00
Stefano Brivio
7ee4e17267 log: setlogmask(0) can actually result in a system call, don't use it
Before commit 32d07f5e59 ("passt, pasta: Completely avoid dynamic
memory allocation"), we didn't store the current log mask in a
variable, and we fetched it using setlogmask(0) wherever needed.

But after that commit, we can use our log_mask copy instead. And we
should: with recent glibc versions, setlogmask(0) actually results in
a system call, which causes a substantial overhead with high transfer
rates: we use setlogmask(0) even to decide we don't want to print
debug messages.

Now that we rely on log_mask in early stages, before setlogmask() is
called, we need to initialise that variable to the special LOG_EMERG
mask value right away: define LOG_EARLY to make this clearer, and,
while at it, group conditions in vlogmsg() into something more terse.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-02-14 01:10:11 +01:00
David Gibson
a325121759 icmp: Consolidate icmp_sock_handler() with icmpv6_sock_handler()
Currently we have separate handlers for ICMP and ICMPv6 ping replies.
Although there are a number of points of difference, with some creative
refactoring we can combine these together sensibly.  Although it doesn't
save a vast amount of code, it does make it clearer that we're performing
basically the same steps for each case.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:36:51 +01:00
David Gibson
8981a720aa flow: Avoid moving flow entries to compact table
Currently we always keep the flow table maximally compact: that is all the
active entries are contiguous at the start of the table.  Doing this
sometimes requires moving an entry when one is freed.  That's kind of
fiddly, and potentially expensive: it requires updating the hash table for
the new location, and depending on flow type, it may require EPOLL_CTL_MOD,
system calls to update epoll tags with the new location too.

Implement a new way of managing the flow table that doesn't ever move
entries.  It attempts to maintain some compactness by always using the
first free slot for a new connection, and mitigates the effect of non
compactness by cheaply skipping over contiguous blocks of free entries.
See the "theory of operation" comment in flow.c for details.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>b
[sbrivio: additional ASSERT(flow_first_free <= FLOW_MAX - 2) to avoid
 Coverity Scan false positive]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:35:37 +01:00
David Gibson
02e092b4fe tcp, tcp_splice: Avoid double layered dispatch for connected TCP sockets
Currently connected TCP sockets have the same epoll type, whether they're
for a "tap" connection or a spliced connection.  This means that
tcp_sock_handler() has to do a secondary check on the type of the
connection to call the right function.  We can avoid this by adding a new
epoll type and dispatching directly to the right thing.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:35:25 +01:00
David Gibson
70121ca1ec epoll: Better handling of number of epoll types
As we already did for flow types, use an "EPOLL_NUM_TYPES" isntead of
EPOLL_TYPE_MAX, which is a little bit safer and clearer.  Add a static
assert on the size of the matching names array.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:35:22 +01:00
David Gibson
36dfa8b8fb flow, tcp: Add handling for per-flow timers
tcp_timer() scans the flow table so that it can run tcp_splice_timer() on
each spliced connection.  More generally, other flow types might want to
run similar timers in future.

We could add a flow_timer() analagous to tcp_timer(), udp_timer() etc.
However, this would need to scan the flow table, which we would have just
done in flow_defer_handler().  We'd prefer to just scan the flow table
once, dispatching both per-flow deferred events and per-flow timed events
if necessary.

So, extend flow_defer_handler() to do this.  For now we use the same timer
interval for all flow types (1s).  We can make that more flexible in future
if we need to.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:35:19 +01:00
David Gibson
b43e4483ed flow, tcp: Add flow-centric dispatch for deferred flow handling
tcp_defer_handler(), amongst other things, scans the flow table and does
some processing for each TCP connection.  When we add other protocols to
the flow table, they're likely to want some similar scanning.  It makes
more sense for cache friendliness to perform a single scan of the flow
table and dispatch to the protocol specific handlers, rather than having
each protocol separately scan the table.

To that end, add a new flow_defer_handler() handling all flow-linked
deferred operations.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:35:17 +01:00
David Gibson
e21b6d69b1 tcp: "TCP" hash secret doesn't need to be TCP specific
The TCP state structure includes a 128-bit hash_secret which we use for
SipHash calculations to mitigate attacks on the TCP hash table and initial
sequence number.

We have plans to use SipHash in places that aren't TCP related, and there's
no particular reason they'd need their own secret.  So move the hash_secret
to the general context structure.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-04 09:51:32 +01:00
David Gibson
955dd3251c tcp, udp: Don't pre-fill IPv4 destination address in headers
Because packets sent on the tap interface will always be going to the
guest/namespace, we more-or-less know what address they'll be going to.  So
we pre-fill this destination address in our header buffers for IPv4.  We
can't do the same for IPv6 because we could need either the global or
link-local address for the guest.  In future we're going to want more
flexibility for the destination address, so this pre-filling will get in
the way.

Change the flow so we always fill in the IPv4 destination address for each
packet, rather than prefilling it from proto_update_l2_buf().  In fact for
TCP we already redundantly filled the destination for each packet anyway.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-22 12:15:33 +02:00
David Gibson
ae5f6c8e1b epoll: Use different epoll types for passt and pasta tap fds
Currently we have a single epoll event type for the "tap" fd, which could
be either a handle on a /dev/net/tun device (pasta) or a connected Unix
socket (passt).  However for the two modes we call different handler
functions.  Simplify this a little by using different epoll types and
dispatching directly to the correct handler function.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-13 17:30:20 +02:00
David Gibson
eda4f1997e epoll: Split listening Unix domain socket into its own type
tap_handler() actually handles events on three different types of object:
the /dev/tap character device (pasta), a connected Unix domain socket
(passt) or a listening Unix domain socket (passt).

The last, in particular, really has no handling in common with the others,
so split it into its own epoll type and directly dispatch to the relevant
handler from the top level.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-13 17:30:17 +02:00
David Gibson
485b5fb8f9 epoll: Split handling of listening TCP sockets into their own handler
tcp_sock_handler() handles both listening TCP sockets, and connected TCP
sockets, but what it needs to do in those cases has essentially nothing in
common.  Therefore, give listening sockets their own epoll_type value and
dispatch directly to their own handler from the top level.  Furthermore,
the two handlers need essentially entirely different information from the
reference: we re-(ab)used the index field in the tcp_epoll_ref to indicate
the port for the listening socket, but that's not the same meaning.  So,
switch listening sockets to their own reference type which we can lay out
as we please.  That lets us remove the listen and outbound fields from the
normal (connected) tcp_epoll_ref, reducing it to just the connection table
index.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-13 17:30:15 +02:00
David Gibson
e6f81e5578 epoll: Split handling of TCP timerfds into its own handler function
tcp_sock_handler() actually handles several different types of fd events.
This includes timerfds that aren't sockets at all.  The handling of these
has essentially nothing in common with the other cases.  So, give the
TCP timers there own epoll_type value and dispatch directly to their
handler.  This also means we can remove the timer field from tcp_epoll_ref,
the information it encoded is now implicit in the epoll_type value.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-13 17:30:13 +02:00
David Gibson
8271a2ed57 epoll: Tiny cleanup to udp_sock_handler()
Move the test for c->no_udp into the function itself, rather than in the
dispatching switch statement to better localize the UDP specific logic, and
make for greated consistency with other handler functions.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-13 17:30:10 +02:00
David Gibson
05f606ab0b epoll: Split handling of ICMP and ICMPv6 sockets
We have different epoll type values for ICMP and ICMPv6 sockets, but they
both call the same handler function, icmp_sock_handler().  However that
function does essentially nothing in common for the two cases.  So, split
it into icmp_sock_handler() and icmpv6_sock_handler() and dispatch them
separately from the top level.

While we're there remove some parameters that the function was never using
anyway.  Also move the test for c->no_icmp into the functions, so that all
the logic specific to ICMP is within the handler, rather than in the top
level dispatch code.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-13 17:30:07 +02:00
David Gibson
d850caab66 epoll: Fold sock_handler into general switch on epoll event fd
When we process events from epoll_wait(), we check for a number of special
cases before calling sock_handler() which then dispatches based on the
protocol type of the socket in the event.

Fold these cases together into a single switch on the fd type recorded in
the epoll data field.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-13 17:29:56 +02:00