1
0
mirror of https://passt.top/passt synced 2025-01-12 07:32:58 +00:00

120 Commits

Author SHA1 Message Date
Laurent Vivier
7d1cd4dbf5 vhost-user: introduce virtio API
Add virtio.c and virtio.h that define the functions needed
to manage virtqueues.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-27 16:11:36 +01:00
David Gibson
71d5deed5e util: Add general low-level random bytes helper
Currently secret_init() open codes getting good quality random bytes from
the OS, either via getrandom(2) or reading /dev/random.  We're going to
add at least one more place that needs random data in future, so make a
general helper for getting random bytes.  While we're there, fix a number
of minor bugs:
 - getrandom() can theoretically return a "short read", so handle that case
 - getrandom() as well as read can return a transient EINTR
 - We would attempt to read data from /dev/random if we failed to open it
   (open() returns -1), but not if we opened it as fd 0 (unlikely, but ok)
 - More specific error reporting

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-14 19:00:36 +01:00
David Gibson
d64f257243 linux_dep: Move close_range() conditional handling to linux_dep.h
util.h has some #ifdefs and weak definitions to handle compatibility with
various kernel versions.  Move this to linux_dep.h which handles several
other similar cases.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-08 08:26:15 +01:00
Stefano Brivio
d4f09c9b96 util: Define small and big thresholds for socket buffers as unsigned long long
On 32-bit architectures, clang-tidy reports:

/home/pi/passt/tcp.c:728:11: error: performing an implicit widening conversion to type 'uint64_t' (aka 'unsigned long long') of a multiplication performed in type 'unsigned long' [bugprone-implicit-widening-of-multiplication-result,-warnings-as-errors]
  728 |         if (v >= SNDBUF_BIG)
      |                  ^
/home/pi/passt/util.h:158:22: note: expanded from macro 'SNDBUF_BIG'
  158 | #define SNDBUF_BIG              (4UL * 1024 * 1024)
      |                                  ^
/home/pi/passt/tcp.c:728:11: note: make conversion explicit to silence this warning
  728 |         if (v >= SNDBUF_BIG)
      |                  ^
/home/pi/passt/util.h:158:22: note: expanded from macro 'SNDBUF_BIG'
  158 | #define SNDBUF_BIG              (4UL * 1024 * 1024)
      |                                  ^~~~~~~~~~~~~~~~~
/home/pi/passt/tcp.c:728:11: note: perform multiplication in a wider type
  728 |         if (v >= SNDBUF_BIG)
      |                  ^
/home/pi/passt/util.h:158:22: note: expanded from macro 'SNDBUF_BIG'
  158 | #define SNDBUF_BIG              (4UL * 1024 * 1024)
      |                                  ^~~~~~~~~~
/home/pi/passt/tcp.c:730:15: error: performing an implicit widening conversion to type 'uint64_t' (aka 'unsigned long long') of a multiplication performed in type 'unsigned long' [bugprone-implicit-widening-of-multiplication-result,-warnings-as-errors]
  730 |         else if (v > SNDBUF_SMALL)
      |                      ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
  159 | #define SNDBUF_SMALL            (128UL * 1024)
      |                                  ^
/home/pi/passt/tcp.c:730:15: note: make conversion explicit to silence this warning
  730 |         else if (v > SNDBUF_SMALL)
      |                      ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
  159 | #define SNDBUF_SMALL            (128UL * 1024)
      |                                  ^~~~~~~~~~~~
/home/pi/passt/tcp.c:730:15: note: perform multiplication in a wider type
  730 |         else if (v > SNDBUF_SMALL)
      |                      ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
  159 | #define SNDBUF_SMALL            (128UL * 1024)
      |                                  ^~~~~
/home/pi/passt/tcp.c:731:17: error: performing an implicit widening conversion to type 'uint64_t' (aka 'unsigned long long') of a multiplication performed in type 'unsigned long' [bugprone-implicit-widening-of-multiplication-result,-warnings-as-errors]
  731 |                 v -= v * (v - SNDBUF_SMALL) / (SNDBUF_BIG - SNDBUF_SMALL) / 2;
      |                               ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
  159 | #define SNDBUF_SMALL            (128UL * 1024)
      |                                  ^
/home/pi/passt/tcp.c:731:17: note: make conversion explicit to silence this warning
  731 |                 v -= v * (v - SNDBUF_SMALL) / (SNDBUF_BIG - SNDBUF_SMALL) / 2;
      |                               ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
  159 | #define SNDBUF_SMALL            (128UL * 1024)
      |                                  ^~~~~~~~~~~~
/home/pi/passt/tcp.c:731:17: note: perform multiplication in a wider type
  731 |                 v -= v * (v - SNDBUF_SMALL) / (SNDBUF_BIG - SNDBUF_SMALL) / 2;
      |                               ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
  159 | #define SNDBUF_SMALL            (128UL * 1024)
      |                                  ^~~~~

because, wherever we use those thresholds, we define the other term
of comparison as uint64_t. Define the thresholds as unsigned long long
as well, to make sure we match types.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-08 08:24:49 +01:00
Stefano Brivio
5f5e814cfc dhcpv6: Use for loop instead of goto to avoid false positive cppcheck warning
cppcheck 2.16.0 reports:

dhcpv6.c:334:14: style: The comparison 'ia_type == 3' is always true. [knownConditionTrueFalse]
 if (ia_type == OPT_IA_NA) {
             ^
dhcpv6.c:306:12: note: 'ia_type' is assigned value '3' here.
 ia_type = OPT_IA_NA;
           ^
dhcpv6.c:334:14: note: The comparison 'ia_type == 3' is always true.
 if (ia_type == OPT_IA_NA) {
             ^

this is not really the case as we set ia_type to OPT_IA_TA and then
jump back.

Anyway, there's no particular reason to use a goto here: add a trivial
foreach() macro to go through elements of an array and use it instead.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-08 08:24:11 +01:00
David Gibson
867db07fcf util: Work around cppcheck bug 6936
While experimenting with cppcheck options, I hit several false positives
caused by this bug: https://trac.cppcheck.net/ticket/13227

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:24 +01:00
David Gibson
1e76a19895 util: Remove unused ffsl() function
We supply a weak alias for ffsl() in case it's not defined in our libc.
Except.. we don't have any users for it any more, so remove it.

make cppcheck doesn't spot this at present for complicated reasons, but it
might with tweaks to the options I'm experimenting with.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:11 +01:00
David Gibson
c560e2f65b Makefile: Don't attempt to auto-detect stack size
We probe the available stack limit in the Makefile using rlimit, then use
that to set the size of the stack when we clone() extra threads.  But
the rlimit at compile time need not be the same as the rlimit at runtime,
so that's not particularly sensible.

Ideally, we'd set the stack size based on an estimate of the actual
maximum stack usage of all our clone()ed functions.  We don't have that
at the moment, but to keep things simple just set it to 1MiB - that's what
the current probe will set things to on my default configuration Fedora 40,
so it's likely to be fine in most cases.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:03 +01:00
Stefano Brivio
59fe34ee36 treewide: Suppress clang-tidy warning if we already use O_CLOEXEC
In pcap_init(), we should always open the packet capture file with
O_CLOEXEC, even if we're not running in foreground: O_CLOEXEC means
close-on-exec, not close-on-fork.

In logfile_init() and pidfile_open(), the fact that we pass a third
'mode' argument to open() seems to confuse the android-cloexec-open
checker in LLVM versions from 16 to 19 (at least).

The checker is suggesting to add O_CLOEXEC to 'mode', and not in
'flags', where we already have it.

Add a suppression for clang-tidy and a comment, and avoid repeating
those three times by adding a new helper, output_file_open().

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-30 12:37:31 +01:00
Stefano Brivio
744247856d treewide: Silence cert-err33-c clang-tidy warnings for fprintf()
We use fprintf() to print to standard output or standard error
streams. If something gets truncated or there's an output error, we
don't really want to try and report that, and at the same time it's
not abnormal behaviour upon which we should terminate, either.

Just silence the warning with an ugly FPRINTF() variadic macro casting
the fprintf() expressions to void.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-30 12:37:31 +01:00
Stefano Brivio
98efe7c2fd treewide: Comply with CERT C rule ERR33-C for snprintf()
clang-tidy, starting from LLVM version 16, up to at least LLVM version
19, now checks that we detect and handle errors for snprintf() as
requested by CERT C rule ERR33-C. These warnings were logged with LLVM
version 19.1.2 (at least Debian and Fedora match):

/home/sbrivio/passt/arch.c:43:3: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
   43 |                 snprintf(new_path, PATH_MAX + sizeof(".avx2"), "%s.avx2", exe);
      |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/arch.c:43:3: note: cast the expression to void to silence this warning
/home/sbrivio/passt/conf.c:577:4: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
  577 |                         snprintf(netns, PATH_MAX, "/proc/%ld/ns/net", pidval);
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/conf.c:577:4: note: cast the expression to void to silence this warning
/home/sbrivio/passt/conf.c:579:5: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
  579 |                                 snprintf(userns, PATH_MAX, "/proc/%ld/ns/user",
      |                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  580 |                                          pidval);
      |                                          ~~~~~~~
/home/sbrivio/passt/conf.c:579:5: note: cast the expression to void to silence this warning
/home/sbrivio/passt/pasta.c:105:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
  105 |         snprintf(ns, PATH_MAX, "/proc/%i/ns/net", pasta_child_pid);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/pasta.c:105:2: note: cast the expression to void to silence this warning
/home/sbrivio/passt/pasta.c:242:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
  242 |         snprintf(uidmap, BUFSIZ, "0 %u 1", uid);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/pasta.c:242:2: note: cast the expression to void to silence this warning
/home/sbrivio/passt/pasta.c:243:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
  243 |         snprintf(gidmap, BUFSIZ, "0 %u 1", gid);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/pasta.c:243:2: note: cast the expression to void to silence this warning
/home/sbrivio/passt/tap.c:1155:4: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
 1155 |                         snprintf(path, UNIX_PATH_MAX - 1, UNIX_SOCK_PATH, i);
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/tap.c:1155:4: note: cast the expression to void to silence this warning

Don't silence the warnings as they might actually have some merit. Add
an snprintf_check() function, instead, checking that we're not
truncating messages while printing to buffers, and terminate if the
check fails.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-10-30 12:37:25 +01:00
David Gibson
b8d4fac6a2 util, pif: Replace sock_l4() with pif_sock_l4()
The sock_l4() function is very convenient for creating sockets bound to
a given address, but its interface has some problems.

Most importantly, the address and port alone aren't enough in some cases.
For link-local addresses (at least) we also need the pif in order to
properly construct a socket adddress.  This case doesn't yet arise, but
it might cause us trouble in future.

Additionally, sock_l4() can take AF_UNSPEC with the special meaning that it
should attempt to create a "dual stack" socket which will respond to both
IPv4 and IPv6 traffic.  This only makes sense if there is no specific
address given.  We verify this at runtime, but it would be nicer if we
could enforce it structurally.

For sockets associated specifically with a single flow we already replaced
sock_l4() with flowside_sock_l4() which avoids those problems.  Now,
replace all the remaining users with a new pif_sock_l4() which also takes
an explicit pif.

The new function takes the address as an inany *, with NULL indicating the
dual stack case.  This does add some complexity in some of the callers,
however future planned cleanups should make this go away again.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2024-09-25 19:03:15 +02:00
David Gibson
bfc294b90d util: Add helper to write() all of a buffer
write(2) might not write all the data it is given.  Add a write_all_buf()
helper to keep calling it until all the given data is written, or we get an
error.

Currently we use write_remainder() to do this operation in pcap_frame().
That's a little awkward since it requires constructing an iovec, and future
changes we want to make to write_remainder() will be easier in terms of
this single buffer helper.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-09-18 17:14:59 +02:00
David Gibson
1daf6f4615 conf, fwd: Make ephemeral port logic more flexible
"Ephemeral" ports are those which the kernel may allocate as local
port numbers for outgoing connections or datagrams.  Because of that,
they're generally not good choices for listening servers to bind to.

Thefore when using -t all, -u all or exclude-only ranges, we map only
non-ephemeral ports.  Our logic for this is a bit rigid though: we
assume the ephemeral ports are always a fixed range at the top of the
port number space.  We also assume PORT_EPHEMERAL_MIN is a multiple of
8, or we won't set the forward bitmap correctly.

Make the logic in conf.c more flexible, using a helper moved into
fwd.[ch], although we don't change which ports we consider ephemeral
(yet).

The new handling is undoubtedly more computationally expensive, but
since it's a once-off operation at start off, I don't think it really
matters.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-29 22:25:51 +02:00
Stefano Brivio
6e9ecf5741 util: Provide own version of close_range(), and no-op fallback
musl, as of 1.2.5, and glibc < 2.34 don't ship a (trivial)
close_range() implementation. This will probably be added to musl
soon, by the way:
  https://www.openwall.com/lists/musl/2024/08/01/9

Add a weakly-aliased implementation, if it's supported by the kernel.
If it's not supported (< 5.9), use a no-op fallback. Looping over 2^31
file descriptors calling close() on them is probably not a good idea.

Reported-by: lemmi <lemmi@nerd2nerd.org>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-21 12:03:48 +02:00
David Gibson
90e83d50a9 Don't take "our" MAC address from the host
When sending frames to the guest over the tap link, we need a source MAC
address.  Currently we take that from the MAC address of the main interface
on the host, but that doesn't actually make much sense:
 * We can't preserve the real MAC address of packets from anywhere
   external so there's no transparency case here
 * In fact, it's confusingly different from how we handle IP addresses:
   whereas we give the guest the same IP as the host, we're making the
   host's MAC the one MAC that the guest *can't* use for itself.
 * We already need a fallback case if the host doesn't have an Ethernet
   like MAC (e.g. if it's connected via a point to point interface, such
   as a wireguard VPN).

Change to just just use an arbitrary fixed MAC address - I've picked
9a:55:9a:55:9a:55.  It's simpler and has the small advantage of making
the fact that passt/pasta is in use typically obvious from guest side
packet dumps.  This can still, of course, be overridden with the -M option.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:28 +02:00
David Gibson
066e69986b util: Helper for formatting MAC addresses
There are a couple of places where we somewhat messily open code formatting
an Ethernet like MAC address for display.  Add an eth_ntop() helper for
this.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 11:59:51 +02:00
Stefano Brivio
09603cab28 passt, util: Close any open file that the parent might have leaked
If a parent accidentally or due to implementation reasons leaks any
open file, we don't want to have access to them, except for the file
passed via --fd, if any.

This is the case for Podman when Podman's parent leaks files into
Podman: it's not practical for Podman to close unrelated files before
starting pasta, as reported by Paul.

Use close_range(2) to close all open files except for standard streams
and the one from --fd.

Given that parts of conf() depend on other files to be already opened,
such as the epoll file descriptor, we can't easily defer this to a
more convenient point, where --fd was already parsed. Introduce a
minimal, duplicate version of --fd parsing to keep this simple.

As we need to check that the passed --fd option doesn't exceed
INT_MAX, because we'll parse it with strtol() but file descriptor
indices are signed ints (regardless of the arguments close_range()
take), extend the existing check in the actual --fd parsing in conf(),
also rejecting file descriptors numbers that match standard streams,
while at it.

Suggested-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Paul Holzinger <pholzing@redhat.com>
2024-08-08 21:31:25 +02:00
David Gibson
95569e4aa4 util: Some corrections for timespec_diff_us
The comment for timespec_diff_us() claims it will wrap after 2^64µs.  This
is incorrect for two reasons:
  * It returns a long long, which is probably 64-bits, but might not be
  * It returns a signed value, so even if it is 64 bits it will wrap after
    2^63µs

Correct the comment and use an explicitly 64-bit type to avoid that
imprecision.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-07 09:15:40 +02:00
David Gibson
031df332e9 util: Use unsigned (size_t) value for iov length
The "correct" type for the length of an IOV is unclear: writev() and
readv() use an int, but sendmsg() and recvmsg() use a size_t.  Using the
unsigned size_t has some advantages, though, and it makes more sense for
the case of write_remainder.  Using size_t throughout here means we don't
have a signed vs. unsigned comparison, and we don't have to deal with
the case of iov_skip_bytes() returning a value which becomes negative
when assigned to an integer.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-06 15:01:46 +02:00
David Gibson
37e3b24d90 tap: Correctly handle frames of odd length
The Qemu socket protocol consists of a 32-bit frame length in network (BE)
order, followed by the Ethernet frame itself.  As far as I can tell,
frames can be any length, with no particular alignment requirement.  This
means that although pkt_buf itself is aligned, if we have a frame of odd
length, frames after it will have their frame length at an unaligned
address.

Currently we load the frame length by just casting a char pointer to
(uint32_t *) and loading.  Some platforms will generate a fatal trap on
such an unaligned load.  Even if they don't casting an incorrectly aligned
pointer to (uint32_t *) is undefined behaviour, strictly speaking.

Introduce a new helper to safely load a possibly unaligned value here.  We
assume that the compiler is smart enough to optimize this into nothing on
platforms that provide performant unaligned loads.  If that turns out not
to be the case, we can look at improvements then.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-26 14:07:38 +02:00
Stefano Brivio
327d9d482f log, util: Fix sub-second part in relative log time calculation
For some reason, in commit 01efc71ddd25 ("log, conf: Add support for
logging to file"), I added calculations for relative logging
timestamps using the difference for the seconds part only, not for
accounting for the fractional part.

Fix that by storing the initial timestamp, log_start, as a timespec
struct, and by calculating the difference from the starting time. Do
this in a macro as we need the same format in a few places.

To calculate the difference, turn the existing timespec_diff_ms() to
microseconds, timespec_diff_us(), and rewrite timespec_diff_ms() to
use that.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-07-26 13:43:19 +02:00
David Gibson
781164e25b flow: Helper to create sockets based on flowside
We have upcoming use cases where it's useful to create new bound socket
based on information from the flow table.  Add flowside_sock_l4() to do
this for either PIF_HOST or PIF_SPLICE sockets.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:33:23 +02:00
David Gibson
2fa91ee391 udp: Handle errors on UDP sockets
Currently we ignore all events other than EPOLLIN on UDP sockets.  This
means that if we ever receive an EPOLLERR event, we'll enter an infinite
loop on epoll, because we'll never do anything to clear the error.

Luckily that doesn't seem to have happened in practice, but it's certainly
fragile.  Furthermore changes in how we handle UDP sockets with the flow
table mean we will start receiving error events.

Add handling of EPOLLERR events.  For now we just read the error from the
error queue (thereby clearing the error state) and print a debug message.
We can add more substantial handling of specific events in future if we
want to.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-17 07:05:21 +02:00
David Gibson
74c1c5efcf util: sock_l4() determine protocol from epoll type rather than the reverse
sock_l4() creates a socket of the given IP protocol number, and adds it to
the epoll state.  Currently it determines the correct tag for the epoll
data based on the protocol.  However, we have some future cases where we
might want different semantics, and therefore epoll types, for sockets of
the same protocol.  So, change sock_l4() to take the epoll type as an
explicit parameter, and determine the protocol from that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:09 +02:00
David Gibson
3f63743a65 util: Use 'long' to represent millisecond durations
timespec_diff_ms() returns an int representing a duration in milliseconds.
This will overflow in about 25 days when an int is 32 bits.  The way we
use this function, we're probably not going to get a result that long, but
it's not outrageously implausible.  Use a long for safety.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-07 20:44:44 +02:00
David Gibson
bda80ef53f util: Use unsigned indices for bits in bitmaps
A negative bit index in a bitmap doesn't make sense.  Avoid this by
construction by using unsigned indices.  While we're there adjust
bitmap_isset() to return a bool instead of an int.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-06-07 20:44:44 +02:00
Stefano Brivio
ba23b05545 passt, util: Move opening of PID file to its own function
We won't call it from main() any longer: move it.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
2024-05-23 16:43:13 +02:00
Stefano Brivio
57d8aa8ffe util: Rename write_pidfile() to pidfile_write()
As I'm adding pidfile_open() in the next patch. The function comment
didn't match, by the way.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
2024-05-23 16:43:05 +02:00
David Gibson
1a20370b36 util, tcp: Add helper to display socket addresses
When reporting errors, we sometimes want to show a relevant socket address.
Doing so by extracting the various relevant fields can be pretty awkward,
so introduce a sockaddr_ntop() helper to make it simpler.  For now we just
have one user in tcp.c, but I have further upcoming patches which can make
use of it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-22 23:20:37 +02:00
Laurent Vivier
95601237ef tcp: Replace TCP buffer structure by an iovec array
To be able to provide pointers to TCP headers and IP headers without
worrying about alignment in the structure, split the structure into
several arrays and point to each part of the frame using an iovec array.

Using iovec also allows us to simply ignore the first entry when the
vnet length header is not needed.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-19 11:21:09 +02:00
Stefano Brivio
27f1c762b1 conf: Don't fail if the template interface doesn't have a MAC address
...simply resort to using locally-administered address (LAA) as
host-side source, instead.

Pick 02:00:00:00:00:00, to make it clear that we don't actually care
about that address, and also to match the 00 (Administratively
Assigned Identifier) quadrant of SLAP (RFC 8948).

Otherwise, pasta refuses to start if the template is a tun or
Wireguard interface.

Link: https://bugs.passt.top/show_bug.cgi?id=49
Link: https://github.com/containers/podman/issues/22320
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-19 11:21:00 +02:00
Stefano Brivio
ee338a256e pasta, util: Align stack area for clones to maximum natural alignment
Given that we use this stack pointer as a location to store arbitrary
data types from the cloned process, we need to guarantee that its
alignment matches any of those possible data types.

runsisi reports that pasta gets a SIGBUS in pasta_open_ns() on
aarch64, where the alignment requirement for stack pointers is a
16 bytes (same as the size of a long double), and similar requirements
actually apply to most architectures we run on.

Reported-by: runsisi <runsisi@hust.edu.cn>
Link: https://bugs.passt.top/show_bug.cgi?id=85
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-04-19 11:15:27 +02:00
David Gibson
67a6258918 util: Add helper to return name of address family
We have a few places where we want to include the name of the internet
protocol version (IPv4 or IPv6) in a message, which we handle with an
open-coded ?: expression.

This seems like something that might be more widely useful, so make a
trivial helper to return the correct string based on the address family.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-04-05 16:59:05 +02:00
Stefano Brivio
3fe9878db7 pcap: Use clock_gettime() instead of gettimeofday()
POSIX.1-2008 declared gettimeofday() as obsolete, but I'm a dinosaur.

Usually, C libraries translate that to the clock_gettime() system
call anyway, but this doesn't happen in Jon's environment, and,
there, seccomp happily kills pasta(1) when started with --pcap,
because we didn't add gettimeofday() to our seccomp profiles.

Use clock_gettime() instead.

Reported-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-03-14 08:18:36 +01:00
Laurent Vivier
324bd46782 util: move IP stuff from util.[ch] to ip.[ch]
Introduce ip.[ch] file to encapsulate IP protocol handling functions and
structures.  Modify various files to include the new header ip.h when
it's needed.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-ID: <20240303135114.1023026-5-lvivier@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-06 08:03:38 +01:00
David Gibson
e196eada6f util: Allow IN4_IS_* macros to operate on untyped addresses
The IN4_IS_*() macros expect a pointer to a struct in_addr.  That
makes sense, but sometimes we have an IPv4 address as a void * pointer
or union type which makes these less convenient.  Additionally, this
doesn't match the behaviour of the standard library's IN6_IS_*()
macros on which they're modelled, nor our own IN4_ARE_ADDR_EQUAL().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:47:35 +01:00
David Gibson
8bdb0883b4 util: Add write_remainder() helper
We have several places where we want to write(2) a buffer or buffers and we
handle short write()s by retrying until everything is successfully written.
Add a helper for this in util.c.

This version has some differences from the typical write_all() function.
First, take an IO vector rather than a single buffer, because that will be
useful for some of our cases.  Second, allow it to take an parameter to
skip the first n bytes of the given buffers.  This will be useful for some
of the cases we want, and also falls out quite naturally from the
implementation.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Minor formatting fixes in write_remainder()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 06:25:17 +01:00
David Gibson
4e08d9b9c6 treewide: Use sa_family_t for address family variables
Sometimes we use sa_family_t for variables and parameters containing a
socket address family, other times we use a plain int.  Since sa_family_t
is what's actually used in struct sockaddr and friends, standardise on
that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-27 12:52:02 +01:00
David Gibson
eae4304000 util: Use htonl_constant() in more places
We might as well when we're passing a known constant value, giving the
compiler the best chance to optimise things away.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-27 19:29:45 +01:00
David Gibson
073f530bfe treewide: Add IN4ADDR_ANY_INIT macro
We already define IN4ADDR_LOOPBACK_INIT to initialise a struct in_addr to
the loopback address, make a similar one for the unspecified / any address.
This avoids messying things with the internal structure of struct in_addr
where we don't care about it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-27 19:29:45 +01:00
David Gibson
5913f26415 tcp: Switch hash table to linear probing instead of chaining
Currently we deal with hash collisions by letting a hash bucket contain
multiple entries, forming a linked list using an index in the connection
structure.

That's a pretty standard and simple approach, but in our case we can use
an even simpler one: linear probing.  Here if a hash bucket is occupied
we just move onto the next one until we find a feww one.  This slightly
simplifies lookup and more importantly saves some precious bytes in the
connection structure by removing the need for a link.  It does require some
additional complexity for hash removal.

This approach can perform poorly with hash table load is high.  However, we
already size our hash table of pointers larger than the connection table,
which puts an upper bound on the load.  It's relatively cheap to decrease
that bound if we find we need to.

I adapted the linear probing operations from Knuth's The Art of Computer
Programming, Volume 3, 2nd Edition.  Specifically Algorithm L and Algorithm
R in Section 6.4.  Note that there is an error in Algorithm R as printed,
see errata at [0].

[0] https://www-cs-faculty.stanford.edu/~knuth/all3-prepre.ps.gz

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-27 19:29:45 +01:00
David Gibson
9d44aba7e0 util: MAX_FROM_BITS() should be unsigned
MAX_FROM_BITS() computes the maximum value representable in a number of
bits.  The expression for that is an unsigned value, but we explicitly cast
it to a signed int.  It looks like this is because one of the main users is
for FD_REF_MAX, which is used to bound fd values, typically stored as a
signed int.

The value MAX_FROM_BITS() is calculating is naturally non-negative, though,
so it makes more sense for it to be unsigned, and to move the case to the
definition of FD_REF_MAX.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-04 09:51:06 +01:00
Stefano Brivio
bae676a44e port_fwd, util: Include additional headers to fix build with musl
lseek() is declared in unistd.h, and stdio.h provides sscanf().
Include these two headers in port_fwd.c.

SIGCHLD, even if used exclusively for clone(), is defined in
signal.h: add the include to util.h, as NS_CALL needs it.

Reported-by: lemmi <lemmi@nerd2nerd.org>
Link: https://github.com/void-linux/void-packages/actions/runs/6999782606/job/19039526604#step:7:57
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-12-02 03:54:55 +01:00
Stefano Brivio
4117bd94f9 port_fwd, util: Don't bind UDP ports with opposite-side bound TCP ports
When pasta periodically scans bound ports and binds them on the other
side in order to forward traffic, we bind UDP ports for corresponding
TCP port numbers, too, to support protocols and applications such as
iperf3 which use UDP port numbers matching the ones used by the TCP
data connection.

If we scan UDP ports in order to bind UDP ports, we skip detection of
the UDP ports we already bound ourselves, to avoid looping back our
own ports. Same with scanning and binding TCP ports.

But if we scan for TCP ports in order to bind UDP ports, we need to
skip bound TCP ports too, otherwise, as David pointed out:

- we find a bound TCP port on side A, and bind the corresponding TCP
  and UDP ports on side B

- at the next periodic scan, we find that UDP port bound on side B,
  and we bind the corresponding UDP port on side A

- at this point, we unbind that UDP port on side B: we would
  otherwise loop back our own port.

To fix this, we need to avoid binding UDP ports that we already
bound, on the other side, as a consequence of finding a corresponding
bound TCP port.

Reproducing this issue is straightforward:

  ./pasta -- iperf3 -s

  # Wait one second, then from another terminal:
  iperf3 -c ::1 -u

Reported-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Analysed-by: David Gibson <david@gibson.dropbear.id.au>
Fixes: 457ff122e33c ("udp,pasta: Periodically scan for ports to automatically forward")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-22 07:19:36 +01:00
David Gibson
4f0b9f91e4 util: Add open_in_ns() helper
Most of our helpers which need to enter the pasta network namespace are
quite specialised.  Add one which is rather general - it just open()s a
given file in the namespace context and returns the fd back to the main
namespace.  This will have some future uses.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:53:18 +01:00
David Gibson
e90f2770ae port_fwd: Move automatic port forwarding code to port_fwd.[ch]
The implementation of scanning /proc files to do automatic port forwarding
is a bit awkwardly split between procfs_scan_listen() in util.c,
get_bound_ports() and related functions in conf.c and the initial setup
code in conf().

Consolidate all of this into port_fwd.h, which already has some related
definitions, and a new port_fwd.c.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:53:14 +01:00
David Gibson
6471c7d01b cppcheck: Make many pointers const
Newer versions of cppcheck (as of 2.12.0, at least) added a warning for
pointers which could be declared to point at const data, but aren't.
Based on that, make many pointers throughout the codebase const.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-10-04 23:23:35 +02:00
David Gibson
c1d2a070f2 util: Consolidate and improve workarounds for clang-tidy issue 58992
We have several workarounds for a clang-tidy bug where the checker doesn't
recognize that a number of system calls write to - and therefore initialise
- a socket address.  We can't neatly use a suppression, because the bogus
warning shows up some time after the actual system call, when we access
a field of the socket address which clang-tidy erroneously thinks is
uninitialised.

Consolidate these workarounds into one place by using macros to implement
wrappers around affected system calls which add a memset() of the sockaddr
to silence clang-tidy.  This removes the need for the individual memset()
workarounds at the callers - and the somewhat longwinded explanatory
comments.

We can then use a #define to not include the hack in "real" builds, but
only consider it for clang-tidy.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-27 17:26:06 +02:00
David Gibson
5b6c68c2e4 Avoid shadowing index(3)
A classic gotcha of the standard C library is that its unwise to call any
variable 'index' because it will shadow the standard string library
function index(3).  This can cause warnings from cppcheck amongst others,
and it also means that if the variable is removed you tend to get confusing
type errors (or sometimes nothing at all) instead of a nice simple "name is
not defined" error.

Strictly speaking this only occurs if <string.h> is included, but that
is so common that as a rule it's best to just avoid it always.  We
have a number of places which hit this trap, so rename variables and
parameters to avoid it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-27 17:25:51 +02:00