mirror of https://passt.top/passt synced 2024-12-22 13:45:32 +00:00
Commit Graph

1783 Commits

Author SHA1 Message Date
David Gibson
71a16dbc49 tcp: Move tcp_l2_buf_fill_headers() to tcp_buf.c
This function only has callers in tcp_buf.c.  More importantly, it's
inherently tied to the "buf" path, because it uses internal knowledge of
how we lay out the various headers across our locally allocated buffers.

Therefore, move it to tcp_buf.c.

Slightly reformat the prototypes while we're at it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
2024-11-15 10:55:53 +01:00
David Gibson
3958736de5 tcp_vu: Share more header construction between IPv4 and IPv6 paths
tcp_vu_send_flag() and tcp_vu_prepare() both need to do some different
things for IPv4 vs. IPv6.  However the two paths have a number of lines of
duplicated code.  We can share those at the expense of an additional
conditional (which we might be able to simplify again later).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
2024-11-15 10:55:53 +01:00
Stefano Brivio
9392ea7e5a test: Add tests for passt in vhost-user mode
Run functional and performance tests for vhost-user mode as well. For
functional tests, we add passt_vu and passt_vu_in_ns as symbolic links
to their non-vhost-user counterparts, as no differences are intended
but we want to distinguish them in test logs.

For performance tests, instead, we add separate perf/passt_vu_tcp and
perf/passt_vu_udp files, as we need longer test duration, as well as
higher UDP sending bandwidths and larger TCP windows, to actually get
the highest throughput vhost-user mode offers.

For valgrind tests, vhost-user mode needs two extra system calls:
statx and readlink. Add them as EXTRA_SYSCALLS for the valgrind
target.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-15 10:55:53 +01:00
Laurent Vivier
92fe7e967a vhost-user: add vhost-user
add virtio and vhost-user functions to connect with QEMU.

  $ ./passt --vhost-user

and

  # qemu-system-x86_64 ... -m 4G \
        -object memory-backend-memfd,id=memfd0,share=on,size=4G \
        -numa node,memdev=memfd0 \
        -chardev socket,id=chr0,path=/tmp/passt_1.socket \
        -netdev vhost-user,id=netdev0,chardev=chr0 \
        -device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \
        ...

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
2024-11-15 10:55:53 +01:00
Laurent Vivier
007af94bb9 passt: rename tap_sock_init() to tap_backend_init()
Extract pool storage initialization loop to tap_sock_update_pool(),
extract QEMU hints to tap_backend_show_hints().

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-15 10:54:01 +01:00
Laurent Vivier
1ceee36c57 tcp: Export headers functions
Export tcp_fill_headers[4|6]() and tcp_update_check_tcp[4|6]().

They'll be needed by vhost-user.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-15 10:53:40 +01:00
Laurent Vivier
7f6b184fb8 udp: Prepare udp.c to be shared with vhost-user
Export udp_payload_t, udp_update_hdr4(), udp_update_hdr6() and
udp_sock_errs().

Rename udp_listen_sock_handler() to udp_buf_listen_sock_handler() and
udp_reply_sock_handler() to udp_buf_reply_sock_handler().

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-15 10:53:40 +01:00
Laurent Vivier
23cc8f892f vhost-user: introduce vhost-user API
Add vhost_user.c and vhost_user.h that define the functions needed
to implement vhost-user backend.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
2024-11-15 10:53:40 +01:00
Laurent Vivier
119b45358c vhost-user: introduce virtio API
Add virtio.c and virtio.h that define the functions needed
to manage virtqueues.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
2024-11-15 10:53:40 +01:00
Laurent Vivier
8ac20f4795 packet: replace struct desc by struct iovec
To be able to manage buffers inside a shared memory provided
by a VM via a vhost-user interface, we cannot rely on the fact
that buffers are located in a pre-defined memory area and use
a base address and a 32bit offset to address them.

We need a 64bit address, so replace struct desc by struct iovec
and update range checking.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-15 10:53:40 +01:00
Stefano Brivio
7f6c10626d selinux: Use auth_read_passwd() interface for all our getpwnam() needs
If passt or pasta are started as root, we need to read the passwd file
(be it /etc/passwd or whatever sssd provides) to find out UID and GID
of 'nobody' so that we can switch to it.

Instead of a bunch of allow rules for passwd_file_t and sssd macros,
use the more convenient auth_read_passwd() interface which should
cover our usage of getpwnam().

The existing rules weren't actually enough:

  # strace -e openat passt -f
  [...]
  Started as root, will change to nobody.
  openat(AT_FDCWD, "/etc/nsswitch.conf", O_RDONLY|O_CLOEXEC) = 4
  openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 4
  openat(AT_FDCWD, "/lib64/libnss_sss.so.2", O_RDONLY|O_CLOEXEC) = 4
  openat(AT_FDCWD, "/var/lib/sss/mc/passwd", O_RDONLY|O_CLOEXEC) = -1 EACCES (Permission denied)
  openat(AT_FDCWD, "/var/lib/sss/mc/passwd", O_RDONLY|O_CLOEXEC) = -1 EACCES (Permission denied)
  openat(AT_FDCWD, "/etc/passwd", O_RDONLY|O_CLOEXEC) = 4

with corresponding SELinux warnings logged in audit.log.

Reported-by: Minxi Hou <mhou@redhat.com>
Analysed-by: Miloš Malik <mmalik@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-14 23:41:52 +01:00
David Gibson
6e1e44293e ndp: Send unsolicited Router Advertisements
Currently, our NDP implementation only sends Router Advertisements (RA)
when it receives a Router Solicitation (RS) from the guest.  However,
RFC 4861 requires that we periodically send unsolicited RAs.

Linux as a guest also requires this: it will send an RS when a link first
comes up, but the route it gets from this will have a finite lifetime (we
set this to 65535s, the maximum allowed, around 18 hours).  When that
expires the guest will not send a new RS, but instead expects the route to
have been renewed (if still valid) by an unsolicited RA.

Implement sending unsolicited RAs on a partially randomised timer, as
required by RFC 4861.  The RFC also specifies that solicited RAs should
also be delayed, or even omitted, if the next unsolicited RA is soon
enough.  For now we don't do that, always sending an immediate RA in
response to an RS.  We can get away with this because in our use cases
we expect to just have passt itself and the guest on the link, rather than
a large broadcast domain.

Link: https://github.com/kubevirt/kubevirt/issues/13191
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-14 19:00:40 +01:00
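A minimal sketch of the partially randomised timer described in the entry above, not passt's actual code. The bounds are the RFC 4861 router defaults (MaxRtrAdvInterval of 600s, MinRtrAdvInterval of 0.33 times that), assumed here for illustration, and ra_next_interval() is a hypothetical name. It relies on libc's random(), which the caller is assumed to have seeded with srandom().

```c
#include <assert.h>
#include <stdlib.h>

/* RFC 4861 router defaults, assumed here for illustration only */
#define RA_MAX_INTERVAL_S	600
#define RA_MIN_INTERVAL_S	198	/* 0.33 * RA_MAX_INTERVAL_S */

/* Hypothetical helper: pick the delay, in seconds, before the next
 * unsolicited RA, somewhere in [min_s, max_s]. The slight modulo bias
 * doesn't matter, as we only want to scatter timings.
 */
static long ra_next_interval(long min_s, long max_s)
{
	assert(min_s >= 0 && min_s <= max_s);

	return min_s + random() % (max_s - min_s + 1);
}
```

A timer handler would presumably call ra_next_interval(RA_MIN_INTERVAL_S, RA_MAX_INTERVAL_S) each time it sends an RA, and re-arm itself with the result.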
David Gibson
b39760cc7d passt: Seed libc's pseudo random number generator
We have an upcoming case where we need pseudo-random numbers to scatter
timings, but we don't need cryptographically strong random numbers.  libc's
built in random() is fine for this purpose, but we should seed it.  Extend
secret_init() - the only current user of random numbers - to do this as
well as generating the SipHash secret.  Using /dev/random for a PRNG seed
is probably overkill, but it's simple and we only do it once, so we might
as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-14 19:00:38 +01:00
David Gibson
71d5deed5e util: Add general low-level random bytes helper
Currently secret_init() open codes getting good quality random bytes from
the OS, either via getrandom(2) or reading /dev/random.  We're going to
add at least one more place that needs random data in future, so make a
general helper for getting random bytes.  While we're there, fix a number
of minor bugs:
 - getrandom() can theoretically return a "short read", so handle that case
 - getrandom() as well as read can return a transient EINTR
 - We would attempt to read data from /dev/random if we failed to open it
   (open() returns -1), but not if we opened it as fd 0 (unlikely, but ok)
 - More specific error reporting

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-14 19:00:36 +01:00
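The short read and EINTR cases listed in the entry above can be sketched as follows. This is an illustration under assumptions, not the helper from the commit: random_bytes() is a hypothetical name, the getrandom() flags are left at 0, and the /dev/random fallback is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/random.h>
#include <sys/types.h>

/* Hypothetical helper: fill buf with len random bytes from getrandom(2).
 * Returns 0 on success, -1 on error with errno set.
 */
static int random_bytes(void *buf, size_t len)
{
	unsigned char *p = buf;
	size_t done = 0;

	assert(buf || !len);

	while (done < len) {
		ssize_t n = getrandom(p + done, len - done, 0);

		if (n < 0) {
			if (errno == EINTR)	/* Transient, try again */
				continue;
			return -1;
		}

		done += (size_t)n;	/* Short read: ask for the rest */
	}

	return 0;
}
```

Looping on the remaining length handles both cases with the same code path, so callers only need to check a single return value.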
David Gibson
a60703e899 ndp: Make route lifetime a #define
Currently we open-code the lifetime of the route we advertise via NDP to be
65535s (the maximum).  Change it to a #define.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-14 19:00:34 +01:00
David Gibson
36c070e6e3 ndp: Use struct assignment in preference to memcpy() for IPv6 addresses
There are a number of places we can simply assign IPv6 addresses about,
rather than the current mildly ugly memcpy().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-14 19:00:31 +01:00
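For comparison, the two ways of copying an IPv6 address mentioned in the entry above might look like this. The function names are hypothetical and only serve to put the two forms side by side.

```c
#include <assert.h>
#include <netinet/in.h>
#include <string.h>

/* Copy with memcpy(): works, but the size is spelled out by hand */
static void addr_copy_memcpy(struct in6_addr *dst, const struct in6_addr *src)
{
	assert(dst && src);
	memcpy(dst, src, sizeof(*dst));
}

/* Copy by struct assignment: type-checked, no size to get wrong */
static void addr_copy_assign(struct in6_addr *dst, const struct in6_addr *src)
{
	assert(dst && src);
	*dst = *src;
}
```

Both produce the same bytes in dst. The assignment form also lets the compiler reject a source of the wrong type.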
David Gibson
cbc83e14df ndp: Split out helpers for sending specific NDP message types
Currently the large ndp() function responds to all NDP messages we handle,
both parsing the message as necessary and sending the response.  Split out
the code to construct and send specific message types into ndp_na() (to
send NA messages) and ndp_ra() (to send RA messages).

As well as breaking up an excessively large function, this is a first step
to being able to send unsolicited NDP messages.

While we're there, remove a slightly ugly goto.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-14 19:00:29 +01:00
David Gibson
4e47167035 ndp: Add ndp_send() helper
ndp() has a conditional on message type generating the reply message, then
a tiny amount of common code, then another conditional to send the reply
with slightly different parameters.  We can make this a bit neater by
making a helper function for sending the reply, and call it from each of
the different message type paths.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-14 19:00:28 +01:00
David Gibson
71f228d04b ndp: Remove redundant update to addr_seen
ndp() updates addr_seen or addr_ll_seen based on the source address of the
received packet.  This is redundant since tap6_handler() has already
updated addr_seen for any type of packet, not just NDP.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-14 19:00:13 +01:00
David Gibson
0588163b1f cppcheck: Don't check the system headers
We pass -I options to cppcheck so that it will find the system headers.
Then we need to pass a bunch more options to suppress the zillions of
cppcheck errors found in those headers.

It turns out, however, that it's not recommended to give the system headers
to cppcheck anyway.  Instead it has built-in knowledge of the ANSI libc and
uses that as the basis of its checks.  We do need to suppress
missingIncludeSystem warnings instead though.

Not bothering with the system headers makes the cppcheck runtime go from
~37s to ~14s on my machine, which is a pretty nice win.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-08 08:26:21 +01:00
David Gibson
14dd70e2b3 linux_dep: Fix CLOSE_RANGE_UNSHARE availability handling
If CLOSE_RANGE_UNSHARE isn't defined, we define a fallback version of
close_range() which is a (successful) no-op.  This is broken in several
ways:
 * It doesn't actually fix compile if using old kernel headers, because
   the caller of close_range() still directly uses CLOSE_RANGE_UNSHARE
   unprotected by ifdefs
 * Even if it did fix the compile, it means inconsistent behaviour between
   a compile time failure to find the value (we silently don't close files)
   and a runtime failure (we die with an error from close_range())
 * Silently not closing the files we intend to close for security reasons
   is probably not a good idea in any case

We don't want to simply error if close_range() or CLOSE_RANGE_UNSHARE isn't
available, because that would require running on kernel >= 5.9.  On the
other hand there's not really any other way to flush all possible fds
leaked by the parent (close() in a loop takes over a minute).  So in this
case print a warning and carry on.

As a bonus this fixes a cppcheck error I see with some different options I'm
looking to apply in future.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-08 08:26:17 +01:00
David Gibson
d64f257243 linux_dep: Move close_range() conditional handling to linux_dep.h
util.h has some #ifdefs and weak definitions to handle compatibility with
various kernel versions.  Move this to linux_dep.h which handles several
other similar cases.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-08 08:26:15 +01:00
David Gibson
b84cd05098 log: Only check for FALLOC_FL_COLLAPSE_RANGE availability at runtime
log.c has several #ifdefs on FALLOC_FL_COLLAPSE_RANGE that won't attempt
to use it if not defined.  But even if the value is defined at compile
time, it might not be available in the runtime kernel, so we need to check
for errors from a fallocate() call and fall back to other methods.

Simplify this to only need the runtime check by using linux_dep.h to define
FALLOC_FL_COLLAPSE_RANGE if it's not in the kernel headers.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-08 08:25:58 +01:00
Stefano Brivio
58fa5508bd tap, tcp, util: Add some missing SOCK_CLOEXEC flags
I have no idea why, but these are reported by clang-tidy (19.1.2) on
Alpine (x86) only:

/home/sbrivio/passt/tap.c:1139:38: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
 1139 |         int fd = socket(AF_UNIX, SOCK_STREAM, 0);
      |                                             ^
      |                                              | SOCK_CLOEXEC
/home/sbrivio/passt/tap.c:1158:51: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
 1158 |                 ex = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK, 0);
      |                                                                 ^
      |                                                                  | SOCK_CLOEXEC
/home/sbrivio/passt/tcp.c:1413:44: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
 1413 |         s = socket(af, SOCK_STREAM | SOCK_NONBLOCK, IPPROTO_TCP);
      |                                                   ^
      |                                                    | SOCK_CLOEXEC
/home/sbrivio/passt/util.c:188:38: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
  188 |         if ((s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP)) < 0) {
      |                                             ^
      |                                              | SOCK_CLOEXEC

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-08 08:24:58 +01:00
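As an illustration of what the check in the entry above asks for, here is a small sketch with hypothetical helper names: one function opens a socket with SOCK_CLOEXEC set at creation, the other reads the descriptor flags back to confirm it.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/socket.h>

/* Open a UNIX domain stream socket that is closed automatically on
 * exec(), so that it can't leak into a program we execute later.
 */
static int unix_sock_cloexec(void)
{
	return socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
}

/* Return 1 if fd has FD_CLOEXEC set, 0 if not, -1 on error */
static int fd_is_cloexec(int fd)
{
	int flags;

	assert(fd >= 0);

	flags = fcntl(fd, F_GETFD);
	if (flags < 0)
		return -1;

	return !!(flags & FD_CLOEXEC);
}
```

Setting the flag in the socket() call itself, rather than with a later fcntl(), means there is no moment at which the descriptor exists without it.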
Stefano Brivio
71869e2912 passt: Use NOLINT clang-tidy block instead of NOLINTNEXTLINE
For some reason, this is only reported by clang-tidy 19.1.2 on
Alpine:

/home/sbrivio/passt/passt.c:314:53: error: conditional operator with identical true and false expressions [bugprone-branch-clone,-warnings-as-errors]
  314 |         nfds = epoll_wait(c.epollfd, events, EPOLL_EVENTS, TIMER_INTERVAL);
      |                                                            ^

We do have a suppression, but not on the line preceding it, because
we also need a cppcheck suppression there. Use NOLINTBEGIN/NOLINTEND
for the clang-tidy suppression.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-08 08:24:52 +01:00
Stefano Brivio
d4f09c9b96 util: Define small and big thresholds for socket buffers as unsigned long long
On 32-bit architectures, clang-tidy reports:

/home/pi/passt/tcp.c:728:11: error: performing an implicit widening conversion to type 'uint64_t' (aka 'unsigned long long') of a multiplication performed in type 'unsigned long' [bugprone-implicit-widening-of-multiplication-result,-warnings-as-errors]
  728 |         if (v >= SNDBUF_BIG)
      |                  ^
/home/pi/passt/util.h:158:22: note: expanded from macro 'SNDBUF_BIG'
  158 | #define SNDBUF_BIG              (4UL * 1024 * 1024)
      |                                  ^
/home/pi/passt/tcp.c:728:11: note: make conversion explicit to silence this warning
  728 |         if (v >= SNDBUF_BIG)
      |                  ^
/home/pi/passt/util.h:158:22: note: expanded from macro 'SNDBUF_BIG'
  158 | #define SNDBUF_BIG              (4UL * 1024 * 1024)
      |                                  ^~~~~~~~~~~~~~~~~
/home/pi/passt/tcp.c:728:11: note: perform multiplication in a wider type
  728 |         if (v >= SNDBUF_BIG)
      |                  ^
/home/pi/passt/util.h:158:22: note: expanded from macro 'SNDBUF_BIG'
  158 | #define SNDBUF_BIG              (4UL * 1024 * 1024)
      |                                  ^~~~~~~~~~
/home/pi/passt/tcp.c:730:15: error: performing an implicit widening conversion to type 'uint64_t' (aka 'unsigned long long') of a multiplication performed in type 'unsigned long' [bugprone-implicit-widening-of-multiplication-result,-warnings-as-errors]
  730 |         else if (v > SNDBUF_SMALL)
      |                      ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
  159 | #define SNDBUF_SMALL            (128UL * 1024)
      |                                  ^
/home/pi/passt/tcp.c:730:15: note: make conversion explicit to silence this warning
  730 |         else if (v > SNDBUF_SMALL)
      |                      ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
  159 | #define SNDBUF_SMALL            (128UL * 1024)
      |                                  ^~~~~~~~~~~~
/home/pi/passt/tcp.c:730:15: note: perform multiplication in a wider type
  730 |         else if (v > SNDBUF_SMALL)
      |                      ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
  159 | #define SNDBUF_SMALL            (128UL * 1024)
      |                                  ^~~~~
/home/pi/passt/tcp.c:731:17: error: performing an implicit widening conversion to type 'uint64_t' (aka 'unsigned long long') of a multiplication performed in type 'unsigned long' [bugprone-implicit-widening-of-multiplication-result,-warnings-as-errors]
  731 |                 v -= v * (v - SNDBUF_SMALL) / (SNDBUF_BIG - SNDBUF_SMALL) / 2;
      |                               ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
  159 | #define SNDBUF_SMALL            (128UL * 1024)
      |                                  ^
/home/pi/passt/tcp.c:731:17: note: make conversion explicit to silence this warning
  731 |                 v -= v * (v - SNDBUF_SMALL) / (SNDBUF_BIG - SNDBUF_SMALL) / 2;
      |                               ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
  159 | #define SNDBUF_SMALL            (128UL * 1024)
      |                                  ^~~~~~~~~~~~
/home/pi/passt/tcp.c:731:17: note: perform multiplication in a wider type
  731 |                 v -= v * (v - SNDBUF_SMALL) / (SNDBUF_BIG - SNDBUF_SMALL) / 2;
      |                               ^
/home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL'
  159 | #define SNDBUF_SMALL            (128UL * 1024)
      |                                  ^~~~~

because, wherever we use those thresholds, we define the other term
of comparison as uint64_t. Define the thresholds as unsigned long long
as well, to make sure we match types.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-08 08:24:49 +01:00
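A sketch of the type fix described in the entry above. The two thresholds and the scaling expression are taken from the diagnostics quoted there. The sndbuf_scale() wrapper is a hypothetical name, and halving the value at or above SNDBUF_BIG is an assumption, as that branch is not shown in the log.

```c
#include <assert.h>
#include <stdint.h>

#define SNDBUF_BIG	(4ULL * 1024 * 1024)
#define SNDBUF_SMALL	(128ULL * 1024)

/* The thresholds now are at least as wide as the uint64_t they are
 * compared against, on 32-bit and 64-bit targets alike.
 */
static_assert(sizeof(SNDBUF_BIG) >= sizeof(uint64_t), "threshold too narrow");

/* Hypothetical helper: scale down a reported buffer size. What happens
 * in the first branch is assumed, not quoted from the source.
 */
static uint64_t sndbuf_scale(uint64_t v)
{
	if (v >= SNDBUF_BIG)
		v /= 2;
	else if (v > SNDBUF_SMALL)
		v -= v * (v - SNDBUF_SMALL) / (SNDBUF_BIG - SNDBUF_SMALL) / 2;

	return v;
}
```

With the ULL suffix, every multiplication in the macros is already performed in a 64-bit type, so there is no widening left for clang-tidy to report.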
Stefano Brivio
87940f9aa7 tap: Cast TAP_BUF_BYTES - ETH_MAX_MTU to ssize_t, not TAP_BUF_BYTES
Given that we're comparing against 'n', which is signed, we cast
TAP_BUF_BYTES to ssize_t so that the maximum buffer usage, calculated
as the difference between TAP_BUF_BYTES and ETH_MAX_MTU, will also be
signed.

This doesn't necessarily happen on 32-bit architectures, though. On
armhf and i686, clang-tidy 18.1.8 and 19.1.2 report:

/home/pi/passt/tap.c:1087:16: error: comparison of integers of different signs: 'ssize_t' (aka 'int') and 'unsigned int' [clang-diagnostic-sign-compare,-warnings-as-errors]
 1087 |         for (n = 0; n <= (ssize_t)TAP_BUF_BYTES - ETH_MAX_MTU; n += len) {
      |                     ~ ^  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

cast the whole difference to ssize_t, as we know it's going to be
positive anyway, instead of relying on that side effect.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-08 08:24:45 +01:00
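The cast placement from the entry above, as a self-contained sketch. The buffer size is a hypothetical value picked only so that the difference is positive, ETH_MAX_MTU is assumed to be 65535, and tap_reads_that_fit() is an illustrative name, not a function from tap.c.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical values, chosen only so that the difference is positive */
#define TAP_BUF_BYTES	((size_t)256 * 1024)
#define ETH_MAX_MTU	65535U

/* Count how many reads of len bytes each could start before the buffer
 * might overflow. Casting the whole difference keeps both sides of the
 * comparison signed, whatever the width of size_t.
 */
static int tap_reads_that_fit(ssize_t len)
{
	ssize_t n;
	int count = 0;

	assert(len > 0);	/* Otherwise the loop never ends */

	for (n = 0; n <= (ssize_t)(TAP_BUF_BYTES - ETH_MAX_MTU); n += len)
		count++;

	return count;
}
```

With only TAP_BUF_BYTES cast, the subtraction of an unsigned int from a 32-bit ssize_t is carried out as unsigned, which is what triggers the warning on armhf and i686.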
Stefano Brivio
1feb90fe62 dhcpv6: Turn some option headers pointers to const
cppcheck 2.14.2 on Alpine reports:

dhcpv6.c:431:32: style: Variable 'client_id' can be declared as pointer to const [constVariablePointer]
 struct opt_hdr *ia, *bad_ia, *client_id;
                               ^

It's not only 'client_id': we can declare 'ia' as pointer to const too.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-08 08:24:41 +01:00
Stefano Brivio
5f5e814cfc dhcpv6: Use for loop instead of goto to avoid false positive cppcheck warning
cppcheck 2.16.0 reports:

dhcpv6.c:334:14: style: The comparison 'ia_type == 3' is always true. [knownConditionTrueFalse]
 if (ia_type == OPT_IA_NA) {
             ^
dhcpv6.c:306:12: note: 'ia_type' is assigned value '3' here.
 ia_type = OPT_IA_NA;
           ^
dhcpv6.c:334:14: note: The comparison 'ia_type == 3' is always true.
 if (ia_type == OPT_IA_NA) {
             ^

this is not really the case as we set ia_type to OPT_IA_TA and then
jump back.

Anyway, there's no particular reason to use a goto here: add a trivial
foreach() macro to go through elements of an array and use it instead.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-11-08 08:24:11 +01:00
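One possible shape for the foreach() macro mentioned in the entry above. This is a guess at the idea rather than the macro from the commit: ARRAY_SIZE(), the ia_types array and ia_types_visit() are all hypothetical. OPT_IA_NA is 3 as shown in the cppcheck output, and 4 is the IA_TA option code from RFC 8415.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))

/* Hypothetical macro: walk item over every element of a real array */
#define foreach(item, array)						\
	for ((item) = (array); (item) < (array) + ARRAY_SIZE(array); (item)++)

#define OPT_IA_NA	3	/* Value shown in the cppcheck output */
#define OPT_IA_TA	4	/* IA_TA option code from RFC 8415 */

static const uint16_t ia_types[] = { OPT_IA_NA, OPT_IA_TA };

/* Copy each IA type to out in order, the way a loop body would handle
 * them one by one instead of jumping back with a goto. Returns the
 * number of types visited.
 */
static size_t ia_types_visit(uint16_t *out, size_t out_len)
{
	const uint16_t *type;
	size_t n = 0;

	assert(out_len >= ARRAY_SIZE(ia_types));

	foreach(type, ia_types)
		out[n++] = *type;

	return n;
}
```

Because the loop variable visibly takes more than one value, a static analyser has no reason to conclude that a comparison against OPT_IA_NA is always true.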
Jon Maloy
78da088f7b tcp: unify payload and flags l2 frames array
In order to reduce static memory and code footprint, we merge
the array for l2 flag frames into the one for payload frames.

This change also ensures that no flag message will be sent out
over the l2 media bypassing already queued payload messages.

Performance measurements with iperf3, where we force all
traffic via the tap queue, show no significant difference:

Dual traffic both directions simultaneously, with patch:
========================================================
host->ns:
--------
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-100.00 sec  36.3 GBytes  3.12 Gbits/sec  4759       sender
[  5]   0.00-100.04 sec  36.3 GBytes  3.11 Gbits/sec             receiver

ns->host:
---------
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-100.00 sec   321 GBytes  27.6 Gbits/sec            receiver

Dual traffic both directions simultaneously, without patch:
============================================================
host->ns:
--------
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-100.00 sec  35.0 GBytes  3.01 Gbits/sec  6001       sender
[  5]   0.00-100.04 sec  34.8 GBytes  2.99 Gbits/sec            receiver

ns->host
--------
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-100.00 sec   345 GBytes  29.6 Gbits/sec            receiver

Single connection, with patch:
==============================
host->ns:
---------
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-100.00 sec   138 GBytes  11.8 Gbits/sec  922       sender
[  5]   0.00-100.04 sec   138 GBytes  11.8 Gbits/sec            receiver

ns->host:
-----------
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-100.00 sec   430 GBytes  36.9 Gbits/sec            receiver

Single connection, without patch:
=================================
host->ns:
------------
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-100.00 sec   139 GBytes  11.9 Gbits/sec  900       sender
[  5]   0.00-100.04 sec   139 GBytes  11.9 Gbits/sec            receiver

ns->host:
---------
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-100.00 sec   440 GBytes  37.8 Gbits/sec            receiver

Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:41 +01:00
David Gibson
9a0e544f05 test: Improve test for NDP assigned prefix
In the NDP tests we search explicitly for a guest address with prefix
length 64.  AFAICT this is an attempt to specifically find the SLAAC
assigned address, rather than something assigned by other means.  We can do
that more explicitly by checking for .protocol == "kernel_ra", however.

The SLAAC prefixes we assigned *will* always be 64-bit, that's hard-coded
into our NDP implementation.  RFC4862 doesn't really allow anything else
since the interface identifiers for an Ethernet-like link are 64-bits.

Let's actually verify that, rather than just assuming it, by extracting the
prefix length assigned in the guest and checking it as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:37 +01:00
David Gibson
910f4f9103 test: Don't require 64-bit prefixes in perf tests
When determining the namespace's IPv6 address in the perf test setup, we
explicitly filter for addresses with a 64-bit prefix length.  There's no
real reason we need that - as long as it's a global address we can use it.
I suspect this was copied without thinking from a similar example in the
NDP tests, where the 64-bit prefix length _is_ meaningful (though it's not
entirely clear if the handling is correct there either).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:34 +01:00
David Gibson
1699083f29 test: Make nstool hold robust against interruptions to control clients
Currently nstool die()s on essentially any error.  In most cases that's
fine for our purposes.  However, it's a problem when in "hold" mode and
getting an IO error on an accept()ed socket.  This could just indicate that
the control client aborted prematurely, in which case we don't want to
kill off the namespace we're holding.

Adjust these to print an error, close() the control client socket and
carry on.  In addition, we need to explicitly ignore SIGPIPE in order not
to be killed by an abruptly closed client connection.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:30 +01:00
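The behaviour described in the entry above (report the error, close the client socket, carry on, with SIGPIPE ignored) might be sketched like this. Both function names are hypothetical, and a partial write is treated as success to keep the example short.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* With SIGPIPE ignored, writing to a closed peer fails with EPIPE
 * instead of killing the process.
 */
static int ignore_sigpipe(void)
{
	return signal(SIGPIPE, SIG_IGN) == SIG_ERR ? -1 : 0;
}

/* Hypothetical helper: send len bytes to a control client. On error,
 * report it, drop the client and return -errno, so that the caller can
 * carry on serving other clients.
 */
static int client_send(int fd, const void *buf, size_t len)
{
	int err;

	assert(buf);

	if (write(fd, buf, len) >= 0)
		return 0;

	err = errno;
	fprintf(stderr, "Error writing to client: %s\n", strerror(err));
	close(fd);

	return -err;
}
```

The caller is expected to run ignore_sigpipe() once at startup, and to go back to accept() when client_send() returns an error.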
David Gibson
b456ee1b53 test: Rename propagating signal handler
nstool in "exec" mode will propagate some signals (specifically SIGTERM) to
the process in the namespace it executes.  The signal handler which
accomplishes this is called simply sig_handler().  However, it turns out
we're going to need some other signal handlers, so rename this to the more
specific sig_propagate().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:27 +01:00
David Gibson
867db07fcf util: Work around cppcheck bug 6936
While experimenting with cppcheck options, I hit several false positives
caused by this bug: https://trac.cppcheck.net/ticket/13227

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:24 +01:00
David Gibson
6f913b3af0 udp: Don't dereference uflow before NULL check in udp_reply_sock_handler()
We have an ASSERT() verifying that we're able to look up the flow in
udp_reply_sock_handler().  However, we dereference uflow before that in
an initializer, rather defeating the point.  Rearrange to avoid that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:22 +01:00
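The pattern fixed in the entry above, reduced to a sketch. The structure and function are hypothetical stand-ins, and the standard assert() is used in place of passt's ASSERT().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the real flow structure */
struct uflow_sketch {
	int s;
};

/* Return the socket of a flow that must exist */
static int uflow_sock(const struct uflow_sketch *uflow)
{
	int s;		/* Not "int s = uflow->s;": that would dereference
			 * uflow before the check below can catch NULL
			 */

	assert(uflow);
	s = uflow->s;

	return s;
}
```

Moving the dereference out of the initialiser costs one extra line and makes the assertion the first thing that can fail.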
David Gibson
d8e05a3fe0 ndp: Use const pointer for ndp_ns packet
We don't modify this structure at all.  For some reason cppcheck doesn't
catch this with our current options, but did when I was experimenting with
some different options.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:19 +01:00
David Gibson
0d7b8201ed linux_dep: Generalise tcp_info.h to handling Linux extension compatibility
tcp_info.h exists just to contain a modern enough version of struct
tcp_info for our needs, removing compile time dependency on the version of
kernel headers.  There are several other cases where we can remove similar
compile time dependencies on kernel version.  Prepare for that by renaming
tcp_info.h to linux_dep.h.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:16 +01:00
David Gibson
c5f4e4d146 fwd: Squash different-signedness comparison warning
On certain architectures we get a warning about comparison between
different signedness integers in fwd_probe_ephemeral().  This is because
NUM_PORTS evaluates to an unsigned integer.  It's a fixed value, though,
and we know it will fit in a signed long on anything reasonable, so add
a cast to suppress the warning.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:14 +01:00
David Gibson
1e76a19895 util: Remove unused ffsl() function
We supply a weak alias for ffsl() in case it's not defined in our libc.
Except... we don't have any users for it any more, so remove it.

make cppcheck doesn't spot this at present for complicated reasons, but it
might with tweaks to the options I'm experimenting with.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:11 +01:00
David Gibson
1d7cff3779 clang: Add rudimentary clangd configuration
clangd's default configuration seems to try to treat .h files as C++ not
C.  There are many more spurious warnings generated at present, but this
removes some of the most egregious ones.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:07 +01:00
David Gibson
c560e2f65b Makefile: Don't attempt to auto-detect stack size
We probe the available stack limit in the Makefile using rlimit, then use
that to set the size of the stack when we clone() extra threads.  But
the rlimit at compile time need not be the same as the rlimit at runtime,
so that's not particularly sensible.

Ideally, we'd set the stack size based on an estimate of the actual
maximum stack usage of all our clone()ed functions.  We don't have that
at the moment, but to keep things simple just set it to 1MiB - that's what
the current probe will set things to on my default configuration Fedora 40,
so it's likely to be fine in most cases.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:47:03 +01:00
David Gibson
13fc6d511e Makefile: Use -DARCH for qrap only
We insert -DARCH for all compiles, based on TARGET_ARCH determined in the
Makefile.  However, this is only used in qrap.c, not anywhere else in
passt or pasta.  Only supply this -D when compiling qrap specifically.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:59 +01:00
David Gibson
7917159005 seccomp: Simplify handling of AUDIT_ARCH
Currently we construct the AUDIT_ARCH variable in the Makefile, then pass
it into the C code with -D.  The only place that uses it, though, is the
BPF filter generated by seccomp.sh.  seccomp.sh already needs to do things
differently depending on the arch, so it might as well just insert the
expanded AUDIT_ARCH directly into the generated code, rather than using
a #define.  Arguably this is better, even, since it ensures more locally
that the arch the BPF checks for matches the arch seccomp.sh built the
filter for.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:55 +01:00
David Gibson
93bce404c1 Makefile: Move NETNS_RUN_DIR definition to C code
NETNS_RUN_DIR is set in the Makefile, then passed into the C code with
-D.  But NETNS_RUN_DIR is just a fixed string; it doesn't depend on any
make probes or variables, so there's really no reason to handle it via the
Makefile.  Just move it to a plain #define in conf.c.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:52 +01:00
David Gibson
c938d8a93e netlink: RTA_PAYLOAD() returns int, not size_t
Since it's the size of a chunk of memory it would seem logical that
RTA_PAYLOAD() returns size_t.  However, it doesn't - it explicitly casts
its result to an int.  RTNH_OK(), which often takes the result of
RTA_PAYLOAD() as a parameter, compares it to an int, so using size_t can
result in signed/unsigned comparison warnings from clang.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:48 +01:00
David Gibson
f6b546c6e4 flow: Correct type of flowside_at_sidx()
Due to a copy-pasta error, this returns 'PIF_NONE' instead of NULL on the
failure case.  PIF_NONE expands to 0, which turns into NULL, but it's
still confusing, so fix it.  This removes a clang warning.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:44 +01:00
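
To illustrate why the flowside_at_sidx() mistake above still compiled: in C,
an enumeration constant with value 0 is an integer constant expression, so it
counts as a null pointer constant.  The sketch below uses hypothetical
stand-ins (enum pif_type, struct flowside, side_table, lookup_side()), not the
real passt definitions.

```c
#include <stddef.h>

/* Hypothetical stand-ins, not the real passt types */
enum pif_type { PIF_NONE = 0, PIF_HOST };
struct flowside { int dummy; };

#define SIDE_COUNT	4

static struct flowside side_table[SIDE_COUNT];

/* Return the entry at @idx, or NULL if @idx is out of range */
static const struct flowside *lookup_side(unsigned idx)
{
	if (idx >= SIDE_COUNT) {
		/* 'return PIF_NONE;' would also compile here, because
		 * PIF_NONE is a constant 0, but it reads as the wrong
		 * type and compilers may warn about it.
		 */
		return NULL;
	}

	return &side_table[idx];
}
```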
David Gibson
30b4f88167 arch: Avoid explicit access to 'environ'
We pass 'environ' to execve() in arch_avx2_exec(), so that we retain the
environment in the current process.  But the declaration of 'environ' is
a bit weird - it doesn't seem to be in a standard header, requiring a
manual explicit declaration.  But, we can avoid needing to reference it
explicitly by using execv() instead of execve().  This removes a clang
warning.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:29 +01:00
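
A rough sketch of the execv() approach from the commit above.  POSIX
specifies that execv() takes the new image's environment from the caller's
'environ', so no explicit declaration is needed.  The helper name
reexec_keep_env() is hypothetical and does not appear in passt.

```c
#include <unistd.h>

/* Replace the current process image with @path, passing @argv and
 * implicitly keeping the current environment.  No 'extern char **environ;'
 * declaration is needed, unlike with execve(path, argv, environ).
 *
 * Only returns if the exec fails, in which case it returns -1 with errno
 * set.
 */
static int reexec_keep_env(const char *path, char *const argv[])
{
	return execv(path, argv);
}
```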
David Gibson
b78e72da0b clang: Move clang-tidy configuration from Makefile to .clang-tidy
Currently we configure clang-tidy with a very long command line spelled out
in the Makefile (mostly a big list of lints to disable).  Move it from here
into a .clang-tidy configuration file, so that the config is accessible if
clang-tidy is invoked in other ways (e.g. via clangd) as well.  As a bonus
this also means that we can move the bulky comments about why we're
suppressing various tests inline with the relevant config lines.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:19 +01:00
David Gibson
8346216c9a Makefile: Simplify exclusion of qrap from static checks
There are things in qrap.c that clang-tidy complains about that aren't
worth fixing.  So, we currently exclude it using $(filter-out).  However,
we already have a make variable which has just the passt sources, excluding
qrap, so we can use that instead of the awkward filter-out expression.

Currently, we still include qrap.c for cppcheck, but there's not much
point doing so: it's, well, qrap, so we don't care that much about lints.
Exclude it from cppcheck as well, for consistency.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-07 12:46:07 +01:00