These two functions scan all the non-splced and spliced connections
respectively and perform timed updates on them. Avoid scanning the now
unified table twice, by having tcp_timer scan it once calling the
relevant per-connection function for each one.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
These two functions each step through non-spliced and spliced connections
respectively and clean up entries for closed connections. To avoid
scanning the connection table twice, we merge these into a single function
which scans the unified table and performs the appropriate sort of cleanup
action on each one.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently spliced and non-spliced connections are stored in completely
separate tables, so there are completely independent limits on the number
of spliced and non-spliced connections. This is a bit counter-intuitive.
More importantly, the fact that the tables are separate prevents us from
unifying some other logic between the two cases. So, merge these two
tables into one, using the 'c.spliced' common field to distinguish between
them when necessary.
For now we keep a common limit of 128k connections, whether they're spliced
or non-spliced, which means we save memory overall. If necessary we could
increase this to a 256k or higher total, which would cost memory but give
some more flexibility.
For now, the code paths which need to step through all extant connections
are still separate for the two cases, just skipping over entries which
aren't for them. We'll improve that in later patches.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When we compact the connection tables (both spliced and non-spliced) we
need to move entries from one slot to another. That requires some updates
in the entries themselves. Add helpers to make all the necessary updates
for the spliced and non-spliced cases. This will simplify later cleanups.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently, the tables for spliced and non-spliced connections are entirely
separate, with different types in different arrays. We want to unify them.
As a first step, create a union type which can represent either a spliced
or non-spliced connection. For them to be distinguishable, the individual
types need to have a common header added, with a bit indicating which type
this structure is.
This comes at the cost of increasing the size of tcp_tap_conn to over one
(64 byte) cacheline. This isn't ideal, but it makes things simpler for now
and we'll re-optimize this later.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently spliced and non-spliced connections use completely independent
tracking structures. We want to unify these, so as a preliminary step move
the definitions for both variants into a new tcp_conn.h header, shared by
tcp.c and tcp_splice.c.
This requires renaming some #defines with the same name but different
meanings between the two cases. In the process we correct some places that
are slightly out of sync between the comments and the code for various
event bit names.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Like we already have for non-spliced connections, create a CONN_IDX()
macro for looking up the index of spliced connection structures. Change
the name of the array of spliced connections to be different from that for
non-spliced connections (even though they're in different modules). This
will make subsequent changes a bit safer.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The macro CONN_OR_NULL() is used to look up connections by index with
bounds checking. Replace it with an inline function, which means:
- Better type checking
- No danger of multiple evaluation of an @index with side effects
Also add a helper to perform the reverse translation: from connection
pointer to index. Introduce a macro for this which will make later
cleanups easier and safer.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Presumably it meant something in the past, but it's no longer used.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This obvious include was omitted, which means that declarations in the
header weren't checked against definitions in the .c file. This shows up
an old declaration for a function that is now static, and a duplicate
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
clang-tools 15.0.0 appears to have added a new warning that will always
complain about assignments in if statements, which we use in a number of
places in passt/pasta. Encountered on Fedora 37 with
clang-tools-extra-15.0.0-3.fc37.x86_64.
Suppress the new warning so that we can compile and test.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Spotted in Debian's buildd logs: on ia64, clone(2) is not available:
the glibc wrapper is named __clone2() and it takes, additionally,
the size of the stack area passed by the caller.
Add a do_clone() wrapper handling the different cases, and also
taking care of pointing the child's stack in the middle of the
allocated area: on PA-RISC (hppa), handled by clone(), the stack
grows up, and on ia64 the stack grows down, but the register backing
store grows up -- and I think it might be actually used here.
Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
...instead of doing it after the test. Now that we have pre-built
guest images, we might also have old JSON files from previous,
interrupted test runs.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This reverts commit 198f87835dc4 ("tap: Return -EIO from
tap_handler_passt() on inconsistent packet stream") and commit
510dace86ccf ("tap: Keep stream consistent if qemu length descriptor
spans two recv() calls").
I can hit occasional failures in perf/passt_tcp tests where we seem
to be getting excess data at the end of a recv(), and for some reason
I couldn't figure out yet, if we just ignore it, subsequent recv()
calls from qemu return correct data. If we close the connection, qemu
can't talk to us anymore, of course.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
If stderr is closed, after we fork to background, glibc's
implementation of perror() will try to re-open it by calling dup(),
upon which the seccomp filter causes the process to terminate,
because dup() is not included in the list of allowed syscalls.
Replace perror() calls that might happen after isolation_postfork().
We could probably replace all of them, but early ones need a bit more
attention as we have to check whether log.c functions work in early
stages.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
The development of the Debian package is now at:
https://salsa.debian.org/sbrivio/passt
Drop contrib/debian, it's finally obsolete.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
AppArmor resolves executable links before profile attachment rules
are evaluated, so, as long as pasta is installed as a link to passt,
there's no way to differentiate the two cases. Merge the two profiles
and leave a TODO note behind, explaining two possible ways forward.
Update the rules so that passt and pasta are actually usable, once
the profile is installed. Most required changes are related to
isolation and sandboxing features.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The AUDIT_ARCH defines in seccomp.h corresponding to HPPA are
AUDIT_ARCH_PARISC and AUDIT_ARCH_PARISC64.
Build error spotted in Debian's buildd log on
phantom.physik.fu-berlin.de.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
On mips64el, gcc -dumpmachine correctly reports mips64el as
architecture prefix, but for some reason seccomp.h defines
AUDIT_ARCH_MIPSEL64 and not AUDIT_ARCH_MIPS64EL. Mangle AUDIT_ARCH
accordingly.
Build error spotted in Debian's buildd logs from Loongson build.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Drop it from the internal FLAGS variable, but honour -O2 if passed in
CFLAGS. In Debian packages, dpkg-buildflags uses it as hardening
flag, and we get a QA warning if we drop it:
https://qa.debian.org/bls/bytag/W-dpkg-buildflags-missing.html
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
CPPFLAGS allow the user to pass pre-processor flags. This is unlikely
to be needed at the moment, but the Debian Hardening Walkthrough
reasonably requests it to be handled in order to fully support
hardened build flags:
https://wiki.debian.org/HardeningWalkthrough#Handling_dpkg-buildflags_in_your_upstream_build_system
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Given that we use just the first valid DNS resolver address
configured, or read from resolv.conf(5) on the host, to forward DNS
queries to, in case --dns-forward is used, we don't need to duplicate
dns[] to dns_send[]:
- rename dns_send[] back to dns[]: those are the resolvers we
advertise to the guest/container
- for forwarding purposes, instead of dns[], use a single field (for
each protocol version): dns_host
- and rename dns_fwd to dns_match, so that it's clear this is the
address we are matching DNS queries against, to decide if they need
to be forwarded
Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
If we disable a given IP version automatically (no corresponding
default route on host) or administratively (--ipv4-only or
--ipv6-only options), we don't initialise related buffers and
services (DHCP for IPv4, NDP and DHCPv6 for IPv6). The "tap"
handlers will also ignore packets with a disabled IP version.
However, in commit 3c6ae625101a ("conf, tcp, udp: Allow address
specification for forwarded ports") I happily changed socket
initialisation functions to take AF_UNSPEC meaning "any enabled
IP version", but I forgot to add checks back for the "enabled"
part.
Reported by Paul: on a host without default IPv6 route, but IPv6
enabled, connect, using IPv6, to a port handled by pasta, which
tries to send data to a tap device without initialised buffers
for that IP version and exits because the resulting write() fails.
Simpler way to reproduce: pasta -6 and inbound IPv4 connection, or
pasta -4 and inbound IPv6 connection.
Reported-by: Paul Holzinger <pholzing@redhat.com>
Fixes: 3c6ae625101a ("conf, tcp, udp: Allow address specification for forwarded ports")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
...so that we avoid printing some lines twice because log-level is
still set to LOG_EMERG, as if logging configuration didn't happen
yet.
While at it, note that logging to stderr doesn't really depend on
whether debug mode is enabled or not.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
While it's important to fail in that case, it makes little sense to
fail quietly: it's better to tell qemu explicitly that something went
wrong and that we won't recover, by closing the socket.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
I got all paranoid after triggering a divide-by-zero general
protection fault in passt with a qemu version without the virtio_net
TX hang fix, while flooding UDP. I start thinking this was actually
coming from some random changes I was playing with, but before
reaching this conclusion I reviewed once more the relatively short
path in tap_handler_passt() before we start using packet_*()
functions, and found this.
Never observed in practice, but artificially reproduced with changes
in qemu's socket interface: if we don't receive from qemu a complete
length descriptor in one recv() call, or if we receive a partial one
at the end of one call, we currently disregard the rest, which would
make the stream inconsistent.
Nothing really bad happens, except that from that point on we would
disregard all the packets we get until, if ever, we get the stream
back in sync by chance.
Force reading a complete packet length descriptor with a blocking
recv(), if needed -- not just a complete packet later.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Now that we install the binary in /bin, and we have a link from
/usr/bin, change the path in the test itself as well. Otherwise
it works with bash but not with dash for some reason.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We can't get rid of qrap quite yet, but at least we should start
telling users it's not going to be needed anymore starting from qemu
7.2.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Now that we require 13c6be96618c ("net: stream: add unix socket")
in qemu to run the tests, we can also assume that commit df8d07081718
("virtio-net: fix bottom-half packet TX on asynchronous completion")
is present, as it was merged before that one.
This fixes the issue we attempted to work around in passt TCP and
UDP performance tests: finally drop that stuff.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
qemu commit 13c6be96618c ("net: stream: add unix socket") introduces
support for native AF_UNIX support, finally making qrap useless.
We can't quite drop that yet until a qemu release includes it, and
then we'll need to wait a while for users to switch anyway, but at
least for tests, we can use that support.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
As pasta now configures that target network namespace with
--config-net, we need to wait for addresses and routes to be actually
present. Just sending netlink messages doesn't mean this is done
synchronously.
A more elegant alternative, which probably makes sense regardless of
this test setup, would be to query, from pasta, addresses and routes
we added, and wait until they're there, before proceeding.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Now that we allow loopback DNS addresses to be used as targets for
forwarding, we need to check if DNS answers come from those targets,
before deciding to eventually remap traffic for local redirects.
Otherwise, the source address won't match the one configured as
forwarder, which means that the guest or the container will refuse
those responses.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
With --dns-forward, if the host has a loopback address configured as
DNS server, we should actually use it to forward queries, but, if
--no-map-gw is passed, we shouldn't offer the same address via DHCP,
NDP and DHCPv6, because it's not going to be reachable.
Problematic configuration:
* systemd-resolved configuring the usual 127.0.0.53 on the host: we
read that from /etc/resolv.conf
* --dns-forward specified with an unrelated address, for example
198.51.100.1
We still want to forward queries to 127.0.0.53, if we receive one
directed to 198.51.100.1, so we can't drop 127.0.0.53 from our list:
we want to use it for forwarding. At the same time, we shouldn't
offer 127.0.0.53 to the guest or container either.
With this change, I'm only covering the case of automatically
configured DNS servers from /etc/resolv.conf. We could extend this to
addresses configured with command-line options, but I don't really
see a likely use case at this point.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Seen in a Google Compute Engine environment with a machine configured
via cloud-init-dhcp, while testing Podman integration for pasta: the
assigned address has a /32 netmask, and there's a default route,
which can be added on the host because there's another route, also
/32, pointing to the default gateway. For example, on the host:
ip -4 address add 10.156.0.2/32 dev eth0
ip -4 route add 10.156.0.1/32 dev eth0
ip -4 route add default via 10.156.0.1
This is not a valid configuration as far as I can tell: if the
address is configured as /32, it shouldn't be used to reach a gateway
outside its derived netmask. However, Linux allows that, and
everything works.
The problem comes when pasta --config-net sources address and default
route from the host, and it can't configure the route in the target
namespace because the gateway is invalid. That is, we would skip
configuring the first route in the example, which results in the
equivalent of doing:
ip -4 address add 10.156.0.2/32 dev eth0
ip -4 route add default via 10.156.0.1
where, at this point, 10.156.0.1 is unreachable, and hence invalid
as a gateway.
Sourcing more routes than just the default is doable, but probably
undesirable: pasta users want to provide connectivity to a container,
not reflect exactly whatever trickery is configured on the host.
Add a consistency check and an adjustment: if the configured default
gateway is not reachable, shrink the given netmask until we can reach
it.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
A number of functions describe themselves as taking a pointer to 'sin_addr
or sin6_addr'. Those are field names, not type names. Replace them with
the correct type names, in_addr or in6_addr.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We recently converted to using struct in_addr rather than bare in_addr_t
or uint32_t to represent IPv4 addresses in network order. This makes it
harder forget to apply the correct endian conversions.
We omitted the IPv4 addresses stored in struct tap4_l4_t, however. Convert
those as well.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We recently corrected some errors handling the endianness of IPv4
addresses. These are very easy errors to make since although we mostly
store them in network endianness, we sometimes need to manipulate them in
host endianness.
To reduce the chances of making such mistakes again, change to always using
a (struct in_addr) instead of a bare in_addr_t or uint32_t to store network
endian addresses. This makes it harder to accidentally do arithmetic or
comparisons on such addresses as if they were host endian.
We introduce a number of IN4_IS_ADDR_*() helpers to make it easier to
directly work with struct in_addr values. This has the additional benefit
of making the IPv4 and IPv6 paths more visually similar.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This macro checks if an IPv4 address is in the loopback network
(127.0.0.0/8). There are two places where we open code an identical check,
use the macro instead.
There are also a number of places we specifically exclude the loopback
address (127.0.0.1), but we should actually be excluding anything in the
loopback network. Change those sites to use the macro as well.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
There are several minor problems with our parsing of IPv4 netmasks (-n).
First, we don't reject nonsensical netmasks like 0.255.0.255. Address this
structurally by using prefix length instead of netmask as the primary
variable, only converting (and validating) when we need to. This has the
added benefit of making some things more uniform with the IPv6 path.
Second, when the user specifies a prefix length, we truncate the output
from strtol() to an integer, which means we would treat -n 4294967320 as
valid (equivalent to 24). Fix types to check for this.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The INADDR_LOOPBACK constant is in host endianness, and similarly the
IN_MULTICAST macro expects a host endian address. However, there are some
places in passt where we use those with network endian values. This means
that passt will incorrectly allow you to set 127.0.0.1 or a multicast
address as the guest address or DNS forwarding address. Add the necessary
conversions to correct this.
INADDR_ANY and INADDR_BROADCAST logically behave the same way, although
because they're palindromes it doesn't have an effect in practice. Change
them to be logically correct while we're there, though.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
These show a summary of memory usage in kernel and userspace with
different port forwarding configurations, details of userspace usage
using 'nm' (passt only uses statically allocated memory), and details
of kernel memory from slab reporting facilities.
This adds a new test image, mbuto.mem.img, with harcoded IPv4 and
IPv6 addresses and routes, and just the tools we need to start and
stop passt, to report from /proc/slabinfo, /proc/meminfo, and to
print and parse symbol sizes using nm(1).
passt can't pivot_root() for sandboxing purposes on ramfs, so we need
to create another filesystem and chroot into it, first.
We don't want to use pane context functions, as we're checking memory
usage for sockets: resort to screen-scraping.
Configure a dummy interface to provide passt with an appearance of
working IPv4 and IPv6 connectivity, contributed by David.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
This can be used for generic cell values with an arbitrary scale.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Instead of just disabling performance reports if running in demo
mode. This allows us to use table functions outside of performance
reports.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
On ramfs, connecting to a non-existent UNIX domain socket yields
EACCESS, instead of ENOENT. This is visible if we use passt directly
on rootfs (a ramfs instance) from an initramfs image.
It's probably wrong for ramfs to return EACCES, but given the
simplicity of the filesystem, I doubt we should try to fix it there
at the possible cost of added complexity.
Also, this whole beauty should go away once qrap-less usage is
established, so just accept EACCES as indication that a conflicting
socket does not, in fact, exist.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>