1
0
mirror of https://passt.top/passt synced 2025-01-05 12:25:23 +00:00
Commit Graph

1302 Commits

Author SHA1 Message Date
David Gibson
515db1ecc4 tcp_splice: Simplify selection of socket and pipe sides in socket handler
tcp_splice_sock_handler() uses the tcp_splice_dir() helper to select
which of the socket, pipe and counter fields to use depending on which
side of the connection the socket event is coming from.

Now that we are using arrays for the two sides, rather than separate named
fields, we can instead just use a variable indicating the side and use
that to index the arrays whever we need a particular side's field.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:54:21 +01:00
David Gibson
7486cd13af tcp_splice: Exploit side symmetry in tcp_splice_destroy()
tcp_splice_destroy() has some close-to-duplicated logic handling closing of
the socket and pipes for each side of the connection.  We can use a loop
across the sides to reduce the duplication.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:54:19 +01:00
David Gibson
69db3b3b29 tcp_splice: Exploit side symmetry in tcp_splice_connect_finish()
tcp_splice_connect_finish() has two very similar blocks opening the two
pipes for each direction of the connection.  We can deduplicate this with
a loop across the two sides.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:54:15 +01:00
David Gibson
1b76257147 tcp_splice: Exploit side symmetry in tcp_splice_timer()
tcp_splice_timer() has two very similar blocks one after another that
handle the SO_RCVLOWAT flags for the two sides of the connection.  We can
deduplicate this with a loop across the two sides.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:54:13 +01:00
David Gibson
8545058fbe tcp_splice: Rename sides of connection from a/b to 0/1
Each spliced connection has two mostly, although not entirely, symmetric
sides.  We currently call those "a" and "b" and have different fields in
the connection structure for each one.

We can better exploit that symmetry if we use two element arrays rather
thatn separately named fields.  Do that in the places we can, and for the
others change the "a"/"b" terminology to 0/1 to match.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:54:10 +01:00
David Gibson
0e8e534850 tcp_splice: Don't pool pipes in pairs
To reduce latencies, the tcp splice code maintains a pool of pre-opened
pipes to use for new connections.  This is structured as an array of pairs
of pipes, with each pipe, of course, being a pair of fds.  Thus when we
use the pool, a single pool "slot" provides both the a->b and b->a pipes.

There's no strong reason to store the pool in pairs, though - we can
with not much difficulty instead take the a->b and b->a pipes for a new
connection independently from separate slots in the pool, or even take one
from the the pool and create the other as we need it, if there's only one
pipe left in the pool.

This marginally increases the length of code, but simplifies the structure
of the pipe pool.  We should be able to re-shrink the code with later
changes, too.

In the process we also fix some minor bugs:
- If we both failed to find a pipe in the pool and to create a new one, we
  didn't log an error and would silently drop the connection.  That could
  make debugging such a situation difficult.  Add in an error message for
  that case
- When refilling the pool, if we were only able to open a single pipe in
  the pair, we attempted to rollback, but instead of closing the opened
  pipe, we instead closed the pipe we failed to open (probably leading to
  some ignored EBADFD errors).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:54:06 +01:00
David Gibson
6357010cab tcp_splice: Avoid awkward temporaries in tcp_splice_epoll_ctl()
We initialise the events_a and events_b variables with
tcp_splice_conn_epoll_events() function, then immediately copy the values
into ev_a.events and ev_b.events.  We can't simply pass &ev_[ab].events to
tcp_splice_conn_epoll_events(), because struct epoll_event is packed,
leading to 'pointer may be unaligned' warnings if we attempt that.

We can, however, make tcp_splice_conn_epoll_events() take struct
epoll_event pointers rather than raw u32 pointers, avoiding the awkward
temporaries.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:54:03 +01:00
David Gibson
409d3ca87f tcp_splice: Remove unnecessary forward declaration
In tcp_splice.c we forward declare tcp_splice_epoll_ctl() then define it
later on.  However, there are no circular dependencies which prevent us
from simply having the full definition in place of the forward declaration.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:54:00 +01:00
David Gibson
5a79ba6272 tcp_splice: Don't handle EPOLL_CTL_DEL as part of tcp_splice_epoll_ctl()
tcp_splice_epoll_ctl() removes both sockets from the epoll set if called
when conn->flags & CLOSING.  This will always happen immediately after
setting that flag, since conn_flag_do() makes the call itself.  That's also
the _only_ time it can happen: we perform the EPOLL_CTL_DEL without
clearing the conn->in_epoll flag, meaning that any further calls to
tcp_splice_epoll_ctl() would attempt EPOLL_CTL_MOD, which would necessarily
fail since the fds are no longer in the epoll.

The EPOLL_CTL_DEL path in tcp_splice_epoll_ctl() has essentially zero
overlap with anything else the function does, so just move them to be
open coded in conn_flag_do().

This does require kernel 2.6.9 or later, in order to pass NULL as the
event structure for epoll_ctl().  However, we already require at least
3.13 to allow unprivileged user namespaces.

Given that, simply directly perform the EPOLL_CTL_DEL operations from
conn_flag_do() rather than unnecessarily multiplexini

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:53:58 +01:00
David Gibson
536acab2de tcp_splice: Correct error handling in tcp_splice_epoll_ctl()
If we get an error from epoll_ctl() in tcp_splice_epoll_ctl() we goto the
'delete' path where we remove both sockets from the epoll set and return
an error.  There are several problems with this:

- We 'return -errno' after the EPOLL_CTL_DEL operations, which means the
  deleting epoll_ctl() calls may have overwritten the errno values which
  actually triggered the failures.

- The call from conn_flag_do() occurs when the CLOSING flag is set, in
  which case we go do the delete path regardless of error.  In that case
  the 'return errno' is meaningless since we don't expect the EPOLL_CTL_DEL
  operations to fail and we ignore the return code anyway.

- All other calls to tcp_splice_epoll_ctl() check the return code and if
  non-zero immediately call conn_flag(..., CLOSING) which will call
  tcp_splice_epoll_ctl() again explicitly to remove the sockets from epoll.
  That means removing them when the error first occurs is redundant.

- We never specifically report an error on the epoll_ctl() operations.  We
  just set the connection to CLOSING, more or less silently killing it.
  This could make debugging difficult in the unlikely even that we get a
  failure here.

Re-organise tcp_splice_epoll_ctl() to just log a message then return in the
error case, and only EPOLL_CTL_DEL when explicitly asked to with the
CLOSING flag.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:53:54 +01:00
David Gibson
d33cbc600e tcp_splice: Remove redundant tcp_splice_epoll_ctl()
tcp_splice_conn_update() calls tcp_splice_epoll_ctl() twice: first ignoring
the return value, then checking it.  This serves no purpose. If the first
call succeeds, the second call will do exactly the same thing again, since
nothing has changed in conn.  If the first call fails, then
tcp_splice_epoll_ctl() itself will EPOLL_CTL_DEL both fds, meaning when
the second call tries to EPOLL_CTL_MOD them it will necessarily fail.

It appears that this duplication was introduced by accident in an
otherwise unrelated patch.

Fixes: bb708111 ("treewide: Packet abstraction with mandatory boundary checks")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:53:46 +01:00
David Gibson
f6d8dc2355 pif: Pass originating pif to tap handler functions
For now, packets passed to the various *_tap_handler() functions always
come from the single "tap" interface.  We want to allow the possibility to
broaden that in future.  As preparation for that, have the code in tap.c
pass the pif id of the originating interface to each of those handler
functions.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:53:45 +01:00
David Gibson
732e249376 pif: Record originating pif in listening socket refs
For certain socket types, we record in the epoll ref whether they're
sockets in the namespace, or on the host.  We now have the notion of "pif"
to indicate what "place" a socket is associated with, so generalise the
simple one-bit 'ns' to a pif id.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:53:41 +01:00
David Gibson
125c5e52a5 pif: Introduce notion of passt/pasta interface
We have several possible ways of communicating with other entities.  We use
sockets to communicate with the host and other network sites, but also in
a different context to communicate "spliced" channels to a namespace.  We
also use a tuntap device or a qemu socket to communicate with the namespace
or guest.

For the time being these are just defined implicitly by how we structure
things.  However, there are other communication channels we want to use in
future (e.g. virtio-user), and we want to allow more flexible forwarding
between those.  To accomplish that we're going to want a specific way of
referring to those channels.

Introduce the concept of a "passt/pasta interface" or "pif" representing a
specific channel to communicate network data.  Each pif is assumed to be
associated with a specific network namespace in the broad sense (that is
as a place where IP addresses have a consistent meaning - not the Linux
specific sense).  But there could be multiple pifs communicating with the
same namespace (e.g. the spliced and tap interfaces in pasta).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:53:38 +01:00
David Gibson
0d73fa2225 udp: Clean up ref initialisation in udp_sock_init()
udp_sock_init() has a number of paths that initialise uref differently.
However some of the fields are initialised the same way in all of them.
Move those fields into the original initialiser to save a few lines.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:53:35 +01:00
David Gibson
c09d0d0f60 port_fwd: Simplify get_bound_ports_*() to port_fwd_scan_*()
get_bound_ports_*() now only use their context and ns parameters to
determine which forwarding maps they're operating on.  Each function needs
the map they're actually updating, as well as the map for the other
direction, to avoid creating forwarding loops.  The UDP function also
requires the corresponding TCP map, to implement the behaviour where we
forward UDP ports of the same number as bound TCP ports for tools like
iperf3.

Passing those maps directly as parameters simplifies the code without
making the callers life harder, because those already know the relevant
maps.  IMO, invoking these functions in terms of where they're looking for
updated forwarding also makes more logical sense than in terms of where
they're looking for bound ports.  Given that new way of looking at the
functions, also rename them to port_fwd_scan_*().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:53:31 +01:00
David Gibson
dcf5c0eb1e port_fwd: Move port scanning /proc fds into struct port_fwd
Currently we store /proc/net fds used to implement automatic port
forwarding in the proc_net_{tcp,udp} fields of the main context structure.
However, in fact each of those is associated with a particular direction
of forwarding, and we already have struct port_fwd which collects all
other information related to a particular direction of port forwarding.

We can simplify things a bit by moving the /proc fds into struct port_fwd.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:53:29 +01:00
David Gibson
1a40d00895 port_fwd: Split TCP and UDP cases for get_bound_ports()
Currently get_bound_ports() takes a parameter to determine if it scans for
UDP or TCP bound ports, but in fact there's almost nothing in common
between those two paths.  The parameter appears primarily to have been
a convenience for when we needed to invoke this function via NS_CALL().

Now that we don't need that, split it into separate TCP and UDP versions.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:53:26 +01:00
David Gibson
180dbc957a port_fwd: Don't NS_CALL get_bound_ports()
When we want to scan for bound ports in the namespace we use NS_CALL() to
run get_bound_ports() in the namespace.  However, the only thing it
actually needed to be in the namespace for was to open the /proc/net file
it was scanning.  Since we now always pre-open those, we no longer need
to switch to the namespace for the actual get_bound_ports() calls.

That in turn means that tcp_port_detect() doesn't need to run in the ns
either, and we can just replace it with inline calls to get_bound_ports().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:53:24 +01:00
David Gibson
5a0485425b port_fwd: Pre-open /proc/net/* files rather than on-demand
procfs_scan_listen() can either use an already opened fd for a /proc/net
file, or it will open it.  So, effectively it will open the file on the
first call, then re-use the fd in subsequent calls.  However, it's not
possible to open the /proc/net files after we isolate our filesystem in
isolate_prefork().  That means that for each /proc/net file we must call
procfs_scan_listen() at least once before isolate_prefork(), or it won't
work afterwards.

That happens to be the case, but it's a pretty fragile requirement.  To
make this more robust, instead always pre-open the /proc files we will need
in get_bounds_port_init() and have procfs_scan_listen() just use those.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:53:21 +01:00
David Gibson
4f0b9f91e4 util: Add open_in_ns() helper
Most of our helpers which need to enter the pasta network namespace are
quite specialised.  Add one which is rather general - it just open()s a
given file in the namespace context and returns the fd back to the main
namespace.  This will have some future uses.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:53:18 +01:00
David Gibson
17d40d1cb5 port_fwd: Better parameterise procfs_scan_listen()
procfs_scan_listen() does some slightly clunky logic to deduce the fd it
wants to use, the path it wants to open and the state it's looking for
based on parameters for protocol, IP version and whether we're in the
namespace.

However, the caller already has to make choices based on similar parameters
so it can just pass in the things that procfs_scan_listen() needs directly.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:53:16 +01:00
David Gibson
e90f2770ae port_fwd: Move automatic port forwarding code to port_fwd.[ch]
The implementation of scanning /proc files to do automatic port forwarding
is a bit awkwardly split between procfs_scan_listen() in util.c,
get_bound_ports() and related functions in conf.c and the initial setup
code in conf().

Consolidate all of this into port_fwd.h, which already has some related
definitions, and a new port_fwd.c.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:53:14 +01:00
David Gibson
26d86f1304 conf: Cleaner initialisation of default forwarding modes
Initialisation of the forwarding mode variables is complicated a bit by the
fact that we have different defaults for passt and pasta modes.  This leads
to some debateably duplicated code between those two cases.

More significantly, however, currently the final setting of the mode
variable is interleaved with the code to actually start automatic scanning
when that's selected.  This essentially mingles "policy" code (setting the
default mode), with implementation code (making that happen).  That's a bit
inflexible if we want to change policies in future.

Disentangle these two pieces, and use a slightly different construction to
make things briefer as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:52:59 +01:00
Stefano Brivio
3fb3f0f7a5 selinux: Drop user_namespace class rules for Fedora 37
With current selinux-policy-37.22-1.fc37.noarch, and presumably any
future update for Fedora 37, the user_namespace class is not
available, so statements using it prevent the policy from being
loaded.

If a class is not defined in the base policy, any related permission
is assumed to be enabled, so we can safely drop those.

Link: https://bugzilla.redhat.com/show_bug.cgi?id=2237996
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:52:55 +01:00
Stas Sergeev
f851084c96 dhcp: put option 53 at the beginning
... unless it is listed in 55.
Many clients expect option 53 at the beginning.
mTCP has this code:
  if ( resp->options[0] != 53 ) {
    TRACE_WARN(( "Dhcp: first option was not a Dhcp msg type\n" ));
    return;
  }

wattcp32 has this:
static int DHCP_is_ack (void)
{
  const BYTE *opt = (const BYTE*) &dhcp_in.dh_opt[4];

  return (opt[0] == DHCP_OPT_MSG_TYPE && opt[1] == 1 && opt[2] == DHCP_ACK);
}
static int DHCP_is_nack (void)
{
  const BYTE *opt = (const BYTE*) &dhcp_in.dh_opt[4];

  return (opt[0] == DHCP_OPT_MSG_TYPE && opt[1] == 1 && opt[2] == DHCP_NAK);
}

Link: https://bugs.passt.top/show_bug.cgi?id=77
Signed-off-by: Stas Sergeev <stsp2@yandex.ru>
[sbrivio: s/options 53/option 53/ and s/other/others/ in comment]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-10-04 23:39:58 +02:00
Stefano Brivio
a469fc393f tcp, tap: Don't increase tap-side sequence counter for dropped frames
...so that we'll retry sending them, instead of more-or-less silently
dropping them. This happens quite frequently if our sending buffer on
the UNIX domain socket is heavily constrained (for instance, by the
208 KiB default memory limit).

It might be argued that dropping frames is part of the expected TCP
flow: we don't dequeue those from the socket anyway, so we'll
eventually retransmit them.

But we don't need the receiver to tell us (by the way of duplicate or
missing ACKs) that we couldn't send them: we already know as
sendmsg() reports that. This seems to considerably increase
throughput stability and throughput itself for TCP connections with
default wmem_max values.

Unfortunately, the 16 bits left as padding in the frame descriptors
we use internally aren't enough to uniquely identify for which
connection we should update sequence numbers: create a parallel
array of pointers to sequence numbers and L4 lengths, of
TCP_FRAMES_MEM size, and go through it after calling sendmsg().

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-10-04 23:39:58 +02:00
Stefano Brivio
d3192f67c4 tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag
It looks like we need it as workaround for this situation, readily
reproducible at least with a 6.5 Linux kernel, with default rmem_max
and wmem_max values:

- an iperf3 client on the host sends about 160 KiB, typically
  segmented into five frames by passt. We read this data using
  MSG_PEEK

- the iperf3 server on the guest starts receiving

- meanwhile, the host kernel advertised a zero-sized window to the
  sender, as expected

- eventually, the guest acknowledges all the data sent so far, and
  we drop it from the buffer, courtesy of tcp_sock_consume(), using
  recv() with MSG_TRUNC

- the client, however, doesn't get an updated window value, and
  even keepalive packets are answered with zero-window segments,
  until the connection is closed

It looks like dropping data from a socket using MSG_TRUNC doesn't
cause a recalculation of the window, which would be expected as a
result of any receiving operation that invalidates data on a buffer
(that is, not with MSG_PEEK).

Strangely enough, setting TCP_WINDOW_CLAMP via setsockopt(), even to
the previous value we clamped to, forces a recalculation of the
window which is advertised to the sender.

I couldn't quite confirm this issue by following all the possible
code paths in the kernel, yet. If confirmed, this should be fixed in
the kernel, but meanwhile this workaround looks robust to me (and it
will be needed for backward compatibility anyway).

Reported-by: Matej Hrica <mhrica@redhat.com>
Link: https://bugs.passt.top/show_bug.cgi?id=74
Analysed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-10-04 23:27:15 +02:00
Stefano Brivio
feaeb4986c tcp: Fix comment to tcp_sock_consume()
Note that tcp_sock_consume() doesn't update ACK sequence counters
anymore.

Fixes: cc6d8286d1 ("tcp: Reset ACK_FROM_TAP_DUE flag only as needed, update timer")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-10-04 23:24:08 +02:00
David Gibson
117b474f85 cppcheck: Work around bug in cppcheck 2.12.0
cppcheck 2.12.0 (and maybe some other versions) things this if condition
is always true, which is demonstrably not true.  Work around the bug for
now.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-10-04 23:24:05 +02:00
David Gibson
8640d62af7 cppcheck: Use "exhaustive" level checking when available
Recent enough cppcheck versions (at least as of cppcheck 2.12) give this
error processing conf.c:

conf.c:1179:1: information: ValueFlow analysis is limited in conf. Use --check-level=exhaustive if full analysis is wanted. [checkLevelNormal]

Adding --check-level=exhaustive doesn't seem to significantly increase the
cppcheck run time for us, so enable it when possible, suppressing that
warning.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-10-04 23:24:00 +02:00
David Gibson
c0efa4e97f conf: Remove overly cryptic selection of forward table
In a couple of places in conf(), we use a local 'fwd' variable to reference
one of our forwarding tables.  The value depends on which command line
option we're currently looking at, and is initialized rather cryptically
from an assignment side-effect within the if condition checking that
option.

Newer versions of cppcheck complain about this assignment being an always
true condition, but in any case it's both clearer and slightly shorter to
use separate if branches for the two cases and set the forwarding parameter
more directly.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-10-04 23:23:56 +02:00
David Gibson
6471c7d01b cppcheck: Make many pointers const
Newer versions of cppcheck (as of 2.12.0, at least) added a warning for
pointers which could be declared to point at const data, but aren't.
Based on that, make many pointers throughout the codebase const.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-10-04 23:23:35 +02:00
David Gibson
fc8f0f8c48 siphash: Use incremental rather than all-at-once siphash functions
We have a bunch of variants of the siphash functions for different data
sizes.  The callers, in tcp.c, need to pack the various values they want to
hash into a temporary structure, then call the appropriate version.  We can
avoid the copy into the temporary by directly using the incremental
siphash functions.

The length specific hash functions also have an undocumented constraint
that the data pointer they take must, in fact, be aligned to avoid
unaligned accesses, which may cause crashes on some architectures.

So, prefer the incremental approach and remove the length-specific
functions.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-30 12:40:53 +02:00
David Gibson
04b10a8d90 siphash, checksum: Move TBAA explanation to checksum.c
A number of checksum and hash functions require workarounds for the odd
behaviour of Type-Baased Alias Analysis.  We have a detailed comment about
this on siphash_8b() and other functions reference that.

Move the main comment to csume_16b() instead, because we're going to
reorganise things in siphash.c.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-30 12:40:51 +02:00
David Gibson
ceae8422c1 siphash: Make internal helpers public
Move a bunch of code from siphash.c to siphash.h, making it available to
other modules.  This will allow places which need hashes of more complex
objects to construct them incrementally.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-30 12:40:50 +02:00
David Gibson
cbc84df47f siphash: Use specific structure for internal state
To improve type safety, encapsulate the internal state of the SipHash
algorithm into a dedicated structure type.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-30 12:40:48 +02:00
David Gibson
fcec3f6f9d siphash: Use more hygienic state initialiser
The PREAMBLE macro sets up the SipHash initial internal state.  It also
defines that state as a variable, which isn't macro hygeinic.  With
previous changes simplifying this premable, it's now possible to replace it
with a macro which simply expands to a C initialisedrfor that state.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-30 12:40:46 +02:00
David Gibson
5cc843521d siphash: Fix bug in state initialisation
The SipHash algorithm starts with initializing the 32 bytes of internal
state with some magic numbers XORed with the hash key.  However, our
implementation has a bug - rather than XORing the hash key, it *sets* the
initial state to copies of the key.

I don't know if that affects any of the cryptographic properties of SipHash
but it's not what we should be doing.  Fix it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-30 12:40:43 +02:00
David Gibson
831067f483 siphash: Clean up hash finalisation with posthash_final() function
The POSTAMBLE macro implements the finalisation steps of SipHash.  It
relies on some variables in the environment, including returning the final
hash value that way.  That isn't great hygeine.

In addition the PREAMBLE macro takes a length parameter which is used only
to initialize the 'b' value that's not used until the finalisation and is
also sometimes modified in a non-obvious way by the callers.

The 'b' value is always composed from the total length of the hash input
plus up to 7 bytes of "tail" data - that is the remainder of the hash input
after a multiple of 8 bytes has been consumed.

Simplify all this by replacing the POSTAMBLE macro with a siphash_final()
function which takes the length and tail data as parameters and returns the
final hash value.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-30 12:40:40 +02:00
David Gibson
7a3153cbfb siphash: Add siphash_feed() helper
We have macros or inlines for a number of common operations in the siphash
functions.  However, in a number of places we still open code feeding
another 64-bits of data into the hash function: an xor, followed by 2
rounds of shuffling, followed by another xor.

Implement an inline function for this, which results in somewhat shortened
code.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-30 12:40:37 +02:00
David Gibson
f7b2be2d21 siphash: Make sip round calculations an inline function rather than macro
The SIPROUND(n) macro implements n rounds of SipHash shuffling.  It relies
on 'v' and '__i' variables being available in the context it's used in
which isn't great hygeine.  Replace it with an inline function instead.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-30 12:40:34 +02:00
David Gibson
ca6e94702c siphash: Make siphash functions consistently return 64-bit results
Some of the siphas_*b() functions return 64-bit results, others 32-bit
results, with no obvious pattern.  siphash_32b() also appears to do this
incorrectly - taking the 64-bit hash value and simply returning it
truncated, rather than folding the two halves together.

Since SipHash proper is defined to give a 64-bit hash, make all of them
return 64-bit results.  In the one caller which needs a 32-bit value,
tcp_seq_init() do the fold down to 32-bits ourselves.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-30 12:40:30 +02:00
David Gibson
c1d2a070f2 util: Consolidate and improve workarounds for clang-tidy issue 58992
We have several workarounds for a clang-tidy bug where the checker doesn't
recognize that a number of system calls write to - and therefore initialise
- a socket address.  We can't neatly use a suppression, because the bogus
warning shows up some time after the actual system call, when we access
a field of the socket address which clang-tidy erroneously thinks is
uninitialised.

Consolidate these workarounds into one place by using macros to implement
wrappers around affected system calls which add a memset() of the sockaddr
to silence clang-tidy.  This removes the need for the individual memset()
workarounds at the callers - and the somewhat longwinded explanatory
comments.

We can then use a #define to not include the hack in "real" builds, but
only consider it for clang-tidy.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-27 17:26:06 +02:00
David Gibson
5b6c68c2e4 Avoid shadowing index(3)
A classic gotcha of the standard C library is that its unwise to call any
variable 'index' because it will shadow the standard string library
function index(3).  This can cause warnings from cppcheck amongst others,
and it also means that if the variable is removed you tend to get confusing
type errors (or sometimes nothing at all) instead of a nice simple "name is
not defined" error.

Strictly speaking this only occurs if <string.h> is included, but that
is so common that as a rule it's best to just avoid it always.  We
have a number of places which hit this trap, so rename variables and
parameters to avoid it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-27 17:25:51 +02:00
Stefano Brivio
9178a9e346 tcp: Always send an ACK segment once the handshake is completed
The reporter is running a SMTP server behind pasta, and the client
waits for the server's banner before sending any data. In turn, the
server waits for our ACK after sending SYN,ACK, which never comes.

If we use the ACK_IF_NEEDED indication to tcp_send_flag(), given that
there's no pending data, we delay sending the ACK segment at the end
of the three-way handshake until we have some data to send to the
server.

This was actually intended, as I thought we would lower the latency
for new connections, but we can't assume that the client will start
sending data first (SMTP is the typical example where this doesn't
happen).

And, trying out this patch with SSH (where the client starts sending
data first), the reporter actually noticed we have a lower latency
by forcing an ACK right away. Comparing a capture before the patch:

13:07:14.007704 IP 10.1.2.1.42056 > 10.1.2.140.1234: Flags [S], seq 1797034836, win 65535, options [mss 4096,nop,wscale 7], length 0
13:07:14.007769 IP 10.1.2.140.1234 > 10.1.2.1.42056: Flags [S.], seq 2297052481, ack 1797034837, win 65480, options [mss 65480,nop,wscale 7], length 0
13:07:14.008462 IP 10.1.2.1.42056 > 10.1.2.140.1234: Flags [.], seq 1:22, ack 1, win 65535, length 21
13:07:14.008496 IP 10.1.2.140.1234 > 10.1.2.1.42056: Flags [.], ack 22, win 512, length 0
13:07:14.011799 IP 10.1.2.140.1234 > 10.1.2.1.42056: Flags [P.], seq 1:515, ack 22, win 512, length 514

and after:

13:10:26.165364 IP 10.1.2.1.59508 > 10.1.2.140.1234: Flags [S], seq 4165939595, win 65535, options [mss 4096,nop,wscale 7], length 0
13:10:26.165391 IP 10.1.2.140.1234 > 10.1.2.1.59508: Flags [S.], seq 985607380, ack 4165939596, win 65480, options [mss 65480,nop,wscale 7], length 0
13:10:26.165418 IP 10.1.2.1.59508 > 10.1.2.140.1234: Flags [.], ack 1, win 512, length 0
13:10:26.165683 IP 10.1.2.1.59508 > 10.1.2.140.1234: Flags [.], seq 1:22, ack 1, win 512, length 21
13:10:26.165698 IP 10.1.2.140.1234 > 10.1.2.1.59508: Flags [.], ack 22, win 512, length 0
13:10:26.167107 IP 10.1.2.140.1234 > 10.1.2.1.59508: Flags [P.], seq 1:515, ack 22, win 512, length 514

the latency between the initial SYN segment and the first data
transmission actually decreases from 792µs to 334µs. This is not
statistically relevant as we have a single measurement, but it can't
be that bad, either.

Reported-by: cr3bs (from IRC)
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-09-27 17:25:30 +02:00
Stefano Brivio
c09069211a dhcp: Actually note down the length of options received by the client
It turns out we never used 'clen' until commit 1f24d3efb4 ("dhcp:
support BOOTP clients"), and we always ignored option 55 (Parameter
Request List), while, according to RFC 2132, we MUST try to insert
the requested options in the order requested by the client.

The commit mentioned above made this visible because now every client
is reported as sending a DHCPREQUEST as an old BOOTP client, based on
the lack of option 53 (that is, zero length).

Fixes: b439984641 ("merd: ARP and DHCP handlers, connection tracking fixes")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-09-27 17:25:22 +02:00
Stefano Brivio
8b8537d301 dhcpv6: Properly separate domain names in search list
To prepare the DHCPv6 domain search list option, we go over the
flattened list of domains, and replace both dots and zero bytes with a
counter of bytes in the next label, implementing the encoding
specified by section 3.1 of RFC 1035.

If there are multiple domains in the list, however, zero bytes serve
as markers for the end of a domain name, and we'll replace them with
the length of the first label of the next domain, plus one. This is
wrong. We should only convert the dots before the labels.

To distinguish between label separators and domain names separators,
for simplicity, introduce a dot before the first label of every
domain we copy to form the list. All dots are then replaced by label
lengths, and separators (zero bytes) remain as they are.

As we do this, we need to make sure we don't replace the trailing
dot, if present: that's already a separator. Skip copying it, and
just add separators as needed.

Now that we don't copy those, though, we might end up with
zero-length domains: skip them, as they're meaningless anyway.

And as we might skip domains, we can't use the index 'i' to check if
we're at the beginning of the option -- use 'srch' instead.

This is very similar to how we prepare the list for NDP option 31,
except that we don't need padding (RFC 8106, 5.2) here, and we should
refactor this into common functions, but it probably makes sense to
rework the NDP responder (https://bugs.passt.top/show_bug.cgi?id=21)
first.

Reported-by: Sebastian Mitterle <smitterl@redhat.com>
Link: https://bugs.passt.top/show_bug.cgi?id=75
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-27 17:21:00 +02:00
Stefano Brivio
05627dc512 util: Fix licensing information display in --version
The regular expression I used when relicensing to GPLv2+ missed this.

Fixes: ca2749e1bd ("passt: Relicense to GPL 2.0, or any later version")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-08 17:34:27 +02:00
David Gibson
46f915ddee tcp: Correct handling of FIN,ACK followed by SYN
When the guest tries to establish a connection, it could give up on it by
sending a FIN,ACK instead of a plain ACK to our SYN,ACK.  It could then
make a new attempt to establish a connection with the same addresses and
ports with a new SYN.

Although it's unlikely, it could send the 2nd SYN very shortly after the
FIN,ACK resulting in both being received in the same batch of packets from
the tap interface.

Currently, we don't handle that correctly, when we receive a FIN,ACK on a
not fully established connection we discard the remaining packets in the
batch, and so will never process the 2nd SYN.  Correct this by returning
1 from tcp_tap_handler() in this case, so we'll just consume the FIN,ACK
and continue to process the rest of the batch.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-08 09:16:22 +02:00