1
0
mirror of https://passt.top/passt synced 2024-12-22 21:55:22 +00:00
Commit Graph

368 Commits

Author SHA1 Message Date
David Gibson
2d0e0084b6 tap: Extend tap_send_frames() to allow multi-buffer frames
tap_send_frames() takes a vector of buffers and requires exactly one frame
per buffer.  We have future plans where we want to have multiple buffers
per frame in some circumstances, so extend tap_send_frames() to take the
number of buffers per frame as a parameter.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Improve comment to rembufs calculation]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-14 16:57:28 +01:00
Laurent Vivier
bb11d15495 tcp: Introduce tcp_fill_headers4()/tcp_fill_headers6()
Replace the macro SET_TCP_HEADER_COMMON_V4_V6() by a new function
tcp_fill_header().

Move IPv4 and IPv6 code from tcp_l2_buf_fill_headers() to
tcp_fill_headers4() and tcp_fill_headers6()

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Message-ID: <20240303135114.1023026-10-lvivier@redhat.com>
[dwg: Correct commit message with new function names]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-06 08:03:52 +01:00
Laurent Vivier
6b22e10a26 tap: make tap_update_mac() generic
Use ethhdr rather than tap_hdr.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-ID: <20240303135114.1023026-9-lvivier@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-06 08:03:49 +01:00
Laurent Vivier
7df624e79a checksum: introduce functions to compute the header part checksum for TCP/UDP
The TCP and UDP checksums are computed using the data in the TCP/UDP
payload but also some informations in the IP header (protocol,
length, source and destination addresses).

We add two functions, proto_ipv4_header_psum() and
proto_ipv6_header_psum(), to compute the checksum of the IP
header part.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Message-ID: <20240303135114.1023026-8-lvivier@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-06 08:03:47 +01:00
Laurent Vivier
feb4900c25 checksum: use csum_ip4_header() in udp.c and tcp.c
We can find the same function to compute the IPv4 header
checksum in tcp.c, udp.c and tap.c

Use the function defined for tap.c, csum_ip4_header(), but
with the code used in tcp.c and udp.c as it doesn't need a fully
initialiazed IPv4 header, only protocol, tot_len, saddr and daddr.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-ID: <20240303135114.1023026-7-lvivier@redhat.com>
[dwg: Fix weird cppcheck regression; it appears to be a problem
 in pre-existing code, but somehow this patch is exposing it]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-06 08:03:44 +01:00
Laurent Vivier
324bd46782 util: move IP stuff from util.[ch] to ip.[ch]
Introduce ip.[ch] file to encapsulate IP protocol handling functions and
structures.  Modify various files to include the new header ip.h when
it's needed.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-ID: <20240303135114.1023026-5-lvivier@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-06 08:03:38 +01:00
David Gibson
3b9098aa49 fwd: Rename port_fwd.[ch] and their contents
Currently port_fwd.[ch] contains helpers related to port forwarding,
particular automatic port forwarding.  We're planning to allow much more
flexible sorts of forwarding, including both port translation and NAT based
on the flow table.  This will subsume the existing port forwarding logic,
so rename port_fwd.[ch] to fwd.[ch] with matching updates to all the names
within.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:48:27 +01:00
David Gibson
3b59b9748a tcp: Validate TCP endpoint addresses
TCP connections should typically not have wildcard addresses (0.0.0.0
or ::) nor a zero port number for either endpoint.  It's not entirely
clear (at least to me) if it's strictly against the RFCs to do so, but
at any rate the socket interfaces often treat those values
specially[1], so it's not really possible to manipulate such
connections.  Likewise they should not have broadcast or multicast
addresses for either endpoint.

However, nothing prevents a guest from creating a SYN packet with such
values, and it's not entirely clear what the effect on passt would be.
To ensure sane behaviour, explicitly check for this case and drop such
packets, logging a debug warning (we don't want a higher level,
because that would allow a guest to spam the logs).

We never expect such an address on an accept()ed socket either, but
just in case, check for it as well.

[1] Depending on context as "unknown", "match any" or "kernel, pick
    something for me"

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:48:17 +01:00
David Gibson
dc9a5d71e9 tcp, tcp_splice: Parse listening socket epoll ref in tcp_listen_handler()
tcp_listen_handler() uses the epoll reference for the listening socket
it handles, and also passes on one variant of it to
tcp_tap_conn_from_sock() and tcp_splice_conn_from_sock().  The latter
two functions only need a couple of specific fields from the
reference.

Pass those specific values instead of the whole reference, which
localises the handling of the listening (as opposed to accepted)
socket and its reference entirely within tcp_listen_handler().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:48:15 +01:00
David Gibson
ee677e0a42 tcp_splice: Improve logic deciding when to splice
This makes several tweaks to improve the logic which decides whether
we're able to use the splice method for a new connection.

 * Rather than only calling tcp_splice_conn_from_sock() in pasta mode, we
   check for pasta mode within it, better localising the checks.
 * Previously if we got a connection from a non-loopback address we'd
   always fall back to the "tap" path, even if the  connection was on a
   socket in the namespace.  If we did get a non-loopback address on a
   namespace socket, something has gone wrong and the "tap" path certainly
   won't be able to handle it.  Report the error and close, rather than
   passing it along to tap.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:48:13 +01:00
David Gibson
0f938c3b9a flow: Clarify flow entry life cycle, introduce uniform logging
Our allocation scheme for flow entries means there are some
non-obvious constraints on when what things can be done with an entry.
Add a big doc comment explaining the life cycle.

In addition, make a FLOW_START() macro to mark one of the important
transitions.  This encourages correct usage, by making it natural to
only access the flow type specific structure after calling it.  It
also logs that a new flow has been created, which is useful for
debugging.

We also add logging when a flow's lifecycle ends.  This doesn't need a
new helper, because it can only happen either from flow_alloc_cancel()
or from the flow deferred handler.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:48:01 +01:00
David Gibson
bb9bf0bb8f tcp, udp: Don't precompute port remappings in epoll references
The epoll references for both TCP listening sockets and UDP sockets
includes a port number.  This gives the destination port that traffic
to that socket will be sent to on the other side.  That will usually
be the same as the socket's bound port, but might not if the -t, -u,
-T or -U options are given with different original and forwarded port
numbers.

As we move towards a more flexible forwarding model for passt, it's
going to become possible for that destination port to vary depending
on more things (for example the source or destination address).  So,
it will no longer make sense to have a fixed value for a listening
socket.

Change to simpler semantics where this field in the reference gives
the bound port of the socket.  We apply the translations to the
correct destination port later on, when we're actually forwarding.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:47:40 +01:00
David Gibson
f6e6e8ad40 inany: Introduce union sockaddr_inany
There are a number of places where we want to handle either a
sockaddr_in or a sockaddr_in6.  In some of those we use a void *,
which works ok and matches some standard library interfaces, but
doesn't give a signature level hint that we're dealing with only
sockaddr_in or sockaddr_in6, not (say) sockaddr_un or another type of
socket address.  Other places we use a sockaddr_storage, which also
works, but has the same problem in addition to allocating more on the
stack than we need to.

Introduce union sockaddr_inany to explictly handle this case: it has
variants for sockaddr_in and sockaddr_in6.  Use it in a number of
places where it's easy to do so.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:47:31 +01:00
David Gibson
0cf6b2d89d inany: Provide more conveniently typed constants for special addresses
Our inany_addr type is used in some places to represent either IPv4 or
IPv6 addresses, and we plan to use it more widely.  We don't yet
provide constants of this type for special addresses (loopback and
"any").  Add some of these, both the IPv4 and IPv6 variants of those
addresses, but typed as union inany_addr.

To avoid actually adding more things to .data we can use some macros and
casting to overlay the IPv6 versions of these with the standard library's
in6addr_loopback and in6addr_any.  For the IPv4 versions we need to create
new constant globals.

For complicated historical reasons, the standard library doesn't
provide constants for IPv4 loopback and any addresses as struct
in_addr.  It just has macros of type in_addr_t == uint32_t, which has
some gotchas w.r.t. endianness.  We can use some more macros to
address this lack, using macros to effectively create these IPv4
constants as pieces of the inany constants above.

We use this last to avoid some awkward temporary variables just used
to get an address of an IPv4 loopback address.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:47:28 +01:00
David Gibson
e5e6f29459 tcp: Don't store errnos in socket pool
If tcp_sock_refill_pool() gets an error opening new sockets, it stores the
negative errno of that error in the socket pool.  This isn't especially
useful:
  * It's inconsistent with the initial state of the pool (all -1)
  * It's inconsistent with the state of an entry that was valid and was
    then consumed (also -1)
  * By the time we did anything with this error code, it's now far removed
    from the situation in which the error occurred, making it difficult to
    report usefully

We now have error reporting closer to when failures happen on the refill
paths, so just leave a pool slot we can't fill as -1.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-27 12:53:30 +01:00
David Gibson
fe27ebce5c tcp, tcp_splice: Helpers for getting sockets from the pools
We maintain pools of ready-to-connect sockets in both the original and
(for pasta) guest namespace to reduce latency when starting new TCP
connections.  If we exhaust those pools we have to take a higher
latency path to get a new socket.

Currently we open-code that fallback in the places we need it.  To improve
clarity encapsulate that into helper functions.  While we're at it, give
those helpers clearer error reporting.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-27 12:52:46 +01:00
David Gibson
fbe81decbd tcp, tcp_splice: Issue warnings if unable to refill socket pool
Currently if tcp_sock_refill_pool() is unable to fill all the slots in the
pool, it will silently exit.  This might lead to a later attempt to get
fds from the pool to fail at which point it will be harder to tell what
originally went wrong.

Instead add warnings if we're unable to refill any of the socket pools when
requested.  We have tcp_sock_refill_pool() return an error and report it
in the callers, because those callers have more context allowing for a
more useful message.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-27 12:52:44 +01:00
David Gibson
554b3aefe7 tcp: Stop on first error when refilling socket pools
Currently if we get an error opening a new socket while refilling a socket
pool, we carry on to the next slot and try again.  This isn't very useful,
since by far the most likely cause of an error is some sort of resource
exhaustion.  Trying again will probably just hit the same error, and maybe
even make things worse.

So, instead stop on the first error while refilling the pool, making do
with however many sockets we managed to open before the error.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-27 12:52:17 +01:00
David Gibson
af303fdbff tcp: Don't stop refilling socket pool if we find a filled entry
Currently tcp_sock_refill_pool() stops as soon as it finds an entry in the
pool with a valid fd.  This appears to makes sense: we always use fds from
the front of the pool, so if we find a filled one, the rest of the pool
should be filled as well.

However, that's not quite correct.  If a previous refill hit errors trying
to open new sockets, it could leave gaps between blocks of valid fds. We're
going to add some changes that could make that more likely.

So, for robustness, instead skip over the filled entry but still try to
refill the rest of the array.  We expect simply iterating over the pool to
be of small cost compared to even a single system call, so this shouldn't
have much impact.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-27 12:52:14 +01:00
David Gibson
4e08d9b9c6 treewide: Use sa_family_t for address family variables
Sometimes we use sa_family_t for variables and parameters containing a
socket address family, other times we use a plain int.  Since sa_family_t
is what's actually used in struct sockaddr and friends, standardise on
that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-27 12:52:02 +01:00
David Gibson
78901eeae3 tcp: Fix subtle bug in fast re-transmit path
When a duplicate ack from the tap side triggers a fast re-transmit, we set
both conn->seq_ack_from_tap and conn->seq_to_tap to the sequence number of
the duplicate ack.  Setting seq_to_tap is correct: this is what triggers
the retransmit from this point onwards.  Setting seq_ack_from_tap is
not correct, though.

In most cases setting seq_ack_from_tap will be redundant but harmless:
it will have already been updated to the same value by
tcp_update_seqack_from_tap() a few lines above.  However that call can
be skipped if tcp_sock_consume() fails, which is rare but possible.  In
that case this update will cause problems.

We use seq_ack_from_tap to track two logically distinct things: how much of
the stream has been acked by the guest, and how much of the stream from the
socket has been read and discarded (as opposed to MSG_PEEKed).  We attempt
to keep those values the same, because we discard data exactly when it is
acked by the guest.  However tcp_sock_consume() failing means we weren't
able to disard the acked data.  To handle that case, we skip the usual
update of seq_ack_from_tap, effectively ignoring the ack assuming we'll get
one which supersedes it soon enough.  Setting seq_ack_from_tap in the
fast retransmit path, however, means we now really will have the
read/discard point in the stream out of sync with seq_ack_from_tap.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-11 18:30:01 +01:00
David Gibson
8981a720aa flow: Avoid moving flow entries to compact table
Currently we always keep the flow table maximally compact: that is all the
active entries are contiguous at the start of the table.  Doing this
sometimes requires moving an entry when one is freed.  That's kind of
fiddly, and potentially expensive: it requires updating the hash table for
the new location, and depending on flow type, it may require EPOLL_CTL_MOD,
system calls to update epoll tags with the new location too.

Implement a new way of managing the flow table that doesn't ever move
entries.  It attempts to maintain some compactness by always using the
first free slot for a new connection, and mitigates the effect of non
compactness by cheaply skipping over contiguous blocks of free entries.
See the "theory of operation" comment in flow.c for details.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>b
[sbrivio: additional ASSERT(flow_first_free <= FLOW_MAX - 2) to avoid
 Coverity Scan false positive]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:35:37 +01:00
David Gibson
9c0881d4f6 flow: Enforce that freeing of closed flows must happen in deferred handlers
Currently, flows are only evern finally freed (and the table compacted)
from the deferred handlers.  Some future ways we want to optimise managing
the flow table will rely on this, so enforce it: rather than having the
TCP code directly call flow_table_compact(), add a boolean return value to
the per-flow deferred handlers.  If true, this indicates that the flow
code itself should free the flow.

This forces all freeing of flows to occur during the flow code's scan of
the table in flow_defer_handler() which opens possibilities for future
optimisations.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:35:33 +01:00
David Gibson
4a849e9526 flow: Abstract allocation of new flows with helper function
Currently tcp.c open codes the process of allocating a new flow from the
flow table: twice, in fact, once for guest to host and once for host to
guest connections.  This duplication isn't ideal and will get worse as we
add more protocols to the flow table.  It also makes it harder to
experiment with different ways of handling flow table allocation.

Instead, introduce a function to allocate a new flow: flow_alloc().  In
some cases we currently check if we're able to allocate, but delay the
actual allocation.  We now handle that slightly differently with a
flow_alloc_cancel() function to back out a recent allocation.  We have that
separate from a flow_free() function, because future changes we have in
mind will need to handle this case a little differently.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:35:31 +01:00
David Gibson
fb7c00169d flow: Move flow_count from context structure to a global
In general, the passt code is a bit haphazard about what's a true global
variable and what's in the quasi-global 'context structure'.  The
flow_count field is one such example: it's in the context structure,
although it's really part of the same data structure as flowtab[], which
is a genuine global.

Move flow_count to be a regular global to match.  For now it needs to be
public, rather than static, but we expect to be able to change that in
future.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:35:29 +01:00
David Gibson
02e092b4fe tcp, tcp_splice: Avoid double layered dispatch for connected TCP sockets
Currently connected TCP sockets have the same epoll type, whether they're
for a "tap" connection or a spliced connection.  This means that
tcp_sock_handler() has to do a secondary check on the type of the
connection to call the right function.  We can avoid this by adding a new
epoll type and dispatching directly to the right thing.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:35:25 +01:00
David Gibson
36dfa8b8fb flow, tcp: Add handling for per-flow timers
tcp_timer() scans the flow table so that it can run tcp_splice_timer() on
each spliced connection.  More generally, other flow types might want to
run similar timers in future.

We could add a flow_timer() analagous to tcp_timer(), udp_timer() etc.
However, this would need to scan the flow table, which we would have just
done in flow_defer_handler().  We'd prefer to just scan the flow table
once, dispatching both per-flow deferred events and per-flow timed events
if necessary.

So, extend flow_defer_handler() to do this.  For now we use the same timer
interval for all flow types (1s).  We can make that more flexible in future
if we need to.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:35:19 +01:00
David Gibson
b43e4483ed flow, tcp: Add flow-centric dispatch for deferred flow handling
tcp_defer_handler(), amongst other things, scans the flow table and does
some processing for each TCP connection.  When we add other protocols to
the flow table, they're likely to want some similar scanning.  It makes
more sense for cache friendliness to perform a single scan of the flow
table and dispatch to the protocol specific handlers, rather than having
each protocol separately scan the table.

To that end, add a new flow_defer_handler() handling all flow-linked
deferred operations.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:35:17 +01:00
David Gibson
c97bb527d6 tcp, tcp_splice: Move per-type cleanup logic into per-type helpers
tcp_conn_destroy() and tcp_splice_destroy() are always called conditionally
on the connection being closed or closing.  Move that logic into the
"destroy" functions themselves, renaming them tcp_flow_defer() and
tcp_splice_flow_defer().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:35:15 +01:00
David Gibson
eebca1115f tcp, tcp_splice: Remove redundant handling from tcp_timer()
tcp_timer() scans the connection table, expiring "tap" connections and
calling tcp_splice_timer() for "splice" connections.  tcp_splice_timer()
expires spliced connections and then does some other processing.

However, tcp_timer() is always called shortly after tcp_defer_handler()
(from post_handler()), which also scans the flow table expiring both tap
and spliced connections.  So remove the redundant handling, and only do
the extra tcp_splice_timer() work from tcp_timer().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:35:13 +01:00
David Gibson
8563e7c870 treewide: Standardise on 'now' for current timestamp variables
In a number of places we pass around a struct timespec representing the
(more or less) current time.  Sometimes we call it 'now', and sometimes we
call it 'ts'.  Standardise on the more informative 'now'.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:35:10 +01:00
David Gibson
17bbab1c97 flow: Make flow_table.h #include the protocol specific headers it needs
flow_table.h, the lower level flow header relies on having the struct
definitions for every protocol specific flow type - so far that means
tcp_conn.h.  It doesn't include it itself, so tcp_conn.h must be included
before flow_table.h.

That's ok for now, but as we use the flow table for more things,
flow_table.h will need the structs for all of them, which means the
protocol specific .c files would need to include tcp_conn.h _and_ the
equivalents for every other flow type before flow_table.h every time,
which is weird.

So, although we *mostly* lean towards the include style where .c files need
to handle the include dependencies, in this case it makes more sense to
have flow_table.h include all the protocol specific headers it needs.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:34:55 +01:00
David Gibson
a179ca6707 treewide: Make a bunch of pointer variables pointers to const
Sufficiently recent cppcheck (I'm using 2.13.0) seems to have added another
warning for pointer variables which could be pointer to const but aren't.
Use this to make a bunch of variables const pointers where they previously
weren't for no particular reason.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-16 21:49:27 +01:00
Laurent Vivier
6a348cb435 tcp: make tcp_sock_set_bufsize() static (again)
e5eefe7743 ("tcp: Refactor to use events instead of states, split out
spliced implementation") has exported tcp_sock_set_bufsize() to
be able to use it in tcp_splice.c, but 6ccab72d9b has removed its use
in tcp_splice.c, so we can set it static again.

Fixes: 6ccab72d9b ("tcp: Improve handling of fallback if socket pool is empty on new splice")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-27 19:31:25 +01:00
David Gibson
546332786c treewide: Use IN4ADDR_LOOPBACK_INIT more widely
We already define IN4ADDR_LOOPBACK_INIT to initialise a struct in_addr to
the loopback address without delving into its internals.  However there are
some places we don't use it, and explicitly look at the internal structure
of struct in_addr, which we generally want to avoid.  Use the define more
widely to avoid that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-27 19:29:45 +01:00
David Gibson
1f2aab8aaa tcp: Fix address type for tcp_sock_init_af()
This takes a struct in_addr * (i.e. an IPv4 address), although it's
explicitly supposed to handle IPv6 as well.  Both its caller and sock_l4()
which it calls use a void * for the address, which can be either an in_addr
or an in6_addr.

We get away with this, because we don't do anything with the pointer other
than transfer it from the caller to sock_l4(), but it's misleading.  And
quite possibly technically UB, because C is like that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-27 19:29:45 +01:00
David Gibson
5d5bb8c150 tcp: Don't account for hash table size in tcp_hash()
Currently tcp_hash() returns the hash bucket for a value, that is the hash
modulo the size of the hash table.  Usually it's a bit more flexible to
have hash functions return a "raw" hash value and perform the modulus in
the callers.  That allows the same hash function to be used for multiple
tables of different sizes, or to re-use the hash for other purposes.

We don't do anything like that with tcp_hash() at present, but we have some
plans to do so.  Prepare for that by making tcp_hash() and tcp_conn_hash()
return raw hash values.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-27 19:29:45 +01:00
David Gibson
64e5459ba6 tcp: Implement hash table with indices rather than pointers
We implement our hash table with pointers to the entry for each bucket (or
NULL).  However, the entries are always allocated within the flow table,
meaning that a flow index will suffice, halving the size of the hash table.

For TCP, just a flow index would be enough, but future uses will want to
expand the hash table to cover indexing either side of a flow, so use a
flow_sidx_t as the type for each hash bucket.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-27 19:29:45 +01:00
David Gibson
5913f26415 tcp: Switch hash table to linear probing instead of chaining
Currently we deal with hash collisions by letting a hash bucket contain
multiple entries, forming a linked list using an index in the connection
structure.

That's a pretty standard and simple approach, but in our case we can use
an even simpler one: linear probing.  Here if a hash bucket is occupied
we just move onto the next one until we find a feww one.  This slightly
simplifies lookup and more importantly saves some precious bytes in the
connection structure by removing the need for a link.  It does require some
additional complexity for hash removal.

This approach can perform poorly with hash table load is high.  However, we
already size our hash table of pointers larger than the connection table,
which puts an upper bound on the load.  It's relatively cheap to decrease
that bound if we find we need to.

I adapted the linear probing operations from Knuth's The Art of Computer
Programming, Volume 3, 2nd Edition.  Specifically Algorithm L and Algorithm
R in Section 6.4.  Note that there is an error in Algorithm R as printed,
see errata at [0].

[0] https://www-cs-faculty.stanford.edu/~knuth/all3-prepre.ps.gz

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-27 19:29:45 +01:00
David Gibson
89fcb563fc tcp: Fix conceptually incorrect byte-order switch in tcp_tap_handler()
tcp_hash_lookup() expects the port numbers in host order, but the TCP
header, of course, has them in network order, so we need to switch them.
However we call htons() (host to network) instead of ntohs() (network to
host).  This works because those do the same thing in practice (they only
wouldn't on very strange theoretical platforms which are neither big nor
little endian).

But, having this the "wrong" way around is misleading, so switch it around.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-27 19:29:45 +01:00
Stefano Brivio
fd29d62a9d tcp: Cast timeval fields to unsigned long long for printing
On x32, glibc defines time_t and suseconds_t (the latter, also known as
__syscall_slong_t) as unsigned long long, whereas "everywhere else",
including x86_64 and i686, those are unsigned long.

See also https://sourceware.org/bugzilla/show_bug.cgi?id=16437 for
all the gory details.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-27 19:29:45 +01:00
David Gibson
b86afe3559 tcp: Don't defer hash table removal
When a TCP connection is closed, we mark it by setting events to CLOSED,
then some time later we do final cleanups: closing sockets, removing from
the hash table and so forth.

This does mean that when making a hash lookup we need to exclude any
apparent matches that are CLOSED, since they represent a stale connection.
This can happen in practice if one connection closes and a new one with the
same endpoints is started shortly afterward.

Checking for CLOSED is quite specific to TCP however, and won't work when
we extend the hash table to more general flows.  So, alter the code to
immediately remove the connection from the hash table when CLOSED, although
we still defer closing sockets and other cleanup.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-04 09:51:33 +01:00
David Gibson
e21b6d69b1 tcp: "TCP" hash secret doesn't need to be TCP specific
The TCP state structure includes a 128-bit hash_secret which we use for
SipHash calculations to mitigate attacks on the TCP hash table and initial
sequence number.

We have plans to use SipHash in places that aren't TCP related, and there's
no particular reason they'd need their own secret.  So move the hash_secret
to the general context structure.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-04 09:51:32 +01:00
David Gibson
705549f834 flow,tcp: Use epoll_ref type including flow and side
Currently TCP uses the 'flow' epoll_ref field for both connected
sockets and timers, which consists of just the index of the relevant
flow (connection).

This is just fine for timers, for while it obviously works, it's
subtly incomplete for sockets on spliced connections.  In that case we
want to know which side of the connection the event is occurring on as
well as which connection.  At present, we deduce that information by
looking at the actual fd, and comparing it to the fds of the sockets
on each side.

When we use the flow table for more things, we expect more cases where
something will need to know a specific side of a specific flow for an
event, but nothing more.

Therefore add a new 'flowside' epoll_ref field, with exactly that
information.  We use it for TCP connected sockets.  This allows us to
directly know the side for spliced connections.  For "tap"
connections, it's pretty meaningless, since the side is always the
socket side.  It still makes logical sense though, and it may become
important for future flow table work.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-04 09:51:24 +01:00
David Gibson
ecea8d36ff flow,tcp: Generalise TCP epoll_ref to generic flows
TCP uses three different epoll object types: one for connected sockets, one
for timers and one for listening sockets.  Listening sockets really need
information that's specific to TCP, so need their own epoll_ref field.
Timers and connected sockets, however, only need the connection (flow)
they're associated with.  As we expand the use of the flow table, we expect
that to be true for more epoll fds.  So, rename the "TCP" epoll_ref field
to be a "flow" epoll_ref field that can be used both for TCP and for other
future cases.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-04 09:51:20 +01:00
David Gibson
31bab5f2d9 tcp: Remove unneccessary bounds check in tcp_timer_handler()
In tcp_timer_handler() we use conn_at_idx() to interpret the flow index
from the epoll reference.  However, this will never be NULL - we always
put a valid index into the epoll_ref.  Simplify slightly based on this.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-04 09:51:17 +01:00
David Gibson
eb8b1a233b flow, tcp: Add logging helpers for connection related messages
Most of the messages logged by the TCP code (be they errors, debug or
trace messages) are related to a specific connection / flow.  We're fairly
consistent about prefixing these with the type of connection and the
connection / flow index.  However there are a few places where we put the
index later in the message or omit it entirely.  The template with the
prefix is also a little bulky to carry around for every message,
particularly for spliced connections.

To help keep this consistent, introduce some helpers to log messages
linked to a specific flow.  It takes the flow as a parameter and adds a
uniform prefix to each message.  This makes things slightly neater now, but
more importantly will help keep formatting consistent as we add more things
to the flow table.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-04 09:51:12 +01:00
David Gibson
96590b0560 flow: Make unified version of flow table compaction
tcp_table_compact() will move entries in the connection/flow table to keep
it compact when other entries are removed.  The moved entries need not have
the same type as the flow removed, so it needs to be able to handle moving
any type of flow.  Therefore, move it to flow.c rather than being
purportedly TCP specific.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-04 09:51:09 +01:00
David Gibson
e2e8219f13 flow, tcp: Consolidate flow pointer<->index helpers
Both tcp.c and tcp_splice.c define CONN_IDX() variants to find the index
of their connection structures in the connection table, now become the
unified flow table.  We can easily combine these into a common helper.
While we're there, add some trickery for some additional type safety.

They also define their own CONN() versions, which aren't so easily combined
since they need to return different types, but we can have them use a
common helper.

In the process, we standardise on always using an unsigned type to store
the connection / flow index, which makes more sense.  tcp.c's conn_at_idx()
remains for now, but we change its parameter to unsigned to match.  That in
turn means we can remove a check for negative values from it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-04 09:51:04 +01:00
David Gibson
f08ce92a13 flow, tcp: Move TCP connection table to unified flow table
We want to generalise "connection" tracking to things other than true TCP
connections.  Continue implenenting this by renaming the TCP connection
table to the "flow table" and moving it to flow.c.  The definitions are
split between flow.h and flow_table.h - we need this separation to avoid
circular dependencies: the definitions in flow.h will be needed by many
headers using the flow mechanism, but flow_table.h needs all those protocol
specific headers in order to define the full flow table entry.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-04 09:51:02 +01:00
David Gibson
16ae032608 flow, tcp: Generalise connection types
Currently TCP connections use a 1-bit selector, 'spliced', to determine the
rest of the contents of the structure.  We want to generalise the TCP
connection table to other types of flows in other protocols.  Make a start
on this by replacing the tcp_conn_common structure with a new flow_common
structure with an enum rather than a simple boolean indicating the type of
flow.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-04 09:50:59 +01:00
Laurent Vivier
d902bb6288 tcp: remove useless assignment
In tcp_send_flag(), a4826ee04b has replaced:

    th->doff = sizeof(*th) / 4;
    th->doff += OPT_MSS_LEN / 4;
    th->doff += (1 + OPT_WS_LEN) / 4;

by

    optlen = OPT_MSS_LEN + 1 + OPT_WS_LEN;
    th->doff = (sizeof(*th) + optlen) / 4;

but forgot to remove the useless "th->doff += (1 + OPT_WS_LEN) / 4;"

Fixes: a4826ee04b ("tcp: Defer and coalesce all segments with no data (flags) to handler")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-04 09:50:51 +01:00
Stefano Brivio
06559048e7 treewide: Use 'z' length modifier for size_t/ssize_t conversions
Types size_t and ssize_t are not necessarily long, it depends on the
architecture.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-12-02 03:54:42 +01:00
David Gibson
f7724647b1 valgrind: Adjust suppression for MSG_TRUNC with NULL buffer
valgrind complains if we pass a NULL buffer to recv(), even if we use
MSG_TRUNC, in which case it's actually safe.  For a long time we've had
a valgrind suppression for this.  It singles out the recv() in
tcp_sock_consume(), the only place we use MSG_TRUNC.

However, tcp_sock_consume() only has a single caller, which makes it a
prime candidate for inlining.  If inlined, it won't appear on the stack and
valgrind won't match the correct suppression.

It appears that certain compiler versions (for example gcc-13.2.1 in
Fedora 39) will inline this function even with the -O0 we use for valgrind
builds.  This breaks the suppression leading to a spurious failure in the
tests.

There's not really any way to adjust the suppression itself without making
it overly broad (we don't want to match other recv() calls).  So, as a hack
explicitly prevent inlining of this function when we're making a valgrind
build.  To accomplish this add an explicit -DVALGRIND when making such a
build.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-19 09:10:12 +01:00
David Gibson
4ccdeecb74 tcp: Simplify away tcp_port_rebind()
tcp_port_rebind() is desgined to be called from NS_CALL() and has two
disjoint cases: one where it enters the namespace (outbound forwards) and
one where it doesn't (inbound forwards).

We only actually need the NS_CALL() framing for the outbound case, for
inbound we can just call tcp_port_do_rebind() directly.  So simplify
tcp_port_rebind() to tcp_port_rebind_outbound(), allowing us to eliminate
an awkward parameters structure.

With that done we can safely rename tcp_port_do_rebind() to
tcp_port_rebind() for brevity.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-19 09:08:37 +01:00
David Gibson
1776e7af9b tcp: Use common helper for rebinding inbound and outbound ports
tcp_port_rebind() has two cases with almost but not quite identical code.
Simplify things a bit by factoring this out into a single parameterised
helper, tcp_port_do_rebind().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-19 09:08:32 +01:00
David Gibson
cf3eeba6c0 tcp: Don't use TCP_WINDOW_CLAMP
On the L2 tap side, we see TCP headers and know the TCP window that the
ultimate receiver is advertising.  In order to avoid unnecessary buffering
within passt/pasta (or by the kernel on passt/pasta's behalf) we attempt
to advertise that window back to the original sock-side sender using
TCP_WINDOW_CLAMP.

However, TCP_WINDOW_CLAMP just doesn't work like this.  Prior to kernel
commit 3aa7857fe1d7 ("tcp: enable mid stream window clamp"), it simply
had no effect on established sockets.  After that commit, it does affect
established sockets but doesn't behave the way we need:
  * It appears to be designed only to shrink the window, not to allow it to
    re-expand.
  * More importantly, that commit has a serious bug where if the
    setsockopt() is made when the existing kernel advertised window for the
    socket happens to be zero, it will now become locked at zero, stopping
    any further data from being received on the socket.

Since this has never worked as intended, simply remove it.  It might be
possible to re-implement the intended behaviour by manipulating SO_RCVBUF,
so we leave a comment to that effect.

This kernel bug is the underlying cause of both the linked passt bug and
the linked podman bug.  We attempted to fix this before with passt commit
d3192f67 ("tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag").
However while that commit masked the bug for some cases, it didn't really
address the problem.

Fixes: d3192f67c4 ("tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag")
Link: https://github.com/containers/podman/issues/20170
Link: https://bugs.passt.top/show_bug.cgi?id=74
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-10 06:42:19 +01:00
David Gibson
930bc3b0f2 tcp: Rename and small cleanup to tcp_clamp_window()
tcp_clamp_window() is _mostly_ about using TCP_WINDOW_CLAMP to control the
sock side advertised window, but it is also responsible for actually
updating the conn->wnd_from_tap value.

Rename to tcp_tap_window_update() to reflect that broader purpose, and pull
the logic that's not TCP_WINDOW_CLAMP related out to the front.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-10 06:42:10 +01:00
David Gibson
5972203174 log: Enable format warnings
logmsg() takes printf like arguments, but because it's not a built in, the
compiler won't generate warnings if the format string and parameters don't
match.  Enable those by using the format attribute.

Strictly speaking this is a gcc extension, but I believe it is also
supported by some other common compilers.  We already use some other
attributes in various places.  For now, just use it and we can worry about
compilers that don't support it if it comes up.

This exposes some warnings from existing callers, both in gcc and in
clang-tidy:
 - Some are straight out bugs, which we correct
 - It's occasionally useful to invoke the logging functions with an empty
   string, which gcc objects to, so disable that specific warning in the
   Makefile
 - Strictly speaking the C standard requires that the parameter for a %p
   be a (void *), not some other pointer type.  That's only likely to cause
   problems in practice on weird architectures with different sized
   representations for pointers to different types.  Nonetheless add the
   casts to make it happy.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:54:56 +01:00
Laurent Vivier
0ad54e1043 tcp: Remove remaining declaration of tcp_l2_mh
Use of tcp_l2_mh has been removed in commit 38fbfdbcb9,
but its declaration and initialization are always in the code.
Remove them as they are useless.

Fixes: 38fbfdbcb9 ("tcp: Get rid of iov with cached MSS, drop sendmmsg(), add deferred flush")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:54:23 +01:00
David Gibson
f6d8dc2355 pif: Pass originating pif to tap handler functions
For now, packets passed to the various *_tap_handler() functions always
come from the single "tap" interface.  We want to allow the possibility to
broaden that in future.  As preparation for that, have the code in tap.c
pass the pif id of the originating interface to each of those handler
functions.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:53:45 +01:00
David Gibson
732e249376 pif: Record originating pif in listening socket refs
For certain socket types, we record in the epoll ref whether they're
sockets in the namespace, or on the host.  We now have the notion of "pif"
to indicate what "place" a socket is associated with, so generalise the
simple one-bit 'ns' to a pif id.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:53:41 +01:00
David Gibson
c09d0d0f60 port_fwd: Simplify get_bound_ports_*() to port_fwd_scan_*()
get_bound_ports_*() now only use their context and ns parameters to
determine which forwarding maps they're operating on.  Each function needs
the map they're actually updating, as well as the map for the other
direction, to avoid creating forwarding loops.  The UDP function also
requires the corresponding TCP map, to implement the behaviour where we
forward UDP ports of the same number as bound TCP ports for tools like
iperf3.

Passing those maps directly as parameters simplifies the code without
making the callers life harder, because those already know the relevant
maps.  IMO, invoking these functions in terms of where they're looking for
updated forwarding also makes more logical sense than in terms of where
they're looking for bound ports.  Given that new way of looking at the
functions, also rename them to port_fwd_scan_*().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:53:31 +01:00
David Gibson
1a40d00895 port_fwd: Split TCP and UDP cases for get_bound_ports()
Currently get_bound_ports() takes a parameter to determine if it scans for
UDP or TCP bound ports, but in fact there's almost nothing in common
between those two paths.  The parameter appears primarily to have been
a convenience for when we needed to invoke this function via NS_CALL().

Now that we don't need that, split it into separate TCP and UDP versions.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:53:26 +01:00
David Gibson
180dbc957a port_fwd: Don't NS_CALL get_bound_ports()
When we want to scan for bound ports in the namespace we use NS_CALL() to
run get_bound_ports() in the namespace.  However, the only thing it
actually needed to be in the namespace for was to open the /proc/net file
it was scanning.  Since we now always pre-open those, we no longer need
to switch to the namespace for the actual get_bound_ports() calls.

That in turn means that tcp_port_detect() doesn't need to run in the ns
either, and we can just replace it with inline calls to get_bound_ports().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:53:24 +01:00
David Gibson
e90f2770ae port_fwd: Move automatic port forwarding code to port_fwd.[ch]
The implementation of scanning /proc files to do automatic port forwarding
is a bit awkwardly split between procfs_scan_listen() in util.c,
get_bound_ports() and related functions in conf.c and the initial setup
code in conf().

Consolidate all of this into port_fwd.h, which already has some related
definitions, and a new port_fwd.c.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:53:14 +01:00
Stefano Brivio
a469fc393f tcp, tap: Don't increase tap-side sequence counter for dropped frames
...so that we'll retry sending them, instead of more-or-less silently
dropping them. This happens quite frequently if our sending buffer on
the UNIX domain socket is heavily constrained (for instance, by the
208 KiB default memory limit).

It might be argued that dropping frames is part of the expected TCP
flow: we don't dequeue those from the socket anyway, so we'll
eventually retransmit them.

But we don't need the receiver to tell us (by the way of duplicate or
missing ACKs) that we couldn't send them: we already know as
sendmsg() reports that. This seems to considerably increase
throughput stability and throughput itself for TCP connections with
default wmem_max values.

Unfortunately, the 16 bits left as padding in the frame descriptors
we use internally aren't enough to uniquely identify for which
connection we should update sequence numbers: create a parallel
array of pointers to sequence numbers and L4 lengths, of
TCP_FRAMES_MEM size, and go through it after calling sendmsg().

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-10-04 23:39:58 +02:00
Stefano Brivio
d3192f67c4 tcp: Force TCP_WINDOW_CLAMP before resetting STALLED flag
It looks like we need it as workaround for this situation, readily
reproducible at least with a 6.5 Linux kernel, with default rmem_max
and wmem_max values:

- an iperf3 client on the host sends about 160 KiB, typically
  segmented into five frames by passt. We read this data using
  MSG_PEEK

- the iperf3 server on the guest starts receiving

- meanwhile, the host kernel advertised a zero-sized window to the
  sender, as expected

- eventually, the guest acknowledges all the data sent so far, and
  we drop it from the buffer, courtesy of tcp_sock_consume(), using
  recv() with MSG_TRUNC

- the client, however, doesn't get an updated window value, and
  even keepalive packets are answered with zero-window segments,
  until the connection is closed

It looks like dropping data from a socket using MSG_TRUNC doesn't
cause a recalculation of the window, which would be expected as a
result of any receiving operation that invalidates data on a buffer
(that is, not with MSG_PEEK).

Strangely enough, setting TCP_WINDOW_CLAMP via setsockopt(), even to
the previous value we clamped to, forces a recalculation of the
window which is advertised to the sender.

I couldn't quite confirm this issue by following all the possible
code paths in the kernel, yet. If confirmed, this should be fixed in
the kernel, but meanwhile this workaround looks robust to me (and it
will be needed for backward compatibility anyway).

Reported-by: Matej Hrica <mhrica@redhat.com>
Link: https://bugs.passt.top/show_bug.cgi?id=74
Analysed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-10-04 23:27:15 +02:00
Stefano Brivio
feaeb4986c tcp: Fix comment to tcp_sock_consume()
Note that tcp_sock_consume() doesn't update ACK sequence counters
anymore.

Fixes: cc6d8286d1 ("tcp: Reset ACK_FROM_TAP_DUE flag only as needed, update timer")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-10-04 23:24:08 +02:00
David Gibson
117b474f85 cppcheck: Work around bug in cppcheck 2.12.0
cppcheck 2.12.0 (and maybe some other versions) things this if condition
is always true, which is demonstrably not true.  Work around the bug for
now.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-10-04 23:24:05 +02:00
David Gibson
6471c7d01b cppcheck: Make many pointers const
Newer versions of cppcheck (as of 2.12.0, at least) added a warning for
pointers which could be declared to point at const data, but aren't.
Based on that, make many pointers throughout the codebase const.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-10-04 23:23:35 +02:00
David Gibson
fc8f0f8c48 siphash: Use incremental rather than all-at-once siphash functions
We have a bunch of variants of the siphash functions for different data
sizes.  The callers, in tcp.c, need to pack the various values they want to
hash into a temporary structure, then call the appropriate version.  We can
avoid the copy into the temporary by directly using the incremental
siphash functions.

The length specific hash functions also have an undocumented constraint
that the data pointer they take must, in fact, be aligned to avoid
unaligned accesses, which may cause crashes on some architectures.

So, prefer the incremental approach and remove the length-specific
functions.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-30 12:40:53 +02:00
David Gibson
ca6e94702c siphash: Make siphash functions consistently return 64-bit results
Some of the siphas_*b() functions return 64-bit results, others 32-bit
results, with no obvious pattern.  siphash_32b() also appears to do this
incorrectly - taking the 64-bit hash value and simply returning it
truncated, rather than folding the two halves together.

Since SipHash proper is defined to give a 64-bit hash, make all of them
return 64-bit results.  In the one caller which needs a 32-bit value,
tcp_seq_init() do the fold down to 32-bits ourselves.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-30 12:40:30 +02:00
David Gibson
c1d2a070f2 util: Consolidate and improve workarounds for clang-tidy issue 58992
We have several workarounds for a clang-tidy bug where the checker doesn't
recognize that a number of system calls write to - and therefore initialise
- a socket address.  We can't neatly use a suppression, because the bogus
warning shows up some time after the actual system call, when we access
a field of the socket address which clang-tidy erroneously thinks is
uninitialised.

Consolidate these workarounds into one place by using macros to implement
wrappers around affected system calls which add a memset() of the sockaddr
to silence clang-tidy.  This removes the need for the individual memset()
workarounds at the callers - and the somewhat longwinded explanatory
comments.

We can then use a #define to not include the hack in "real" builds, but
only consider it for clang-tidy.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-27 17:26:06 +02:00
David Gibson
5b6c68c2e4 Avoid shadowing index(3)
A classic gotcha of the standard C library is that its unwise to call any
variable 'index' because it will shadow the standard string library
function index(3).  This can cause warnings from cppcheck amongst others,
and it also means that if the variable is removed you tend to get confusing
type errors (or sometimes nothing at all) instead of a nice simple "name is
not defined" error.

Strictly speaking this only occurs if <string.h> is included, but that
is so common that as a rule it's best to just avoid it always.  We
have a number of places which hit this trap, so rename variables and
parameters to avoid it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-27 17:25:51 +02:00
Stefano Brivio
9178a9e346 tcp: Always send an ACK segment once the handshake is completed
The reporter is running a SMTP server behind pasta, and the client
waits for the server's banner before sending any data. In turn, the
server waits for our ACK after sending SYN,ACK, which never comes.

If we use the ACK_IF_NEEDED indication to tcp_send_flag(), given that
there's no pending data, we delay sending the ACK segment at the end
of the three-way handshake until we have some data to send to the
server.

This was actually intended, as I thought we would lower the latency
for new connections, but we can't assume that the client will start
sending data first (SMTP is the typical example where this doesn't
happen).

And, trying out this patch with SSH (where the client starts sending
data first), the reporter actually noticed we have a lower latency
by forcing an ACK right away. Comparing a capture before the patch:

13:07:14.007704 IP 10.1.2.1.42056 > 10.1.2.140.1234: Flags [S], seq 1797034836, win 65535, options [mss 4096,nop,wscale 7], length 0
13:07:14.007769 IP 10.1.2.140.1234 > 10.1.2.1.42056: Flags [S.], seq 2297052481, ack 1797034837, win 65480, options [mss 65480,nop,wscale 7], length 0
13:07:14.008462 IP 10.1.2.1.42056 > 10.1.2.140.1234: Flags [.], seq 1:22, ack 1, win 65535, length 21
13:07:14.008496 IP 10.1.2.140.1234 > 10.1.2.1.42056: Flags [.], ack 22, win 512, length 0
13:07:14.011799 IP 10.1.2.140.1234 > 10.1.2.1.42056: Flags [P.], seq 1:515, ack 22, win 512, length 514

and after:

13:10:26.165364 IP 10.1.2.1.59508 > 10.1.2.140.1234: Flags [S], seq 4165939595, win 65535, options [mss 4096,nop,wscale 7], length 0
13:10:26.165391 IP 10.1.2.140.1234 > 10.1.2.1.59508: Flags [S.], seq 985607380, ack 4165939596, win 65480, options [mss 65480,nop,wscale 7], length 0
13:10:26.165418 IP 10.1.2.1.59508 > 10.1.2.140.1234: Flags [.], ack 1, win 512, length 0
13:10:26.165683 IP 10.1.2.1.59508 > 10.1.2.140.1234: Flags [.], seq 1:22, ack 1, win 512, length 21
13:10:26.165698 IP 10.1.2.140.1234 > 10.1.2.1.59508: Flags [.], ack 22, win 512, length 0
13:10:26.167107 IP 10.1.2.140.1234 > 10.1.2.1.59508: Flags [P.], seq 1:515, ack 22, win 512, length 514

the latency between the initial SYN segment and the first data
transmission actually decreases from 792µs to 334µs. This is not
statistically relevant as we have a single measurement, but it can't
be that bad, either.

Reported-by: cr3bs (from IRC)
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-09-27 17:25:30 +02:00
David Gibson
46f915ddee tcp: Correct handling of FIN,ACK followed by SYN
When the guest tries to establish a connection, it could give up on it by
sending a FIN,ACK instead of a plain ACK to our SYN,ACK.  It could then
make a new attempt to establish a connection with the same addresses and
ports with a new SYN.

Although it's unlikely, it could send the 2nd SYN very shortly after the
FIN,ACK resulting in both being received in the same batch of packets from
the tap interface.

Currently, we don't handle that correctly, when we receive a FIN,ACK on a
not fully established connection we discard the remaining packets in the
batch, and so will never process the 2nd SYN.  Correct this by returning
1 from tcp_tap_handler() in this case, so we'll just consume the FIN,ACK
and continue to process the rest of the batch.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-08 09:16:22 +02:00
David Gibson
b3f2210b05 tcp: Consolidate paths where we initiate reset on tap interface
There are a number of conditions where we will issue a TCP RST in response
to something unexpected we received from the tap interface.  These occur in
both tcp_data_from_tap() and tcp_tap_handler().  In tcp_tap_handler() use
a 'goto out of line' technique to consolidate all these paths into one
place.  For the tcp_data_from_tap() cases use a negative return code and
direct that to the same path in tcp_tap_handler(), its caller.

In this case we want to discard all remaining packets in the batch we have
received: even if they're otherwise good, they'll be invalidated when the
guest receives the RST we're sending.  This is subtly different from the
case where we *receive* an RST, where we could in theory get a new SYN
immediately afterwards.  Clarify that with a common on the now common
reset path.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-08 09:16:19 +02:00
David Gibson
f984003fdf tcp: Correctly handle RST followed rapidly by SYN
Although it's unlikely in practice, the guest could theoretically
reset one TCP connection then immediately start a new one with the
same addressses and ports, such that we get an RST then a SYN in the
same batch of received packets in tcp_tap_handler().

We don't correctly handle that unlikely case, because when we receive
the RST, we discard any remaining packets in the batch so we'd never
see the SYN.  This could happen in either tcp_tap_handler() or
tcp_data_from_tap().  Correct that by returning 1, so that the caller
will continue calling tcp_tap_handler() on subsequent packets allowing
us to process any subsequent SYN.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-08 09:16:17 +02:00
David Gibson
60d3915ea3 tcp: Return consumed packet count from tcp_data_from_tap()
Currently tcp_data_from_tap() is assumed to consume all packets remaining
in the packet pool it is given.  However there are some edge cases where
that's not correct.  In preparation for fixing those, change it to return
a count of packets consumed and use that in its caller.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-08 09:16:13 +02:00
David Gibson
5fb376de6e tcp: Never hash match closed connections
>From a practical point of view, when a TCP connection ends, whether by
FIN or by RST, we set the CLOSED event, then some time later we remove the
connection from the hash table and clean it up.  However, from a protocol
point of view, once it's closed, it's gone, and any new packets with
matching addresses and ports are either forming a new connection, or are
invalid packets to discard.

Enforce these semantics in the TCP hash logic by never hash matching closed
connections.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-08 09:16:10 +02:00
David Gibson
805dd109a4 tcp: Remove some redundant packet_get() operations
Both tcp_data_from_tap() and tcp_tap_handler() call packet_get() to get
the entire L4 packet length, then immediately call it again to check that
the packet is long enough to include a TCP header.  The features of
packet_get() let us easily combine these together, we just need to adjust
the length slightly, because we want the value to include the TCP header
length.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-08 09:16:07 +02:00
David Gibson
043a70b885 tcp, tap: Correctly advance through packets in tcp_tap_handler()
In both tap4_handler() and tap6_handler(), once we've sorted incoming l3
packets into "sequences", we then step through all the packets in each TCP
sequence calling tcp_tap_handler().  Or so it appears.

In fact, tcp_tap_handler() doesn't take an index and always looks at packet
0 of the sequence, except when it calls tcp_data_from_tap() to process
data packets.  It appears to be written with the idea that the struct pool
is a queue, from which it consumes packets as it processes them, but that's
not how the pool data structure works - they are more like an array of
packets.

We only get away with this, because setup packets for TCP tend to come in
separate batches (because we need to reply in between) and so we only get
a bunch of packets for the same connection together when they're data
packets (tcp_data_from_tap() has its own loop through packets).

Correct this by adding an index parameter to tcp_tap_handler() and altering
the loops in tap.c to step through the pool properly.

Link: https://bugs.passt.top/show_bug.cgi?id=68
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-09-08 09:15:46 +02:00
David Gibson
69303cafbe tcp: Remove broken pressure calculations for tcp_defer_handler()
tcp_defer_handler() performs a potentially expensive linear scan of the
connection table.  So, to mitigate the cost of that we skip if if we're not
under at least moderate pressure: either 30% of available connections or
30% (estimated) of available fds used.

But, the calculation for this has been broken since it was introduced: we
calculate "max_conns" based on c->tcp.conn_count, not TCP_MAX_CONNS,
meaning we only exit early if conn_count is less than 30% of itself, i.e.
never.

If that calculation is "corrected" to be based on TCP_MAX_CONNS, it
completely tanks the TCP CRR times for passt - from ~60ms to >1000ms on my
laptop.  My guess is that this is because in the case of many short lived
connections, we're letting the table become much fuller before compacting
it.  That means that other places which perform a table scan now have to
do much, much more.

For the time being, simply remove the tests, since they're not doing
anything useful.  We can reintroduce them more carefully if we see a need
for them.

This also removes the only user of c->tcp.splice_conn_count, so that can
be removed as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-22 12:15:41 +02:00
David Gibson
b60fa33eea tcp: Move in_epoll flag out of common connection structure
The in_epoll boolean is one of only two fields (currently) in the common
structure shared between tap and spliced connections.  It seems like it
belongs there, because both tap and spliced connections use it, and it has
roughly the same meaning.

Roughly, however, isn't exactly: which fds this flag says are in the epoll
varies between the two connection types, and are in type specific fields.
So, it's only possible to meaningfully use this value locally in type
specific code anyway.

This common field is going to get in the way of more widespread
generalisation of connection / flow tracking, so move it to separate fields
in the tap and splice specific structures.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-22 12:15:36 +02:00
David Gibson
955dd3251c tcp, udp: Don't pre-fill IPv4 destination address in headers
Because packets sent on the tap interface will always be going to the
guest/namespace, we more-or-less know what address they'll be going to.  So
we pre-fill this destination address in our header buffers for IPv4.  We
can't do the same for IPv6 because we could need either the global or
link-local address for the guest.  In future we're going to want more
flexibility for the destination address, so this pre-filling will get in
the way.

Change the flow so we always fill in the IPv4 destination address for each
packet, rather than prefilling it from proto_update_l2_buf().  In fact for
TCP we already redundantly filled the destination for each packet anyway.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-22 12:15:33 +02:00
David Gibson
5bf200ae8a tcp, udp: Don't include destination address in partially precomputed csums
We partially prepopulate IP and TCP header structures including, amongst
other things the destination address, which for IPv4 is always the known
address of the guest/namespace.  We partially precompute both the IPv4
header checksum and the TCP checksum based on this.

In future we're going to want more flexibility with controlling the
destination for IPv4 (as we already do for IPv6), so this precomputed value
gets in the way.  Therefore remove the IPv4 destination from the
precomputed checksum and fold it into the checksum update when we actually
send a packet.

Doing this means we no longer need to recompute those partial sums when
the destination address changes ({tcp,udp}_update_l2_buf()) and instead
the computation can be moved to compile time.  This means while we perform
slightly more computations on each packet, we slightly reduce the amount of
memory we need to access.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-22 12:15:30 +02:00
David Gibson
8aa32009ed tcp: Consistent usage of ports in tcp_seq_init()
In tcp_seq_init() the meaning of "src" and "dst" isn't really clear since
it's used for connections in both directions.  However, these values are
just feeding a hash, so as long as we're consistent and include all the
information we want, it doesn't really matter.

Oddly, for the "src" side we supply the (tap side) forwarding address but
the (tap side) endpoint port.  This again doesn't really matter, but it's
confusing.  So swap this with dstport, so "src" is always forwarding
and "dst" is always endpoint.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-22 12:15:27 +02:00
David Gibson
bccfbff193 tcp: More precise terms for addresses and ports
In a number of places the comments and variable names we use to describe
addresses and ports are ambiguous.  It's not sufficient to describe a port
as "tap-facing" or "socket-facing", because on both the tap side and the
socket side there are two ports for the two ends of the connection.
Similarly, "local" and "remote" aren't particularly helpful, because it's
not necessarily clear whether we're talking from the point of view of the
guest/namespace, the host, or passt itself.

This patch makes a number of changes to be more precise about this.  It
introduces two new terms in aid of this:
    A "forwarding" address (or port) refers to an address which is local
from the point of view of passt itself.  That is a source address for
traffic sent by passt, whether it's to the guest via the tap interface
or to a host on the internet via a socket.
    The "endpoint" address (or port) is the reverse: a remote address
from passt's point of view, the destination address for traffic sent
by passt.

Between them the "side" (either tap/guest-facing or sock/host-facing)
and forwarding vs. endpoint unambiguously describes which address or
port we're talking about.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-22 12:15:24 +02:00
David Gibson
cee4a2da48 tap: Pass source address to protocol handler functions
The tap code passes the IPv4 or IPv6 destination address of packets it
receives to the protocol specific code.  Currently that protocol code
doesn't use the source address, but we want it to in future.  So, in
preparation, pass the IPv4/IPv6 source address of tap packets to those
functions as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-22 12:15:21 +02:00
David Gibson
485b5fb8f9 epoll: Split handling of listening TCP sockets into their own handler
tcp_sock_handler() handles both listening TCP sockets, and connected TCP
sockets, but what it needs to do in those cases has essentially nothing in
common.  Therefore, give listening sockets their own epoll_type value and
dispatch directly to their own handler from the top level.  Furthermore,
the two handlers need essentially entirely different information from the
reference: we re-(ab)used the index field in the tcp_epoll_ref to indicate
the port for the listening socket, but that's not the same meaning.  So,
switch listening sockets to their own reference type which we can lay out
as we please.  That lets us remove the listen and outbound fields from the
normal (connected) tcp_epoll_ref, reducing it to just the connection table
index.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-13 17:30:15 +02:00
David Gibson
e6f81e5578 epoll: Split handling of TCP timerfds into its own handler function
tcp_sock_handler() actually handles several different types of fd events.
This includes timerfds that aren't sockets at all.  The handling of these
has essentially nothing in common with the other cases.  So, give the
TCP timers there own epoll_type value and dispatch directly to their
handler.  This also means we can remove the timer field from tcp_epoll_ref,
the information it encoded is now implicit in the epoll_type value.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-13 17:30:13 +02:00
David Gibson
3401644453 epoll: Generalize epoll_ref to cover things other than sockets
The epoll_ref type includes fields for the IP protocol of a socket, and the
socket fd.  However, we already have a few things in the epoll which aren't
protocol sockets, and we may have more in future.  Rename these fields to
an abstract "fd type" and file descriptor for more generality.

Similarly, rather than using existing IP protocol numbers for the type,
introduce our own number space.  For now these just correspond to the
supported protocols, but we'll expand on that in future.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-13 17:29:51 +02:00
David Gibson
8218d99013 Use C11 anonymous members to make poll refs less verbose to use
union epoll_ref has a deeply nested set of structs and unions to let us
subdivide it into the various different fields we want.  This means that
referencing elements can involve an awkward long string of intermediate
fields.

Using C11 anonymous structs and unions lets us do this less clumsily.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-04 01:17:57 +02:00
Stefano Brivio
ca2749e1bd passt: Relicense to GPL 2.0, or any later version
In practical terms, passt doesn't benefit from the additional
protection offered by the AGPL over the GPL, because it's not
suitable to be executed over a computer network.

Further, restricting the distribution under the version 3 of the GPL
wouldn't provide any practical advantage either, as long as the passt
codebase is concerned, and might cause unnecessary compatibility
dilemmas.

Change licensing terms to the GNU General Public License Version 2,
or any later version, with written permission from all current and
past contributors, namely: myself, David Gibson, Laine Stump, Andrea
Bolognani, Paul Holzinger, Richard W.M. Jones, Chris Kuhn, Florian
Weimer, Giuseppe Scrivano, Stefan Hajnoczi, and Vasiliy Ulyanov.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-04-06 18:00:33 +02:00
Stefano Brivio
33d88f79d9 tcp: Clear ACK_FROM_TAP_DUE also on unchanged ACK sequence from peer
Since commit cc6d8286d1 ("tcp: Reset ACK_FROM_TAP_DUE flag only as
needed, update timer"), we don't clear ACK_FROM_TAP_DUE whenever we
process an ACK segment, but, more correctly, only if we're really not
waiting for a further ACK segment, that is, only if the acknowledged
sequence matches what we sent.

In the new function implementing this, tcp_update_seqack_from_tap(),
we also reset the retransmission counter and store the updated ACK
sequence. Both should be done iff forward progress is acknowledged,
implied by the fact that the new ACK sequence is greater than the
one we previously stored.

At that point, it looked natural to also include the statements that
clear and set the ACK_FROM_TAP_DUE flag inside the same conditional
block: if we're not making forward progress, the need for an ACK, or
lack thereof, should remain unchanged.

There might be cases where this isn't true, though: without the
previous commit 4e73e9bd65 ("tcp: Don't special case the handling
of the ack of a syn"), this would happen if a tap-side client
initiated a connection, and the server didn't send any data.

At that point we would never, in the established state of the
connection, call tcp_update_seqack_from_tap() with reported forward
progress.

That issue itself is fixed by the previous commit, now, but clearing
ACK_FROM_TAP_DUE only on ACK sequence progress doesn't really follow
any logic.

Clear the ACK_FROM_TAP_DUE flag regardless of reported forward
progress. If we clear it when it's already unset, conn_flag() will do
nothing with it.

This doesn't fix any known functional issue, rather a conceptual one.

Fixes: cc6d8286d1 ("tcp: Reset ACK_FROM_TAP_DUE flag only as needed, update timer")
Reported-by: David Gibson <david@gibson.dropbear.id.au>
Analysed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-03-29 13:47:17 +02:00
David Gibson
4e73e9bd65 tcp: Don't special case the handling of the ack of a syn
TCP treats the SYN packets as though they occupied 1 byte in the logical
data stream described by the sequence numbers.  That is, the very first ACK
(or SYN-ACK) each side sends should acknowledge a sequence number one
greater than the initial sequence number given in the SYN or SYN-ACK it's
responding to.

In passt we were tracking that by advancing conn->seq_to_tap by one when
we send a SYN or SYN-ACK (in tcp_send_flag()).  However, we also
initialized conn->seq_ack_from_tap, representing the acks we've already
seen from the tap side, to ISN+1, meaning we treated it has having
acknowledged the SYN before it actually did.

There were apparently reasons for this in earlier versions, but it causes
problems now.  Because of this when we actually did receive the initial ACK
or SYN-ACK, we wouldn't see the acknoweldged serial number as advancing,
and so wouldn't clear the ACK_FROM_TAP_DUE flag.

In most cases we'd get away because subsequent packets would clear the
flag.  However if one (or both) sides didn't send any data, the other side
would (correctly) keep sending ISN+1 as the acknowledged sequence number,
meaning we would never clear the ACK_FROM_TAP_DUE flag.  That would mean
we'd treat the connection as if we needed to retransmit (although we had
0 bytes to retransmit), and eventaully (after around 30s) reset the
connection due to too many retransmits.  Specifically this could cause the
iperf3 throughput tests in the testsuite to fail if set for a long enough
test period.

Correct this by initializing conn->seq_ack_from_tap to the ISN and only
advancing it when we actually get the first ACK (or SYN-ACK).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-03-29 13:47:07 +02:00
David Gibson
085672f77c tcp: Clarify allowed state for tcp_data_from_tap()
Comments suggest that this should only be called for an ESTABLISHED
connection.  However, it's non-trivial to ascertain that from the actual
control flow in the caller.  Add an ASSERT() to make it very clear that
this is only called in ESTABLISHED state.

In fact, there were some circumstances where it could be called on a CLOSED
connection.  In a sense that is "established", but with that assert this
does require specific (trivial) handling to avoid a spurious abort().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-03-29 13:46:58 +02:00
Stefano Brivio
1ee2f7cada tcp: Don't reset ACK_TO_TAP_DUE on any ACK, reschedule timer as needed
This is mostly symmetric with commit cc6d8286d1 ("tcp: Reset
ACK_FROM_TAP_DUE flag only as needed, update timer"): we shouldn't
reset the ACK_TO_TAP_DUE flag on any inbound ACK segment, but only
once we acknowledge everything we received from the guest or the
container.

If we don't, a client might unnecessarily hold off further data,
especially during slow start, and in general we won't converge to the
usable bandwidth.

This is very visible especially with traffic tests on links with
non-negligible latency, such as in the reported issue. There, a
public iperf3 server sometimes aborts the test due do what appears to
be a low iperf3's --rcv-timeout (probably less than a second). Even
if this doesn't happen, the throughput will converge to a fraction of
the usable bandwidth.

Clear ACK_TO_TAP_DUE if we acknowledged everything, set it if we
didn't, and reschedule the timer in case the flag is still set as the
timer expires.

While at it, decrease the ACK timer interval to 10ms.

A 50ms interval is short enough for any bandwidth-delay product I had
in mind (local connections, or non-local connections with limited
bandwidth), but here I am, testing 1gbps transfers to a peer with
100ms RTT.

Indeed, we could eventually make the timer interval dependent on the
current window and estimated bandwidth-delay product, but at least
for the moment being, 10ms should be long enough to avoid any
measurable syscall overhead, yet usable for any real-world
application.

Reported-by: Lukas Mrtvy <lukas.mrtvy@gmail.com>
Link: https://bugs.passt.top/show_bug.cgi?id=44
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-03-21 23:14:58 +01:00
Stefano Brivio
9ffccf7acc tcp: When a connection flag it set, don't negate it for debug print
Fix a copy and paste typo I added in commit 5474bc5485 ("tcp,
tcp_splice: Get rid of false positive CWE-394 Coverity warning from
fls()") and --debug altogether.

Fixes: 5474bc5485 ("tcp, tcp_splice: Get rid of false positive CWE-394 Coverity warning from fls()")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-03-21 19:39:55 +01:00