Move all the TCP parts using internal buffers to tcp_buf.c
and keep generic TCP management functions in tcp.c.
Add tcp_internal.h to export needed functions from tcp.c and
tcp_buf.h from tcp_buf.c
With this change we can use existing TCP functions with a
different kind of memory storage as for instance the shared
memory provided by the guest via vhost-user.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This commit isolates the internal data structure management used for storing
data (e.g., tcp4_l2_flags_iov[], tcp6_l2_flags_iov[], tcp4_flags_ip[],
tcp4_flags[], ...) from the tcp_send_flag() function. The extracted
functionality is relocated to a new function named tcp_fill_flag_header().
tcp_fill_flag_header() is now a generic function that accepts parameters such
as struct tcphdr and a data pointer. tcp_send_flag() utilizes this parameter to
pass memory pointers from tcp4_l2_flags_iov[] and tcp6_l2_flags_iov[].
This separation sets the stage for utilizing tcp_prepare_flags() to
set the memory provided by the guest via vhost-user in future developments.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We have several functions which are used as callbacks for NS_CALL() which
only read their void * parameter, they don't write it. The
constParameterCallback warning in cppcheck 2.14.1 complains that this
parameter could be const void *, also pointing out that that would require
casting the function pointer when used as a callback.
Casting the function pointers seems substantially uglier than using a
non-const void * as the parameter, especially since in each case we cast
the void * to a const pointer of specific type immediately. So, suppress
these errors.
I think it would make logical sense to suppress this globally, but that
would cause unmatchedSuppression errors on earlier cppcheck versions. So,
instead individually suppress it, along with unmatchedSuppression in the
relevant places.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Flow table entries need to be fully initialised before returning to the
main epoll loop. Commit 0060acd1 ("flow: Clarify and enforce flow state
transitions") now enforces that: once a flow is allocated we must either
cancel it, or activate it before returning to the main loop, or we will hit
an ASSERT().
Some error paths in tcp_conn_from_tap() weren't correctly updated for this
requirement - we can exit with a flow entry incompletely initialised.
Correct that by cancelling the flows in those situations.
I don't have enough information to be certain if this is the cause for
podman bug 22925, but it plausibly could be.
Fixes: 0060acd11b ("flow: Clarify and enforce flow state transitions")
Link: https://github.com/containers/podman/issues/22925
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We globally disabled this, with a justification lumped together with
several checks about braces. They don't really go together, the others
are essentially a stylistic choice which doesn't match our style. Omitting
brackets on macro parameters can lead to real and hard to track down bugs
if an expression is ever passed to the macro instead of a plain identifier.
We've only gotten away with the macros which trigger the warning, because
of other conventions its been unlikely to invoke them with anything other
than a simple identifier. Fix the macros, and enable the warning for the
future.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The th pointer could be const, which causes a cppcheck warning on at least
some cppcheck versions (e.g. Cppcheck 2.13.0 in Fedora 40).
Fixes: e84a01e94c ("tcp: move seq_to_tap update to when frame is queued")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
commit a469fc393f ("tcp, tap: Don't increase tap-side sequence counter for dropped frames")
delayed update of conn->seq_to_tap until the moment the corresponding
frame has been successfully pushed out. This has the advantage that we
immediately can make a new attempt to transmit a frame after a failed
trasnmit, rather than waiting for the peer to later discover a gap and
trigger the fast retransmit mechanism to solve the problem.
This approach has turned out to cause a problem with spurious sequence
number updates during peer-initiated retransmits, and we have realized
it may not be the best way to solve the above issue.
We now restore the previous method, by updating the said field at the
moment a frame is added to the outqueue. To retain the advantage of
having a quick re-attempt based on local failure detection, we now scan
through the part of the outqueue that had do be dropped, and restore the
sequence counter for each affected connection to the most appropriate
value.
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We recently introduced this field to keep track of which side of a TCP flow
is the guest/tap facing one. Now that we generically record which pif each
side of each flow is connected to, we can easily derive that, and no longer
need to keep track of it explicitly.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we have no generic information flows apart from the type and
state, everything else is specific to the flow type. Start introducing
generic flow information by recording the pifs which the flow connects.
To keep track of what information is valid, introduce new flow states:
INI for when the initiating side information is complete, and TGT for
when both sides information is complete, but we haven't chosen the
flow type yet. For now, these states don't do an awful lot, but
they'll become more important as we add more generic information.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Each flow in the flow table has two sides, 0 and 1, representing the
two interfaces between which passt/pasta will forward data for that flow.
Which side is which is currently up to the protocol specific code: TCP
uses side 0 for the host/"sock" side and 1 for the guest/"tap" side, except
for spliced connections where it uses 0 for the initiating side and 1 for
the target side. ICMP also uses 0 for the host/"sock" side and 1 for the
guest/"tap" side, but in its case the latter is always also the initiating
side.
Make this generically consistent by always using side 0 for the initiating
side and 1 for the target side. This doesn't simplify a lot for now, and
arguably makes TCP slightly more complex, since we add an extra field to
the connection structure to record which is the guest facing side. This is
an interim change, which we'll be able to remove later.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>q
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Flows move over several different states in their lifetime. The rules for
these are documented in comments, but they're pretty complex and a number
of the transitions are implicit, which makes this pretty fragile and
error prone.
Change the code to explicitly track the states in a field. Make all
transitions explicit and logged. To the extent that it's practical in C,
enforce what can and can't be done in various states with ASSERT()s.
While we're at it, tweak the docs to clarify the restrictions on each state
a bit.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This adds some extra inany helpers for comparing an inany address to
addresses of a specific family (including special addresses), and building
an inany from an IPv4 address (either statically or at runtime).
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The flow dispatches deferred and timer handling for flows centrally, but
needs to call into protocol specific code for the handling of individual
flows. Currently this passes a general union flow *. It makes more sense
to pass the specific relevant flow type structure. That brings the check
on the flow type adjacent to casting to the union variant which it tags.
Arguably, this is a slight abstraction violation since it involves the
generic flow code using protocol specific types. It's already calling into
protocol specific functions, so I don't think this really makes any
difference.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When reporting errors, we sometimes want to show a relevant socket address.
Doing so by extracting the various relevant fields can be pretty awkward,
so introduce a sockaddr_ntop() helper to make it simpler. For now we just
have one user in tcp.c, but I have further upcoming patches which can make
use of it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcp_fill_headers[46]() fill most of the headers, but the tap specific
header (the frame length for qemu sockets) is filled in afterwards.
Filling this as well:
* Removes a little redundancy between the tcp_send_flag() and
tcp_data_to_tap() path
* Makes calculation of the correct length a little easier
* Removes the now misleadingly named 'vnet_len' variable in
tcp_send_flag()
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Laurent's recent changes mean we use IO vectors much more heavily in the
TCP code. In many of those cases, and few others around the code base,
individual iovs of these vectors are constructed to exactly cover existing
variables or fields. We can make initializing such iovs shorter and
clearer with a macro for the purpose.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Recent changes to the TCP code (reworking of the buffer handling) have
meant that it now (again) deals explicitly with the MODE_PASST specific
vnet_len field, instead of using the (partial) abstractions provided by the
tap layer.
The abstractions we had don't work for the new TCP structure, so make some
new ones that do: tap_hdr_iov() which constructs an iovec suitable for
containing (just) the TAP specific header and tap_hdr_update() which
updates it as necessary per-packet.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcp_fill_headers[46]() compute the L3 packet length from the L4 packet
length, then their caller tcp_l2_buf_fill_headers() converts it back to the
L4 packet length. We can just use the L4 length throughout.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>eewwee
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
At various points we need to track the lengths of a packet including or
excluding various different sets of headers. We don't always use the same
variable names for doing so. Worse in some places we use the same name
for different things: e.g. tcp_fill_headers[46]() use ip_len for the
length including the IP headers, but then tcp_send_flag() which calls it
uses it to mean the IP payload length only.
To improve clarity, standardise on these names:
dlen: L4 protocol payload length ("data length")
l4len: plen + length of L4 protocol header
l3len: l4len + length of IPv4/IPv6 header
l2len: l3len + length of L2 (ethernet) header
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
csum_ip4_header() takes the packet length as a network endian value. In
general it's very error-prone to pass non-native-endian values as a raw
integer. It's particularly bad here because this differs from other
checksum functions (e.g. proto_ipv4_header_psum()) which take host native
lengths.
It turns out all the callers have easy access to the native endian value,
so switch it to use host order like everything else.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
To be able to provide pointers to TCP headers and IP headers without
worrying about alignment in the structure, split the structure into
several arrays and point to each part of the frame using an iovec array.
Using iovec also allows us to simply ignore the first entry when the
vnet length header is not needed.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we set ACK on flags packets only when the acknowledged byte
pointer has advanced, or we hadn't previously set a window. This means
in particular that we can send a window update with no ACK flag, which
doesn't appear to be correct. RFC 9293 requires a receiver to ignore such
a packet [0], and indeed it appears that every non-SYN, non-RST packet
should have the ACK flag.
The reason for the existing logic, rather than always forcing an ACK seems
to be to avoid having the packet mistaken as a duplicate ACK which might
trigger a fast retransmit. However, earlier tests in the function mean we
won't reach here if we don't have either an advance in the ack pointer -
which will already set the ACK flag, or a window update - which shouldn't
trigger a fast retransmit.
[0] https://www.ietf.org/rfc/rfc9293.html#section-3.10.7.4-2.5.2.1
Link: https://github.com/containers/podman/issues/22146
Link: https://bugs.passt.top/show_bug.cgi?id=84
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcp_send_flag() will sometimes force on the ACK flag for all !SYN packets.
This doesn't make sense for RST packets, where plain RST and RST+ACK have
somewhat different meanings. AIUI, RST+ACK indicates an abrupt end to
a connection, but acknowledges data already sent. Plain RST indicates an
abort, when one end receives a packet that doesn't seem to make sense in
the context of what it knows about the connection. All of the cases where
we send RSTs are the second, so we don't want an ACK flag, but we currently
could add one anyway.
Change that, so we won't add an ACK to an RST unless the caller explicitly
requests it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We have different paths for controlling the ACK flag for the SYN and !SYN
paths. This amounts to sometimes forcing on the ACK flag in the !SYN path
regardless of options. We can rearrange things to explicitly be that which
will make things neater for some future changes.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The DUP_ACK flag to tcp_send_flag() has two effects: first it forces the
setting of the ACK flag in the packet, even if we otherwise wouldn't.
Secondly, it causes a duplicate of the flags packet to be sent immediately
after the first.
Setting the ACK flag to tcp_send_flag() also has the first effect, so
instead of having DUP_ACK also do that, pass both flags when we need both
operations. This slightly simplifies the logic of tcp_send_flag() in a way
that makes some future changes easier.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
These two functions are typically used to calculate values to go into the
iov_base and iov_len fields of a struct iovec. They don't have to be used
for that, though. Rename them in terms of what they actually do: calculate
the base address and total length of the complete frame, including both L2
and tap specific headers.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tap_send_frames() takes a vector of buffers and requires exactly one frame
per buffer. We have future plans where we want to have multiple buffers
per frame in some circumstances, so extend tap_send_frames() to take the
number of buffers per frame as a parameter.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Improve comment to rembufs calculation]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Replace the macro SET_TCP_HEADER_COMMON_V4_V6() by a new function
tcp_fill_header().
Move IPv4 and IPv6 code from tcp_l2_buf_fill_headers() to
tcp_fill_headers4() and tcp_fill_headers6()
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Message-ID: <20240303135114.1023026-10-lvivier@redhat.com>
[dwg: Correct commit message with new function names]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Use ethhdr rather than tap_hdr.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-ID: <20240303135114.1023026-9-lvivier@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The TCP and UDP checksums are computed using the data in the TCP/UDP
payload but also some informations in the IP header (protocol,
length, source and destination addresses).
We add two functions, proto_ipv4_header_psum() and
proto_ipv6_header_psum(), to compute the checksum of the IP
header part.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Message-ID: <20240303135114.1023026-8-lvivier@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We can find the same function to compute the IPv4 header
checksum in tcp.c, udp.c and tap.c
Use the function defined for tap.c, csum_ip4_header(), but
with the code used in tcp.c and udp.c as it doesn't need a fully
initialiazed IPv4 header, only protocol, tot_len, saddr and daddr.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-ID: <20240303135114.1023026-7-lvivier@redhat.com>
[dwg: Fix weird cppcheck regression; it appears to be a problem
in pre-existing code, but somehow this patch is exposing it]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Introduce ip.[ch] file to encapsulate IP protocol handling functions and
structures. Modify various files to include the new header ip.h when
it's needed.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-ID: <20240303135114.1023026-5-lvivier@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently port_fwd.[ch] contains helpers related to port forwarding,
particular automatic port forwarding. We're planning to allow much more
flexible sorts of forwarding, including both port translation and NAT based
on the flow table. This will subsume the existing port forwarding logic,
so rename port_fwd.[ch] to fwd.[ch] with matching updates to all the names
within.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
TCP connections should typically not have wildcard addresses (0.0.0.0
or ::) nor a zero port number for either endpoint. It's not entirely
clear (at least to me) if it's strictly against the RFCs to do so, but
at any rate the socket interfaces often treat those values
specially[1], so it's not really possible to manipulate such
connections. Likewise they should not have broadcast or multicast
addresses for either endpoint.
However, nothing prevents a guest from creating a SYN packet with such
values, and it's not entirely clear what the effect on passt would be.
To ensure sane behaviour, explicitly check for this case and drop such
packets, logging a debug warning (we don't want a higher level,
because that would allow a guest to spam the logs).
We never expect such an address on an accept()ed socket either, but
just in case, check for it as well.
[1] Depending on context as "unknown", "match any" or "kernel, pick
something for me"
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcp_listen_handler() uses the epoll reference for the listening socket
it handles, and also passes on one variant of it to
tcp_tap_conn_from_sock() and tcp_splice_conn_from_sock(). The latter
two functions only need a couple of specific fields from the
reference.
Pass those specific values instead of the whole reference, which
localises the handling of the listening (as opposed to accepted)
socket and its reference entirely within tcp_listen_handler().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This makes several tweaks to improve the logic which decides whether
we're able to use the splice method for a new connection.
* Rather than only calling tcp_splice_conn_from_sock() in pasta mode, we
check for pasta mode within it, better localising the checks.
* Previously if we got a connection from a non-loopback address we'd
always fall back to the "tap" path, even if the connection was on a
socket in the namespace. If we did get a non-loopback address on a
namespace socket, something has gone wrong and the "tap" path certainly
won't be able to handle it. Report the error and close, rather than
passing it along to tap.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Our allocation scheme for flow entries means there are some
non-obvious constraints on when what things can be done with an entry.
Add a big doc comment explaining the life cycle.
In addition, make a FLOW_START() macro to mark one of the important
transitions. This encourages correct usage, by making it natural to
only access the flow type specific structure after calling it. It
also logs that a new flow has been created, which is useful for
debugging.
We also add logging when a flow's lifecycle ends. This doesn't need a
new helper, because it can only happen either from flow_alloc_cancel()
or from the flow deferred handler.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The epoll references for both TCP listening sockets and UDP sockets
includes a port number. This gives the destination port that traffic
to that socket will be sent to on the other side. That will usually
be the same as the socket's bound port, but might not if the -t, -u,
-T or -U options are given with different original and forwarded port
numbers.
As we move towards a more flexible forwarding model for passt, it's
going to become possible for that destination port to vary depending
on more things (for example the source or destination address). So,
it will no longer make sense to have a fixed value for a listening
socket.
Change to simpler semantics where this field in the reference gives
the bound port of the socket. We apply the translations to the
correct destination port later on, when we're actually forwarding.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
There are a number of places where we want to handle either a
sockaddr_in or a sockaddr_in6. In some of those we use a void *,
which works ok and matches some standard library interfaces, but
doesn't give a signature level hint that we're dealing with only
sockaddr_in or sockaddr_in6, not (say) sockaddr_un or another type of
socket address. Other places we use a sockaddr_storage, which also
works, but has the same problem in addition to allocating more on the
stack than we need to.
Introduce union sockaddr_inany to explictly handle this case: it has
variants for sockaddr_in and sockaddr_in6. Use it in a number of
places where it's easy to do so.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Our inany_addr type is used in some places to represent either IPv4 or
IPv6 addresses, and we plan to use it more widely. We don't yet
provide constants of this type for special addresses (loopback and
"any"). Add some of these, both the IPv4 and IPv6 variants of those
addresses, but typed as union inany_addr.
To avoid actually adding more things to .data we can use some macros and
casting to overlay the IPv6 versions of these with the standard library's
in6addr_loopback and in6addr_any. For the IPv4 versions we need to create
new constant globals.
For complicated historical reasons, the standard library doesn't
provide constants for IPv4 loopback and any addresses as struct
in_addr. It just has macros of type in_addr_t == uint32_t, which has
some gotchas w.r.t. endianness. We can use some more macros to
address this lack, using macros to effectively create these IPv4
constants as pieces of the inany constants above.
We use this last to avoid some awkward temporary variables just used
to get an address of an IPv4 loopback address.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
If tcp_sock_refill_pool() gets an error opening new sockets, it stores the
negative errno of that error in the socket pool. This isn't especially
useful:
* It's inconsistent with the initial state of the pool (all -1)
* It's inconsistent with the state of an entry that was valid and was
then consumed (also -1)
* By the time we did anything with this error code, it's now far removed
from the situation in which the error occurred, making it difficult to
report usefully
We now have error reporting closer to when failures happen on the refill
paths, so just leave a pool slot we can't fill as -1.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We maintain pools of ready-to-connect sockets in both the original and
(for pasta) guest namespace to reduce latency when starting new TCP
connections. If we exhaust those pools we have to take a higher
latency path to get a new socket.
Currently we open-code that fallback in the places we need it. To improve
clarity encapsulate that into helper functions. While we're at it, give
those helpers clearer error reporting.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently if tcp_sock_refill_pool() is unable to fill all the slots in the
pool, it will silently exit. This might lead to a later attempt to get
fds from the pool to fail at which point it will be harder to tell what
originally went wrong.
Instead add warnings if we're unable to refill any of the socket pools when
requested. We have tcp_sock_refill_pool() return an error and report it
in the callers, because those callers have more context allowing for a
more useful message.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently if we get an error opening a new socket while refilling a socket
pool, we carry on to the next slot and try again. This isn't very useful,
since by far the most likely cause of an error is some sort of resource
exhaustion. Trying again will probably just hit the same error, and maybe
even make things worse.
So, instead stop on the first error while refilling the pool, making do
with however many sockets we managed to open before the error.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently tcp_sock_refill_pool() stops as soon as it finds an entry in the
pool with a valid fd. This appears to makes sense: we always use fds from
the front of the pool, so if we find a filled one, the rest of the pool
should be filled as well.
However, that's not quite correct. If a previous refill hit errors trying
to open new sockets, it could leave gaps between blocks of valid fds. We're
going to add some changes that could make that more likely.
So, for robustness, instead skip over the filled entry but still try to
refill the rest of the array. We expect simply iterating over the pool to
be of small cost compared to even a single system call, so this shouldn't
have much impact.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Sometimes we use sa_family_t for variables and parameters containing a
socket address family, other times we use a plain int. Since sa_family_t
is what's actually used in struct sockaddr and friends, standardise on
that.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When a duplicate ack from the tap side triggers a fast re-transmit, we set
both conn->seq_ack_from_tap and conn->seq_to_tap to the sequence number of
the duplicate ack. Setting seq_to_tap is correct: this is what triggers
the retransmit from this point onwards. Setting seq_ack_from_tap is
not correct, though.
In most cases setting seq_ack_from_tap will be redundant but harmless:
it will have already been updated to the same value by
tcp_update_seqack_from_tap() a few lines above. However that call can
be skipped if tcp_sock_consume() fails, which is rare but possible. In
that case this update will cause problems.
We use seq_ack_from_tap to track two logically distinct things: how much of
the stream has been acked by the guest, and how much of the stream from the
socket has been read and discarded (as opposed to MSG_PEEKed). We attempt
to keep those values the same, because we discard data exactly when it is
acked by the guest. However tcp_sock_consume() failing means we weren't
able to disard the acked data. To handle that case, we skip the usual
update of seq_ack_from_tap, effectively ignoring the ack assuming we'll get
one which supersedes it soon enough. Setting seq_ack_from_tap in the
fast retransmit path, however, means we now really will have the
read/discard point in the stream out of sync with seq_ack_from_tap.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we always keep the flow table maximally compact: that is all the
active entries are contiguous at the start of the table. Doing this
sometimes requires moving an entry when one is freed. That's kind of
fiddly, and potentially expensive: it requires updating the hash table for
the new location, and depending on flow type, it may require EPOLL_CTL_MOD,
system calls to update epoll tags with the new location too.
Implement a new way of managing the flow table that doesn't ever move
entries. It attempts to maintain some compactness by always using the
first free slot for a new connection, and mitigates the effect of non
compactness by cheaply skipping over contiguous blocks of free entries.
See the "theory of operation" comment in flow.c for details.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>b
[sbrivio: additional ASSERT(flow_first_free <= FLOW_MAX - 2) to avoid
Coverity Scan false positive]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently, flows are only evern finally freed (and the table compacted)
from the deferred handlers. Some future ways we want to optimise managing
the flow table will rely on this, so enforce it: rather than having the
TCP code directly call flow_table_compact(), add a boolean return value to
the per-flow deferred handlers. If true, this indicates that the flow
code itself should free the flow.
This forces all freeing of flows to occur during the flow code's scan of
the table in flow_defer_handler() which opens possibilities for future
optimisations.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently tcp.c open codes the process of allocating a new flow from the
flow table: twice, in fact, once for guest to host and once for host to
guest connections. This duplication isn't ideal and will get worse as we
add more protocols to the flow table. It also makes it harder to
experiment with different ways of handling flow table allocation.
Instead, introduce a function to allocate a new flow: flow_alloc(). In
some cases we currently check if we're able to allocate, but delay the
actual allocation. We now handle that slightly differently with a
flow_alloc_cancel() function to back out a recent allocation. We have that
separate from a flow_free() function, because future changes we have in
mind will need to handle this case a little differently.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>