With glibc commit 25a5eb4010df ("string: strerror, strsignal cannot
use buffer after dlmopen (bug 32026)"), strerror() now needs, at least
on x86, the getrandom() and brk() system calls, in order to fill in
the locale-translated error message. But getrandom() and brk() are not
allowed by our seccomp profiles.
This became visible on Fedora Rawhide with the "podman login and
logout" Podman tests, defined at test/e2e/login_logout_test.go in the
Podman source tree, where pasta would terminate upon printing error
descriptions (at least the ones related to the SO_ERROR queue for
spliced connections).
Avoid dynamic memory allocation by calling strerrordesc_np() instead,
which is a GNU function returning a static, untranslated version of
the error description. If it's not available, keep calling strerror(),
which at that point should be simple enough as to be usable (at least,
that's currently the case for musl).
Reported-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/issues/24804
Analysed-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: Paul Holzinger <pholzing@redhat.com>
We have different versions of this function for IPv4 and IPv6, but the
caller already requires some IP version specific code to get the right
header pointers. Instead, have a common function that fills either an
IPv4 or an IPv6 header based on which header pointer it is passed. This
allows us to remove a small amount of code duplication and make a few
slightly ugly conditionals.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The only reason we need separate functions for the IPv4 and IPv6 case is
to calculate the checksum of the IP pseudo-header, which is different for
the two cases. However, the caller already knows which path it's on and
can access the values needed for the pseudo-header partial sum more easily
than tcp_update_check_tcp[46]() can.
So, merge these functions into a single tcp_update_csum() function that
just takes the pseudo-header partial sum, calculated in the caller.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
At the moment these take separate pointers to the tap specific and IP
headers, but expect the TCP header and payload as a single tcp_payload_t.
As well as being slightly inconsistent, this involves some slightly iffy
pointer shenanigans when called on the flags path with a tcp_flags_t
instead of a tcp_payload_t.
More importantly, it's inconvenient for the upcoming vhost-user case, where
the TCP header and payload might not be contiguous. Furthermore, the
payload itself might not be contiguous.
So, pass the TCP header as its own pointer, and the TCP payload as an IO
vector.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently these expects both the TCP header and payload in a single IOV,
and goes to some trouble to locate the checksum field within it. In the
current caller we've already know where the TCP header is, so we might as
well just pass it in. This will need to work a bit differently for
vhost-user, but that code already needs to locate the TCP header for other
reasons, so again we can just pass it in.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We usually want to checksum only the tail part of a frame, excluding at
least some headers. csum_iov() does that for a frame represented as an
IO vector, not actually summing the entire IO vector. We now have struct
iov_tail to explicitly represent this construct, so replace csum_iov()
with csum_iov_tail() taking that representation rather than 3 parameters.
We propagate the same change to csum_udp4() and csum_udp6() which take
similar parameters. This slightly simplifies the code, and will allow some
further simplifications as struct iov_tail is more widely used.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This function only has callers in tcp_buf.c. More importantly, it's
inherently tied to the "buf" path, because it uses internal knowledge of
how we lay out the various headers across our locally allocated buffers.
Therefore, move it to tcp_buf.c.
Slightly reformat the prototypes while we're at it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
add virtio and vhost-user functions to connect with QEMU.
$ ./passt --vhost-user
and
# qemu-system-x86_64 ... -m 4G \
-object memory-backend-memfd,id=memfd0,share=on,size=4G \
-numa node,memdev=memfd0 \
-chardev socket,id=chr0,path=/tmp/passt_1.socket \
-netdev vhost-user,id=netdev0,chardev=chr0 \
-device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \
...
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: as suggested by lvivier, include <netinet/if_ether.h>
before including <linux/if_ether.h> as C libraries such as musl
__UAPI_DEF_ETHHDR in <netinet/if_ether.h> if they already have
a definition of struct ethhdr]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
RFC 9293, 3.8.4 says:
Implementers MAY include "keep-alives" in their TCP implementations
(MAY-5), although this practice is not universally accepted. Some
TCP implementations, however, have included a keep-alive mechanism.
To confirm that an idle connection is still active, these
implementations send a probe segment designed to elicit a response
from the TCP peer. Such a segment generally contains SEG.SEQ =
SND.NXT-1 and may or may not contain one garbage octet of data. If
keep-alives are included, the application MUST be able to turn them
on or off for each TCP connection (MUST-24), and they MUST default to
off (MUST-25).
but currently, tcp_data_from_tap() is not aware of this and will
schedule a fast re-transmit on the second keep-alive (because it's
also a duplicate ACK), ignoring the fact that the sequence number was
rewinded to SND.NXT-1.
ACK these keep-alive segments, reset the activity timeout, and ignore
them for the rest.
At some point, we could think of implementing an approximation of
keep-alive segments on outbound sockets, for example by setting
TCP_KEEPIDLE to 1, and a large TCP_KEEPINTVL, so that we send a single
keep-alive segment at approximately the same time, and never reset the
connection. That's beyond the scope of this fix, though.
Reported-by: Tim Besard <tim.besard@gmail.com>
Link: https://github.com/containers/podman/discussions/24572
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
We enter the timer handler with the ACK_TO_TAP_DUE flag, call
tcp_prepare_flags() with ACK_IF_NEEDED, and realise that we
acknowledged everything meanwhile, so we return early, but we also
need to reset that flag to avoid unnecessarily scheduling the timer
over and over again until more pending data appears.
I'm not sure if this fixes any real issue, but I've spotted this
in several logs reported by users, including one where we have some
unexpected bursts of high CPU load during TCP transfers at low rates,
from https://github.com/containers/podman/issues/23686.
Link: https://github.com/containers/podman/discussions/24572
Link: https://github.com/containers/podman/issues/23686
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
I have no idea why, but these are reported by clang-tidy (19.2.1) on
Alpine (x86) only:
/home/sbrivio/passt/tap.c:1139:38: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
1139 | int fd = socket(AF_UNIX, SOCK_STREAM, 0);
| ^
| | SOCK_CLOEXEC
/home/sbrivio/passt/tap.c:1158:51: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
1158 | ex = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK, 0);
| ^
| | SOCK_CLOEXEC
/home/sbrivio/passt/tcp.c:1413:44: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
1413 | s = socket(af, SOCK_STREAM | SOCK_NONBLOCK, IPPROTO_TCP);
| ^
| | SOCK_CLOEXEC
/home/sbrivio/passt/util.c:188:38: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
188 | if ((s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP)) < 0) {
| ^
| | SOCK_CLOEXEC
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
tcp_info.h exists just to contain a modern enough version of struct
tcp_info for our needs, removing compile time dependency on the version of
kernel headers. There are several other cases where we can remove similar
compile time dependencies on kernel version. Prepare for that by renaming
tcp_info.h to linux_dep.h.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Use a plain uint16_t instead and avoid including one extra header:
the 'bitwise' attribute of __sum16 is just used by sparse(1).
Reported-by: omni <omni+alpine@hack.org>
Fixes: 3d484aa370 ("tcp: Update TCP checksum using an iovec array")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
For clock_gettime(), we shouldn't ignore errors if they happen at
initialisation phase, because something is seriously wrong and it's
not helpful if we proceed as if nothing happened.
As we're up and running, though, it's probably better to report the
error and use a stale value than to terminate altogether. Make sure
we use a zero value if we don't have a stale one somewhere.
For timerfd_gettime() and timerfd_settime() failures, just report an
error, there isn't much else we can do.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Following the preparations in the previous commit, we can now remove
the payload and flag queues dedicated for TCPv6 and TCPv4 and move all
traffic into common queues handling both protocol types.
Apart from reducing code and memory footprint, this change reduces
a potential risk for TCPv4 traffic starving out TCPv6 traffic.
Since we always flush out the TCPv4 frame queue before the TCPv6 queue,
the latter will never be handled if the former fails to send all its
frames.
Tests with iperf3 shows no measurable change in performance after this
change.
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In order to use particular fields from the TCP_INFO getsockopt() we
need them to be in structure returned by the runtime kernel. We attempt
to determine that with the HAS_BYTES_ACKED and HAS_MIN_RTT defines, probed
in the Makefile.
However, that's not correct, because the kernel headers we compile against
may not be the same as the runtime kernel. We instead should check against
the size of structure returned from the TCP_INFO getsockopt() as we already
do for tcpi_snd_wnd. Switch from the compile time flags to a runtime
test.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In order to use the tcpi_snd_wnd field from the TCP_INFO getsockopt() we
need the field to be supported in the runtime kernel (snd_wnd_cap).
In fact we should check that for for every tcp_info field we want to use,
beyond the very old ones shared with BSD. Prepare to do that, by
generalising the probing from setting a single bool to instead record the
size of the returned TCP_INFO structure. We can then use that recorded
value to check for the presence of any field we need.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In the Makefile we probe to create several defines based on the presence
of particular fields in struct tcp_info. These defines are used for two
purposes, neither of which they accomplish well:
1) Determining if the tcp_info fields are available at runtime. For this
purpose the defines are Just Plain Wrong, since the runtime kernel may
not be the same as the compile time kernel. We corrected this for
tcp_snd_wnd, but not for tcpi_bytes_acked or tcpi_min_rtt
2) Allowing the source to compile against older kernel headers which don't
have the fields in question. This works in theory, but it does mean
we won't be able to use the fields, even if later run against a
newer kernel. Furthermore, it's quite fragile: without much more
thorough tests of builds in different environments that we're currently
set up for, it's very easy to miss cases where we're accessing a field
without protection from an #ifdef. For example we currently access
tcpi_snd_wnd without #ifdefs in tcp_update_seqack_wnd().
Improve this with a different approach, borrowed from qemu (which has many
instances of similar problems). Don't compile against linux/tcp.h, using
netinet/tcp.h instead. Then for when we need an extension field, define
a struct tcp_info_linux, copied from the kernel, with all the fields we're
interested in. That may need updating from future kernel versions, but
only when we want to use a new extension, so it shouldn't be frequent.
This allows us to remove the HAS_SND_WND define entirely. We keep
HAS_BYTES_ACKED and HAS_MIN_RTT now, since they're used for purpose (1),
we'll fix that in a later patch.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Trivial grammar fixes in comments]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
As a rule, we prefer constructing packets with matching C structures,
rather than building them byte by byte. However, one case we still build
byte by byte is the TCP options we include in SYN packets (in fact the only
time we generate TCP options on the tap interface).
Rework this to use a structure and initialisers which make it a bit
clearer what's going on.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by; Stefano Brivio <sbrivio@redhat.com>
Starting from commit 9178a9e346 ("tcp: Always send an ACK segment
once the handshake is completed"), we always send an ACK segment,
without any payload, to complete the three-way handshake while
establishing a connection started from a socket.
We queue that segment after checking if we already have data to send
to the tap, which means that its sequence number is higher than any
segment with data we're sending in the same iteration, if any data is
available on the socket.
However, in tcp_defer_handler(), we first flush "flags" buffers, that
is, we send out segments without any data first, and then segments
with data, which means that our "empty" ACK is sent before the ACK
segment with data (if any), which has a lower sequence number.
This appears to be harmless as the guest or container will generally
reorder segments, but it looks rather weird and we can't exclude it's
actually causing problems.
Queue the empty ACK first, so that it gets a lower sequence number,
before checking for any data from the socket.
Reported-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
TCP header and payload are supposed to be in the same buffer,
and tcp_update_check_tcp4()/tcp_update_check_tcp6() compute
the checksum from the base address of the header using the
length of the IP payload.
In the future (for vhost-user) we need to dispatch the TCP header and
the TCP payload through several buffers. To be able to manage that, we
provide an iovec array that points to the data of the TCP frame.
We provide also an offset to be able to provide an array that contains
the TCP frame embedded in an lower level frame, and this offset points
to the TCP header inside the iovec array.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
As tcp_update_check_tcp4() and tcp_update_check_tcp6() compute the
checksum using the TCP header and the TCP payload, it is clearer
to use a pointer to tcp_payload_t that includes tcphdr and payload
rather than a pointer to tcphdr (and guessing TCP header is
followed by the payload).
Move tcp_payload_t and tcp_flags_t to tcp_internal.h.
(They will be used also by vhost-user).
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcp_sock_init() and udp_sock_init() take an address to bind to as an
address family and void * pair. Use an inany instead. Formerly AF_UNSPEC
was used to indicate that we want to listen on both 0.0.0.0 and ::, now use
a NULL inany to indicate that.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The sock_l4() function is very convenient for creating sockets bound to
a given address, but its interface has some problems.
Most importantly, the address and port alone aren't enough in some cases.
For link-local addresses (at least) we also need the pif in order to
properly construct a socket adddress. This case doesn't yet arise, but
it might cause us trouble in future.
Additionally, sock_l4() can take AF_UNSPEC with the special meaning that it
should attempt to create a "dual stack" socket which will respond to both
IPv4 and IPv6 traffic. This only makes sense if there is no specific
address given. We verify this at runtime, but it would be nicer if we
could enforce it structurally.
For sockets associated specifically with a single flow we already replaced
sock_l4() with flowside_sock_l4() which avoids those problems. Now,
replace all the remaining users with a new pif_sock_l4() which also takes
an explicit pif.
The new function takes the address as an inany *, with NULL indicating the
dual stack case. This does add some complexity in some of the callers,
however future planned cleanups should make this go away again.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
We can need not to set TCP checksum. Add a parameter to
tcp_fill_headers4() and tcp_fill_headers6() to disable it.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This parameter is already treated as a boolean internally. Make it a
'bool' type for clarity.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This function has a block conditional on !snd_wnd_cap shortly before an
snd_wnd_cap is statically false).
Therefore, simplify this down to a single conditional with an else branch.
While we're there, fix some improperly indented closing braces.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When available, we want to retrieve our socket peer's advertised window and
forward that to the guest. That information has been available from the
kernel via the TCP_INFO getsockopt() since kernel commit 8f7baad7f035.
Currently our probing for this is a bit odd. The HAS_SND_WND define
determines if our headers include the tcp_snd_wnd field, but that doesn't
necessarily mean the running kernel supports it. Currently we start by
assuming it's _not_ available, but mark it as available if we ever see
a non-zero value in the field. This is a bit hit and miss in two ways:
* Zero is perfectly possible window the peer could report, so we can
get false negatives
* We're reading TCP_INFO into a local variable, which might not be zero
initialised, so if the kernel _doesn't_ write it it could have non-zero
garbage, giving us false positives.
We can use a more direct way of probing for this: getsockopt() reports the
length of the information retreived. So, check whether that's long enough
to include the field. This lets us probe the availability of the field
once and for all during initialisation. That in turn allows ctx to become
a const pointer to tcp_prepare_flags() which cascades through many other
functions.
We also move the flag for the probe result from the ctx structure to a
global, to match peek_offset_cap.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcp_send_flag() and tcp_probe_peek_offset_cap() are not used outside tcp.c,
and have no prototype in a header. Make them static.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently, for not established connections, we monitor sockets with
edge-triggered events (EPOLLET) if we are in the TAP_SYN_RCVD state
(outbound connection being established) but not in the
TAP_SYN_ACK_SENT case of it (socket is connected, and we sent SYN,ACK
to the container/guest).
While debugging https://bugs.passt.top/show_bug.cgi?id=94, I spotted
another possibility for a short EPOLLRDHUP storm (10 seconds), which
doesn't seem to happen in actual use cases, but I could reproduce it:
start a connection from a container, while dropping (using netfilter)
ACK segments coming out of the container itself.
On the server side, outside the container, accept the connection and
shutdown the writing side of it immediately.
At this point, we're in the TAP_SYN_ACK_SENT case (not just a mere
TAP_SYN_RCVD state), we get EPOLLRDHUP from the socket, but we don't
have any reasonable way to handle it other than waiting for the tap
side to complete the three-way handshake. So we'll just keep getting
this EPOLLRDHUP until the SYN_TIMEOUT kicks in.
Always enable EPOLLET when EPOLLRDHUP is the only epoll event we
subscribe to: in this case, getting multiple EPOLLRDHUP reports is
totally useless.
In the only remaining non-established state, SOCK_ACCEPTED, for
inbound connections, we're anyway discarding EPOLLRDHUP events until
we established the conection, because we don't know what to do with
them until we get an answer from the tap side, so it's safe to enable
EPOLLET also in that case.
Link: https://bugs.passt.top/show_bug.cgi?id=94
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
These system calls are needed after the conversion of time_t to 64-bit
types on 32-bit architectures.
Tested by running some transfer tests with passt and pasta on Debian
Bookworm (glibc 2.36) and Trixie (glibc 2.39), running on armv6l.
Suggested-by: Faidon Liambotis <paravoid@debian.org>
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078981
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
I haven't tested i386 for a long time (after playing with some
openSUSE i586 image a couple of years ago). It turns out that a number
of system calls we actually need were denied by the seccomp filter,
and not even basic functionality works.
Add some system calls that glibc started using with the 64-bit time
("t64") transition, see also:
https://wiki.debian.org/ReleaseGoals/64bit-time
that is: clock_gettime64, timerfd_gettime64, fcntl64, and
recvmmsg_time64.
Add further system calls that are needed regardless of time_t width,
that is, mmap2 (valgrind profile only), _llseek and sigreturn (common
outside x86_64), and socketcall (same as s390x).
I validated this against an almost full run of the test suite, with
just a few selected tests skipped. Fixes needed to run most tests on
i386/i686, and other assorted fixes for tests, are included in
upcoming patches.
Reported-by: Uroš Knupleš <uros@knuples.net>
Analysed-by: Faidon Liambotis <paravoid@debian.org>
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078981
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
The term "forwarding address" to indicate the local-to-passt address was
well-intentioned, but ends up being kinda confusing. As discussed on a
recent call, let's try "our" instead.
(While we're there correct an error in flow_initiate_af()s comments where
we referred to parameters by the wrong name).
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This:
$ ./pasta
SO_PEEK_OFF not supported
#
is a bit annoying, and might trick users who face other issues into
thinking that SO_PEEK_OFF not being supported on a given kernel is
an actual issue.
Even if SO_PEEK_OFF is supported by the kernel, that would be the
only message displayed there, with default options, which looks a bit
out of context.
Switch that to debug(): now that Podman users can pass --debug too, we
can find out quickly if it's supported or not, if SO_PEEK_OFF usage is
suspected of causing any issue.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Based on an original patch by Jon Maloy:
--
The recently added socket option SO_PEEK_OFF is not supported for
TCP/IPv6 sockets. Until we get that support into the kernel we need to
test for support in both protocols to set the global 'peek_offset_cap´
to true.
--
Compared to the original patch:
- only check for SO_PEEK_OFF support for enabled IP versions
- use sa_family_t instead of int to pass the address family around
Fixes: e63d281871 ("tcp: leverage support of SO_PEEK_OFF socket option when available")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Currently the code to translate host side addresses and ports to guest side
addresses and ports, and vice versa, is scattered across the TCP code.
This includes both port redirection as controlled by the -t and -T options,
and our special case NAT controlled by the --no-map-gw option.
Gather this logic into fwd_nat_from_*() functions for each input
interface in fwd.c which take protocol and address information for the
initiating side and generates the pif and address information for the
forwarded side. This performs any NAT or port forwarding needed.
We create a flow_target() helper which applies those forwarding functions
as needed to automatically move a flow from INI to TGT state.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We generate TCP initial sequence numbers, when we need them, from a
hash of the source and destination addresses and ports, plus a
timestamp. Moments later, we generate another hash of the same
information plus some more to insert the connection into the flow hash
table.
With some tweaks to the flow_hash_insert() interface and changing the
order we can re-use that hash table hash for the initial sequence
number, rather than calculating another one. It won't generate
identical results, but that doesn't matter as long as the sequence
numbers are well scattered.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Move the data structures and helper functions for the TCP hash table to
flow.c, making it a general hash table indexing sides of flows. This is
largely code motion and straightforward renames. There are two semantic
changes:
* flow_lookup_af() now needs to verify that the entry has a matching
protocol and interface as well as matching addresses and ports.
* We double the size of the hash table, because it's now at least
theoretically possible for both sides of each flow to be hashed.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we match TCP packets received on the tap connection to a TCP
connection via a hash table based on the forwarding address and both
ports. We hope in future to allow for multiple guest side addresses, or
for multiple interfaces which means we may need to distinguish based on
the endpoint address and pif as well. We also want a unified hash table
to cover multiple protocols, not just TCP.
Replace the TCP specific hash function with one suitable for general flows,
or rather for one side of a general flow. This includes all the
information from struct flowside, plus the pif and the L4 protocol number.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Now that we store all our endpoints in the flowside structure, use some
inany helpers to make validation of those endpoints simpler.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
For now when we forward a connection to the host we leave the host side
forwarding address and port blank since we don't necessarily know what
source address and port will be used by the kernel. When the outbound
address option is active, though, we do know the address at least, so we
can record it in the flowside.
Having done that, use it as the primary source of truth, binding the
outgoing socket based on the information in there. This allows the
possibility of more complex rules for what outbound address and/or port
we use in future.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we always deliver inbound TCP packets to the guest's most
recent observed IP address. This has the odd side effect that if the
guest changes its IP address with active TCP connections we might
deliver packets from old connections to the new address. That won't
work; it will probably result in an RST from the guest. Worse, if the
guest added a new address but also retains the old one, then we could
break those old connections by redirecting them to the new address.
Now that we maintain flowside information, we have a record of the correct
guest side address and can just use it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Some information we explicitly store in the TCP connection is now
duplicated in the common flow structure. Access it from there instead, and
remove it from the TCP specific structure. With that done we can reorder
both the "tap" and "splice" TCP structures a bit to get better packing for
the new combined flow table entries.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Require the address and port information for the target (non
initiating) side to be populated when a flow enters TGT state.
Implement that for TCP and ICMP. For now this leaves some information
redundantly recorded in both generic and type specific fields. We'll
fix that in later patches.
For TCP we now use the information from the flow to construct the
destination socket address in both tcp_conn_from_tap() and
tcp_splice_connect().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Handling of each protocol needs some degree of tracking of the
addresses and ports at the end of each connection or flow. Sometimes
that's explicit (as in the guest visible addresses for TCP
connections), sometimes implicit (the bound and connected addresses of
sockets).
To allow more consistent handling across protocols we want to
uniformly track the address and port at each end of the connection.
Furthermore, because we allow port remapping, and we sometimes need to
apply NAT, the addresses and ports can be different as seen by the
guest/namespace and as by the host.
Introduce 'struct flowside' to keep track of address and port
information related to one side of a flow. Store two of these in the
common fields of a flow to track that information for both sides.
For now we only populate the initiating side, requiring that
information be completed when a flows enter INI. Later patches will
populate the target side.
For now this leaves some information redundantly recorded in both generic
and type specific fields. We'll fix that in later patches.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
TCP (both regular and spliced) and ICMP both have macros to retrieve the
relevant protcol specific flow structure from a flow index. In most cases
what we actually want is to get the specific flow from a sidx. Replace
those simple macros with a more precise inline, which also asserts that
the flow is of the type we expect.
While we're they're also add a pif_at_sidx() helper to get the interface of
a specific flow & side, which is useful in some places.
Finally, fix some minor style issues in the comments on some of the
existing sidx related helpers.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We abort the UDP socket handler if the no_udp flag is set. But if UDP
was disabled we should never have had a UDP socket to trigger the handler
in the first place. If we somehow did, ignoring it here isn't really going
to help because aborting without doing anything is likely to lead to an
epoll loop. The same is the case for the TCP socket and timer handlers and
the no_tcp flag.
Change these checks on the flag to ASSERT()s. Similarly add ASSERT()s to
several other entry points to the protocol specific code which should never
be called if the protocol is disabled.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
A bug in kernel TCP may lead to a deadlock where a zero window is sent
from the guest peer, while it is unable to send out window updates even
after socket reads have freed up enough buffer space to permit a larger
window. In this situation, new window advertisements from the peer can
only be triggered by data packets arriving from this side.
However, currently such packets are never sent, because the zero-window
condition prevents this side from sending out any packets whatsoever
to the peer.
We notice that the above bug is triggered *only* after the peer has
dropped one or more arriving packets because of severe memory squeeze,
and that we hence always enter a retransmission situation when this
occurs. This also means that the implementation goes against the
RFC-9293 recommendation that a previously advertised window never
should shrink.
RFC-9293 seems to permit that we can continue sending up to the right
edge of the last advertised non-zero window in such situations, so that
is what we do to resolve this situation.
It turns out that this solution is extremely simple to implememt in the
code: We just omit to save the advertised zero-window when we see that
it has shrunk, i.e., if the acknowledged sequence number in the
advertisement message is lower than that of the last data byte sent
from our side.
When that is the case, the following happens:
- The 'retr' flag in tcp_data_from_tap() will be 'false', so no
retransmission will occur at this occasion.
- The data stream will soon reach the right edge of the previously
advertised window. In fact, in all observed cases we have seen that
it is already there when the zero-advertisement arrives.
- At that moment, the flags STALLED and ACK_FROM_TAP_DUE will be set,
unless they already have been, meaning that only the next timer
expiration will open for data retransmission or transmission.
- When that happens, the memory squeeze at the guest will normally have
abated, and the data flow can resume.
It should be noted that although this solves the problem we have at
hand, it is a work-around, and not a genuine solution to the described
kernel bug.
Suggested-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Minor fix in commit title and commit reference in comment
to workaround
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>