passt

mirror of https://passt.top/passt synced 2024-06-30 15:02:40 +00:00

Author	SHA1	Message	Date
Stefano Brivio	ca2749e1bd	passt: Relicense to GPL 2.0, or any later version In practical terms, passt doesn't benefit from the additional protection offered by the AGPL over the GPL, because it's not suitable to be executed over a computer network. Further, restricting the distribution under the version 3 of the GPL wouldn't provide any practical advantage either, as long as the passt codebase is concerned, and might cause unnecessary compatibility dilemmas. Change licensing terms to the GNU General Public License Version 2, or any later version, with written permission from all current and past contributors, namely: myself, David Gibson, Laine Stump, Andrea Bolognani, Paul Holzinger, Richard W.M. Jones, Chris Kuhn, Florian Weimer, Giuseppe Scrivano, Stefan Hajnoczi, and Vasiliy Ulyanov. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-04-06 18:00:33 +02:00
Stefano Brivio	33d88f79d9	tcp: Clear ACK_FROM_TAP_DUE also on unchanged ACK sequence from peer Since commit `cc6d8286d1` ("tcp: Reset ACK_FROM_TAP_DUE flag only as needed, update timer"), we don't clear ACK_FROM_TAP_DUE whenever we process an ACK segment, but, more correctly, only if we're really not waiting for a further ACK segment, that is, only if the acknowledged sequence matches what we sent. In the new function implementing this, tcp_update_seqack_from_tap(), we also reset the retransmission counter and store the updated ACK sequence. Both should be done iff forward progress is acknowledged, implied by the fact that the new ACK sequence is greater than the one we previously stored. At that point, it looked natural to also include the statements that clear and set the ACK_FROM_TAP_DUE flag inside the same conditional block: if we're not making forward progress, the need for an ACK, or lack thereof, should remain unchanged. There might be cases where this isn't true, though: without the previous commit `4e73e9bd65` ("tcp: Don't special case the handling of the ack of a syn"), this would happen if a tap-side client initiated a connection, and the server didn't send any data. At that point we would never, in the established state of the connection, call tcp_update_seqack_from_tap() with reported forward progress. That issue itself is fixed by the previous commit, now, but clearing ACK_FROM_TAP_DUE only on ACK sequence progress doesn't really follow any logic. Clear the ACK_FROM_TAP_DUE flag regardless of reported forward progress. If we clear it when it's already unset, conn_flag() will do nothing with it. This doesn't fix any known functional issue, rather a conceptual one. Fixes: `cc6d8286d1` ("tcp: Reset ACK_FROM_TAP_DUE flag only as needed, update timer") Reported-by: David Gibson <david@gibson.dropbear.id.au> Analysed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-03-29 13:47:17 +02:00
David Gibson	4e73e9bd65	tcp: Don't special case the handling of the ack of a syn TCP treats the SYN packets as though they occupied 1 byte in the logical data stream described by the sequence numbers. That is, the very first ACK (or SYN-ACK) each side sends should acknowledge a sequence number one greater than the initial sequence number given in the SYN or SYN-ACK it's responding to. In passt we were tracking that by advancing conn->seq_to_tap by one when we send a SYN or SYN-ACK (in tcp_send_flag()). However, we also initialized conn->seq_ack_from_tap, representing the acks we've already seen from the tap side, to ISN+1, meaning we treated it has having acknowledged the SYN before it actually did. There were apparently reasons for this in earlier versions, but it causes problems now. Because of this when we actually did receive the initial ACK or SYN-ACK, we wouldn't see the acknoweldged serial number as advancing, and so wouldn't clear the ACK_FROM_TAP_DUE flag. In most cases we'd get away because subsequent packets would clear the flag. However if one (or both) sides didn't send any data, the other side would (correctly) keep sending ISN+1 as the acknowledged sequence number, meaning we would never clear the ACK_FROM_TAP_DUE flag. That would mean we'd treat the connection as if we needed to retransmit (although we had 0 bytes to retransmit), and eventaully (after around 30s) reset the connection due to too many retransmits. Specifically this could cause the iperf3 throughput tests in the testsuite to fail if set for a long enough test period. Correct this by initializing conn->seq_ack_from_tap to the ISN and only advancing it when we actually get the first ACK (or SYN-ACK). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-03-29 13:47:07 +02:00
David Gibson	085672f77c	tcp: Clarify allowed state for tcp_data_from_tap() Comments suggest that this should only be called for an ESTABLISHED connection. However, it's non-trivial to ascertain that from the actual control flow in the caller. Add an ASSERT() to make it very clear that this is only called in ESTABLISHED state. In fact, there were some circumstances where it could be called on a CLOSED connection. In a sense that is "established", but with that assert this does require specific (trivial) handling to avoid a spurious abort(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-03-29 13:46:58 +02:00
Stefano Brivio	1ee2f7cada	tcp: Don't reset ACK_TO_TAP_DUE on any ACK, reschedule timer as needed This is mostly symmetric with commit `cc6d8286d1` ("tcp: Reset ACK_FROM_TAP_DUE flag only as needed, update timer"): we shouldn't reset the ACK_TO_TAP_DUE flag on any inbound ACK segment, but only once we acknowledge everything we received from the guest or the container. If we don't, a client might unnecessarily hold off further data, especially during slow start, and in general we won't converge to the usable bandwidth. This is very visible especially with traffic tests on links with non-negligible latency, such as in the reported issue. There, a public iperf3 server sometimes aborts the test due do what appears to be a low iperf3's --rcv-timeout (probably less than a second). Even if this doesn't happen, the throughput will converge to a fraction of the usable bandwidth. Clear ACK_TO_TAP_DUE if we acknowledged everything, set it if we didn't, and reschedule the timer in case the flag is still set as the timer expires. While at it, decrease the ACK timer interval to 10ms. A 50ms interval is short enough for any bandwidth-delay product I had in mind (local connections, or non-local connections with limited bandwidth), but here I am, testing 1gbps transfers to a peer with 100ms RTT. Indeed, we could eventually make the timer interval dependent on the current window and estimated bandwidth-delay product, but at least for the moment being, 10ms should be long enough to avoid any measurable syscall overhead, yet usable for any real-world application. Reported-by: Lukas Mrtvy <lukas.mrtvy@gmail.com> Link: https://bugs.passt.top/show_bug.cgi?id=44 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-03-21 23:14:58 +01:00
Stefano Brivio	9ffccf7acc	tcp: When a connection flag it set, don't negate it for debug print Fix a copy and paste typo I added in commit `5474bc5485` ("tcp, tcp_splice: Get rid of false positive CWE-394 Coverity warning from fls()") and --debug altogether. Fixes: `5474bc5485` ("tcp, tcp_splice: Get rid of false positive CWE-394 Coverity warning from fls()") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-03-21 19:39:55 +01:00
David Gibson	89d1494974	Fix false positive if cppcheck doesn't give a false positive `da46fdac` "tcp: Suppress knownConditionTrueFalse cppcheck false positive" introduced a suppression to work around a cppcheck bug causing a false positive warning. However, the suppression will itself cause a spurious unmatchedSuppression warning if used with a version of cppcheck from before the bug was introduced. That includes the packaged version of cppcheck in Fedora. Suppress the unmatchedSuppression as well. Fixes: `da46fdac36` ("tcp: Suppress knownConditionTrueFalse cppcheck false positive") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-03-21 16:38:44 +01:00
David Gibson	34ade90957	Work around weird false positives with cppcheck-2.9.1 Commit `89e38f55` "treewide: Fix header includes to build with musl" added extra #includes to work with musl. Unfortunately with the cppcheck version I'm using (cppcheck-2.9-1.fc37.x86_64 in Fedora 37) this causes weird false positives: specifically cppcheck seems to hit a #error in <bits/unistd.h> complaining about including it directly instead of via <unistd.h> (which is not something we're doing). I have no idea why that would be happening; but I'm guessing it has to be a bug in the cpp implementation in that cppcheck version. In any case, it's possible to work around this by moving the include of <unistd.h> before the include of <signal.h>. So, do that. Fixes: `89e38f5540` ("treewide: Fix header includes to build with musl") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-03-21 16:38:06 +01:00
Paul Holzinger	418f75ac37	pasta: fix tcp port forwarding in auto mode The logic in tcp_timer() was inverted. fwd_out should expose the host ports in the ns. Therfore it must read the ports on the host and then bind them in the netns. The same for fwd_in which checks ports in the ns and then exposes them on the host. Note that this only fixes tcp ports, udp does not seems to work at all right now with the auto mode. Signed-off-by: Paul Holzinger <pholzing@redhat.com> Fixes: `1128fa03fe` ("Improve types and names for port forwarding configuration") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-03-21 14:18:54 +01:00
Stefano Brivio	d7272f1df8	tcp: Clamp MSS value when queueing data to tap, also for pasta Tom reports that a pattern of repated ~1 MiB chunks downloads over NNTP over TLS, on Podman 4.4 using pasta as network back-end, results in pasta taking one full CPU thread after a while, and the download never succeeds. On that setup, we end up re-sending the same frame over and over, with a consistent 65 534 bytes size, and never get an acknowledgement from the tap-side client. This only happens for the default MTU value (65 520 bytes) or for values that are slightly smaller than that (down to 64 499 bytes). We hit this condition because the MSS value we use in tcp_data_from_sock(), only in pasta mode, is simply clamped to USHRT_MAX, and not to the actual size of the buffers we pre-cooked for sending, which is a bit less than that. It looks like we got away with it until commit `0fb7b2b908` ("tap: Use different io vector bases depending on tap type") fixed the setting of iov_len. Luckily, since it's pasta, we're queueing up to two frames at a time, so the worst that can happen is a badly segmented TCP stream: we always have some space at the tail of the buffer. Clamp the MSS value to the appropriate maximum given by struct tcp{4,6}_buf_data_t, no matter if we're running in pasta or passt mode. While at it, fix the comments to those structs to reflect the current struct size. This is not really relevant for any further calculation or consideration, but it's convenient to know while debugging this kind of issues. Thanks to Tom for reporting the issue in a very detailed way and for providing a test setup. Reported-by: Tom Mombourquette <tom@devnode.com> Link: https://github.com/containers/podman/issues/17703 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-03-09 03:44:21 +01:00
Stefano Brivio	5aea2f88ab	tcp, udp: Fix partial success return codes in {tcp,udp}_sock_init() The comments say we should return 0 on partial success, and an error code on complete failure. Rationale: if the user configures a port forwarding, and we succeed to bind that port for IPv4 or IPv6 only, that might actually be what the user intended. Adjust the two functions to reflect the comments. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-03-09 03:44:21 +01:00
Stefano Brivio	73992c42ce	tcp, udp, util: Pass socket creation errors all the way up ...starting from sock_l4(), pass negative error (errno) codes instead of -1. They will only be used in two commits from now, no functional changes intended here. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-03-09 03:44:21 +01:00
Chris Kuhn	89e38f5540	treewide: Fix header includes to build with musl Roughly inspired from a patch by Chris Kuhn: fix up includes so that we can build against musl: glibc is more lenient as headers generally include a larger amount of other headers. Compared to the original patch, I only included what was needed directly in C files, instead of adding blanket includes in local header files. It's a bit more involved, but more consistent with the current (not ideal) situation. Reported-by: Chris Kuhn <kuhnchris+github@kuhnchris.eu> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-03-09 03:44:21 +01:00
Stefano Brivio	a9c59dd91b	conf, icmp, tcp, udp: Add options to bind to outbound address and interface I didn't notice earlier: libslirp (and slirp4netns) supports binding outbound sockets to specific IPv4 and IPv6 addresses, to force the source addresse selection. If we want to claim feature parity, we should implement that as well. Further, Podman supports specifying outbound interfaces as well, but this is simply done by resolving the primary address for an interface when the network back-end is started. However, since kernel version 5.7, commit c427bfec18f2 ("net: core: enable SO_BINDTODEVICE for non-root users"), we can actually bind to a specific interface name, which doesn't need to be validated in advance. Implement -o / --outbound ADDR to bind to IPv4 and IPv6 addresses, and --outbound-if4 and --outbound-if6 to bind IPv4 and IPv6 sockets to given interfaces. Given that it probably makes little sense to select addresses and routes from interfaces different than the ones given for outbound sockets, also assign those as "template" interfaces, by default, unless explicitly overridden by '-i'. For ICMP and UDP, we call sock_l4() to open outbound sockets, as we already needed to bind to given ports or echo identifiers, and we can bind() a socket only once: there, pass address (if any) and interface (if any) for the existing bind() and setsockopt() calls. For TCP, in general, we wouldn't otherwise bind sockets. Add a specific helper to do that. For UDP outbound sockets, we need to know if the final destination of the socket is a loopback address, before we decide whether it makes sense to bind the socket at all: move the block mangling the address destination before the creation of the socket in the IPv4 path. This was already the case for the IPv6 path. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-03-09 03:43:59 +01:00
Stefano Brivio	4f523c3276	tcp: Avoid (theoretical) resource leak (CWE-772) Coverity warning If tcp_timer_ctl() gets a socket number greater than SOCKET_MAX (2 ^ 24), we return error but we don't close the socket. This is a rather formal issue given that, at least on Linux, socket numbers are monotonic and we're in general not allowed to open so many sockets. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-02-27 18:55:20 +01:00
Stefano Brivio	a1d5537741	tcp: Avoid false (but convoluted) positive Coverity CWE-476 warning If there are no TCP options in the header, tcp_tap_handler() will pass the corresponding pointer, fetched via packet_get(), as NULL to tcp_conn_from_sock_finish(), which in turn indirectly calls tcp_opt_get(). If there are no options, tcp_opt_get() will stop right away because the option length is indicated as zero. However, if the logic is complicated enough to follow for static checkers, adding an explicit check against NULL in tcp_opt_get() is probably a good idea. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-02-27 18:55:10 +01:00
Stefano Brivio	5474bc5485	tcp, tcp_splice: Get rid of false positive CWE-394 Coverity warning from fls() We use the return value of fls() as array index for debug strings. While fls() can return -1 (if no bit is set), Coverity Scan doesn't see that we're first checking the return value of another fls() call with the same bitmask, before using it. Call fls() once, store its return value, check it, and use the stored value as array index. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-02-27 18:54:38 +01:00
Stefano Brivio	a48c5c2abf	treewide: Disable gcc strict aliasing rules as needed, drop workarounds Recently, commit `4ddbcb9c0c` ("tcp: Disable optimisations for tcp_hash()") worked around yet another issue we hit with gcc 12 and '-flto -O2': some stores affecting the input data to siphash_20b() were omitted altogether, and tcp_hash() wouldn't get the correct hash for incoming connections. Digging further into this revealed that, at least according to gcc's interpretation of C99 aliasing rules, passing pointers to functions with different types compared to the effective type of the object (for example, a uint8_t pointer to an anonymous struct, as it happens in tcp_hash()), doesn't guarantee that stores are not reordered across the function call. This means that, in general, our checksum and hash functions might not see parts of input data that was intended to be provided by callers. Not even switching from uint8_t to character types, which should be appropriate here, according to C99 (ISO/IEC 9899, TC3, draft N1256), section 6.5, "Expressions", paragraph 7: An object shall have its stored value accessed only by an lvalue expression that has one of the following types: [...] — a character type. does the trick. I guess this is also subject to interpretation: casting passed pointers to character types, and then using those as different types, might still violate (dubious) aliasing rules. Disable gcc strict aliasing rules for potentially affected functions, which, in turn, disables gcc's Type-Based Alias Analysis (TBAA) optimisations based on those function arguments. Drop the existing workarounds. Also the (seemingly?) bogus 'maybe-uninitialized' warning on the tcp_tap_handler() > tcp_hash() > siphash_20b() path goes away with -fno-strict-aliasing on siphash_20b(). Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-02-27 18:54:13 +01:00
Stefano Brivio	da46fdac36	tcp: Suppress knownConditionTrueFalse cppcheck false positive cppcheck 2.10 reports: tcp.c:1815:12: style: Condition 'wnd>prev_scaled' is always false [knownConditionTrueFalse] if ((wnd > prev_scaled && wnd * 99 / 100 < prev_scaled) \|\| ^ tcp.c:1808:8: note: Assignment 'wnd=((1<<(16+8))<(wnd))?(1<<(16+8)):(wnd)', assigned value is less than 1 wnd = MIN(MAX_WINDOW, wnd); ^ tcp.c:1811:19: note: Assuming condition is false if (prev_scaled == wnd) ^ tcp.c:1815:12: note: Condition 'wnd>prev_scaled' is always false if ((wnd > prev_scaled && wnd * 99 / 100 < prev_scaled) \|\| ^ but this is not actually the case: wnd is typically greater than 1, and might very well be greater than prev_scaled as well. I bisected this down to cppcheck commit b4d455df487c ("Fix 11349: FP negativeIndex for clamped array index (#4627)") and reported findings at https://github.com/danmar/cppcheck/pull/4627. Suppress the warning for the moment being. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-02-27 18:53:45 +01:00
Stefano Brivio	4ddbcb9c0c	tcp: Disable optimisations for tcp_hash() I'm not sure if we're breaking some aliasing rule here, but with gcc 12.2.1 on x86_64 and -flto, the siphash_20b() call in tcp_hash() doesn't see the connection address -- it gets all zeroes instead. Fix this temporarily by disabling optimisations for this tcp_hash(). Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-02-22 13:16:22 +01:00
David Gibson	89e0fbfaa7	tcp: Remove 'zero_len' goto from tcp_data_from_sock This goto exists purely to move this exception case out of line. Although that does make the "normal" path a little clearer, it comes at the cost of not knowing how where control will flow after jumping to the zero_len label. The exceptional case isn't that long, so just put it inline. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-02-16 18:56:37 +01:00
David Gibson	83b2061ae7	tcp: Remove 'recvmsg' goto from tcp_data_from_sock This goto can be handled just as simply and more clearly with a do while. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-02-16 18:56:35 +01:00
Stefano Brivio	3d0de2c1d7	conf, tcp, udp: Exit if we fail to bind sockets for all given ports passt supports ranges of forwarded ports as well as 'all' for TCP and UDP, so it might be convenient to proceed if we fail to bind only some of the desired ports. But if we fail to bind even a single port for a given specification, we're clearly, unexpectedly, conflicting with another network service. In that case, report failure and exit. Reported-by: Yalan Zhang <yalzhang@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-02-16 17:33:49 +01:00
David Gibson	6ccab72d9b	tcp: Improve handling of fallback if socket pool is empty on new splice When creating a new spliced connection, we need to get a socket in the other ns from the originating one. To avoid excessive ns switches we usually get these from a pool refilled on a timer. However, if the pool runs out we need a fallback. Currently that's done by passing -1 as the socket to tcp_splice_connnect() and running it in the target ns. This means that tcp_splice_connect() itself needs to have different cases depending on whether it's given an existing socket or not, which is a separate concern from what it's mostly doing. We change it to require a suitable open socket to be passed in, and ensuring in the caller that we have one. This requires adding the fallback paths to the caller, tcp_splice_new(). We use slightly different approaches for a socket in the init ns versus the guest ns. This also means that we no longer need to run tcp_splice_connect() itself in the guest ns, which allows us to remove a bunch of boilerplate code. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-02-14 17:25:14 +01:00
David Gibson	dc467d526f	tcp: Split pool lookup from creating new sockets in tcp_conn_new_sock() tcp_conn_new_sock() first looks for a socket in a pre-opened pool, then if that's empty creates a new socket in the init namespace. Both parts of this are duplicated in other places: the pool lookup logic is duplicated in tcp_splice_new(), and the socket opening logic is duplicated in tcp_sock_refill_pool(). Split the function into separate parts so we can remove both these duplications. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-02-14 17:25:11 +01:00
David Gibson	912d37cd5b	tcp: Move socket pool declarations around tcp_splice.c has some explicit extern declarations to access the socket pools. This is pretty dangerous - if we changed the type of these variables in tcp.c, we'd have tcp.c and tcp_splice.c using the same memory in different ways with no compiler error. So, move the extern declarations to tcp_conn.h so they're visible to both tcp.c and tcp_splice.c, but not the rest of pasta. In fact the pools for the guest namespace are necessarily only used by tcp_splice.c - we have no sockets on the guest side if we're not splicing. So move those declarations and the functions that deal exclusively with them to tcp_splice.c Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-02-14 17:25:08 +01:00
David Gibson	c8993476d5	tcp: Split init and ns cases for tcp_sock_refill() With the creation of the tcp_sock_refill_pool() helper, the ns==true and ns==false paths for tcp_sock_refill() now have almost nothing in common. Split the two versions into tcp_sock_refill_init() and tcp_sock_refill_ns() functions. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-02-14 17:24:51 +01:00
David Gibson	e456c1ccdc	tcp: Make a helper to refill each socket pool tcp_sock_refill() contains two near-identical loops to refill the IPv4 and IPv6 socket pools. In addition, if we get an error on the IPv4 pool we exit early and won't attempt to refill the IPv6 pool. At least theoretically, these are independent from each other and there's value to filling up either pool without the other. So, there's no strong reason to give up on one because the other failed. Address both of these with a helper function 'tcp_sock_refill_pool()' to refill a single given pool. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-02-14 17:24:40 +01:00
David Gibson	7a8ed9459d	Make assertions actually useful There are some places in passt/pasta which #include <assert.h> and make various assertions. If we hit these something has already gone wrong, but they're there so that we a useful message instead of cryptic misbehaviour if assumptions we thought were correct turn out not to be. Except.. the glibc implementation of assert() uses syscalls that aren't in our seccomp filter, so we'll get a SIGSYS before it actually prints the message. Work around this by adding our own ASSERT() implementation using our existing err() function to log the message, and an abort(). The abort() probably also won't work exactly right with seccomp, but once we've printed the message, dying with a SIGSYS works just as well as dying with a SIGABRT. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-02-12 23:42:24 +01:00
Stefano Brivio	cc6d8286d1	tcp: Reset ACK_FROM_TAP_DUE flag only as needed, update timer David reports that TCP transfers might stall, especially with smaller socket buffer sizes, because we reset the ACK_FROM_TAP_DUE flag, in tcp_tap_handler(), whenever we receive an ACK segment, regardless of its sequence number and the fact that we might still be waiting for one. This way, we might fail to re-transmit frames on ACK timeouts. We need, instead, to: - indicate with the @retrans field only re-transmissions for the same data sequences. If we make progress, it should be reset, given that it's used to abort a connection when we exceed a given number of re-transmissions for the same data - unset the ACK_FROM_TAP_DUE flag if and only if the acknowledged sequence is the same as the last one we sent, as suggested by David - keep it set otherwise, if progress was done but not all the data we sent was acknowledged, and update the expiration of the ACK timeout Add a new helper for these purposes, tcp_update_seqack_from_tap(). To extend the ACK timeout, the new helper sets the ACK_FROM_TAP_DUE flag, even if it was already set, and conn_flag_do() triggers a timer update. This part should be revisited at a later time, because, strictly speaking, ACK_FROM_TAP_DUE isn't a flag anymore. One possibility might be to introduce another connection attribute for events affecting timer deadlines. Reported-by: David Gibson <david@gibson.dropbear.id.au> Link: https://bugs.passt.top/show_bug.cgi?id=41 Suggested-by: David Gibson <david@gibson.dropbear.id.au> Fixes: `be5bbb9b06` ("tcp: Rework timers to use timerfd instead of periodic bitmap scan") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-02-12 22:26:55 +01:00
David Gibson	2416310a17	tcp: Use abstracted tap header Update the TCP code to use the tap layer abstractions for initializing and updating the L2 and lower headers. This will make adding other tap backends in future easier. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:54:54 +01:00
David Gibson	e4443ba9bd	tcp: Consolidate calculation of total frame size tcp_l2_buf_fill_headers() returns the size of the generated frame including the ethernet header. The caller then adds on the size of the vnet_len field to get the total frame size to be passed to the tap device. Outside the tap code, though, we never care about the ethernet header size only the final total size we need to put into an iovec. So, consolidate the total frame size calculation within tcp_l2_buf_fill_headers(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:54:49 +01:00
David Gibson	dcea26076d	tcp: Remove redundant and incorrect initialization from *_iov_init() tcp_sock[46]_iov_init() initialize the length of each iovec buffer to MSS_DEFAULT. That will always be overwritten before use in tcp_data_to_tap, so it's redundant. It also wasn't correct, because it didn't correctly account for the header lengths in all cases. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:54:47 +01:00
David Gibson	716a926ef4	util: Parameterize ethernet header initializer macro We have separate IPv4 and IPv6 versions of a macro to construct an initializer for ethernet headers. However, now that we have htons_constant it's easy to simply paramterize this with the ethernet protocol number. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:54:46 +01:00
David Gibson	67afaab411	tcp, udp: Use named field initializers in iov_init functions Both the TCP and UDP iov_init functions have some large structure literals defined in "field order" style. These are pretty hard to read since it's not obvious what value corresponds to what field. Use named field style initializers instead to make this clearer. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:54:44 +01:00
David Gibson	6d011c1faa	tap, tcp: Move tap send path to tap.c The functions which do the final steps of sending TCP packets on through the tap interface - tcp_l2_buf_flush*() - no longer have anything that's actually specific to TCP in them, other than comments and names. Move them all to tap.c. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:54:40 +01:00
David Gibson	e21ee41ac3	tcp: Combine two parts of pasta tap send path together tcp_l2_buf_flush() open codes the loop across each frame in a group, but but calls tcp_l2_buf_write_one() to send each frame to the pasta tuntap device. Combine these two pasta-specific operations into tcp_l2_buf_flush_pasta() which is a little cleaner and will enable further cleanups. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:54:39 +01:00
David Gibson	a79e774770	tcp: Improve interface to tcp_l2_buf_flush() Currently this takes a msghdr, but the only thing we actually care about in there is the io vector. Make it take an io vector directly. We also have a weird side effect of zeroing @buf_used. Just pass this by value and zero it in the caller instead. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:54:36 +01:00
David Gibson	f1d2d35416	tcp: Don't compute total bytes in a message until we need it tcp[46]_l2_buf_bytes keep track of the total number of bytes we have queued to send to the tap interface. tcp_l2_buf_flush_passt() uses this to determine if sendmsg() has sent all the data we requested, or whether we need to resend a trailing portion. However, the logic for finding where we're up to in the case of a short sendmsg() can equally well tell whether we've had one at all, without knowing the total number in advance. This does require an extra loop after each sendmsg(), but it's doing simple arithmetic on values we've already been accessing, and it leads to overall simpler code. tcp[46]_l2_flags_buf_bytes were being calculated, but never used for anything, so simply remove them. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:54:33 +01:00
David Gibson	99f0be3cad	tcp: Combine two parts of passt tap send path together tcp_l2_buf_flush() open codes the "primary" send of message to the passt tap interface, but calls tcp_l2_buf_flush_part() to handle the case of a short send. Combine these two passt-specific operations into tcp_l2_buf_flush_passt() which is a little cleaner and will enable furrther cleanups. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:54:30 +01:00
David Gibson	d3089eb0ea	pcap: Replace pcapm() with pcap_multiple() pcapm() captures multiple frames from a msghdr, however the only thing it cares about in the msghdr is the list of buffers, where it assumes there is one frame to capture per buffer. That's what we want for its single caller but it's not the only obvious choice here (one frame per msghdr would arguably make more sense in isolation). In addition pcapm() has logic that only makes sense in the context of the passt specific path its called from: it skips the first 4 bytes of each buffer, because those have the qemu vnet_len rather than the frame proper. Make this clearer by replacing pcapm() with pcap_multiple() which more explicitly takes one struct iovec per frame, and parameterizes how much of each buffer to skip (i.e. the offset of the frame within the buffer). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:54:27 +01:00
Stefano Brivio	37f82ccd9f	tcp: Explicitly check option length field values in tcp_opt_get() Reported by Coverity (CWE-606, Untrusted loop bound), and actually harmless because we'll exit the option-scanning loop if the remaining length is not enough for a new option, instead of reading past the header. In any case, it looks like a good idea to explicitly check for reasonable values of option lengths. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-01-05 15:08:07 +01:00
Stefano Brivio	b06014a6b2	tcp: Pass union tcp_conn pointer to destroy and splice timer functions The pointers are actually the same, but we later pass the container union to tcp_table_compact(), which might zero the size of the whole union, and this confuses Coverity Scan. Given that we have pointers to the container union to start with, just pass those instead, all the way down to tcp_table_compact(). Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2022-11-25 01:37:36 +01:00
David Gibson	8e914238b6	tcp: Use dual stack sockets for port forwarding when possible Platforms like Linux allow IPv6 sockets to listen for IPv4 connections as well as native IPv6 connections. By doing this we halve the number of listening sockets we need for TCP (assuming passt/pasta is listening on the same ports for IPv4 and IPv6). When forwarding many ports (e.g. -t all) this can significantly reduce the amount of kernel memory that passt consumes. When forwarding all TCP and UDP ports for both IPv4 and IPv6 (-t all -u all), this reduces kernel memory usage from ~677MiB to ~487MiB (kernel version 6.0.8 on Fedora 37, x86_64). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-11-25 01:36:04 +01:00
David Gibson	ebf56c7b56	tcp: Consolidate tcp_sock_init[46] Previous cleanups mean that tcp_sock_init4() and tcp_sock_init6() are almost identical, and the remaining differences can be easily parameterized. Combine both into a single tcp_sock_init_af() function. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-11-25 01:35:55 +01:00
David Gibson	bb6bb5649d	tcp: NAT IPv4-mapped IPv6 addresses like IPv4 addresses passt usually doesn't NAT, but it does do so for the remapping of the gateway address to refer to the host. Currently we perform this NAT with slightly different rules on both IPv4 addresses and IPv6 addresses, but not on IPv4-mapped IPv6 addresses. This means we won't correctly handle the case of an IPv4 connection over an IPv6 socket, which is possible on Linux (and probably other platforms). Refactor tcp_conn_from_sock() to perform the NAT after converting either address family into an inany_addr, so IPv4 and and IPv4-mapped addresses have the same representation. With two new helpers this lets us remove the IPv4 and IPv6 specific paths from tcp_conn_from_sock(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-11-25 01:35:51 +01:00
David Gibson	034fa8a58d	tcp: Remove v6 flag from tcp_epoll_ref This bit in the TCP specific epoll reference indicates whether the connection is IPv6 or IPv4. However the sites which refer to it are already calling accept() which (optionally) returns an address for the remote end of the connection. We can use the sa_family field in that address to determine the connection type independent of the epoll reference. This does have a cost: for the spliced case, it means we now need to get that address from accept() which introduces an extran copy_to_user(). However, in future we want to allow handling IPv4 connectons through IPv6 sockets, which means we won't be able to determine the IP version at the time we create the listening socket and epoll reference. So, at some point we'll have to pay this cost anyway. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-11-25 01:35:48 +01:00
David Gibson	7f1f2f3f51	tcp: Fix small errors in tcp_seq_init() time handling It looks like tcp_seq_init() is supposed to advance the sequence number by one every 32ns. However we only right shift the ns part of the timespec not the seconds part, meaning that we'll advance by an extra 32 steps on each second. I don't know if that's exploitable in any way, but it doesn't appear to be the intent, nor what RFC 6528 suggests. In addition, we convert from seconds to nanoseconds with a multiplication by '1E9'. In C '1E9' is a floating point constant, forcing a conversion to floating point and back for what should be an integer calculation (confirmed with objdump and Makefile default compiler flags). Spell out 1000000000 in full to avoid that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-11-25 01:35:45 +01:00
David Gibson	e7bae5bafd	tcp: Have tcp_seq_init() take its parameters from struct tcp_conn tcp_seq_init() takes a number of parameters for the connection, but at every call site, these are already populated in the tcp_conn structure. Likewise we always store the result into the @seq_to_tap field. Use this to simplify tcp_seq_init(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-11-25 01:35:43 +01:00
David Gibson	a93bfada5b	tcp: Unify initial sequence number calculation for IPv4 and IPv6 tcp_seq_init() has separate paths for IPv4 and IPv6 addresses, which means we will calculate different sequence numbers for IPv4 and equivalent IPv4-mapped IPv6 addresses. Change it to treat these the same by always converting the input address into an inany_addr representation and use that to calculate the sequence number. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-11-25 01:35:41 +01:00

1 2 3 4 5

224 Commits