passt

mirror of https://passt.top/passt synced 2024-12-22 13:45:32 +00:00

Author	SHA1	Message	Date
David Gibson	58e6d68599	test: Clarify test for spliced inbound transfers The tests in pasta/tcp and pasta/udp for inbound transfers have the server listening within the namespace explicitly bound to 127.0.0.1 or ::1. This only works because of the behaviour of inbound splice connections, which always appear with both source and destination addresses as loopback in the namespace. That's not an inherent property for "spliced" connections and arguably an undesirable one. Also update the test names to make it clearer that these tests are expecting to exercise the "splice" path. Interestingly this was already correct for the equivalent passt_in_ns/*, although we also update the test names for clarity there. Note that there are similar issues in some of the podman tests, addressed in https://github.com/containers/podman/pull/24064 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-18 20:28:00 +02:00
David Gibson	1fa421192c	passt.1: Clarify and update "Handling of local addresses" section This section didn't mention the effect of the --map-host-loopback option which now alters this behaviour. Update it accordingly. It used "local addresses" to mean specifically 127.0.0.0/8 and ::1. However, "local" could also refer to link-local addresses or to addresses of any scope which happen to be configured on the host. Use "loopback address" to be more precise about this. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-18 20:27:57 +02:00
David Gibson	ef8a5161d0	passt.1: Mark --stderr as deprecated more prominently The description of this option says that it's deprecated, but unlike --no-copy-addrs and --no-copy-routes it doesn't have a clear label. Add one to make it easier to spot. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-18 20:27:54 +02:00
David Gibson	53176ca91d	test: Wait for DAD on DHCPv6 addresses After running dhclient -6 we expect the DHCPv6 assigned address to be immediately usable. That's true with the Fedora dhclient-script (and the upstream ISC DHCP one), however it's not true with the Debian dhclient-script. The Debian script can complete with the address still in "tentative" state, and the address won't be usable until Duplicate Address Detection (DAD) completes. That's arguably a bug in Debian (see link below), but for the time being we need to work around it anyway. We usually get away with this, because by the time we do anything where the address matters, DAD has completed. However, it's not robust, so we should explicitly wait for DAD to complete when we get an DHCPv6 address. Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1085231 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-18 20:27:51 +02:00
David Gibson	75b9c0feb0	test: Explicitly wait for DAD to complete on SLAAC addresses Getting a SLAAC address takes a little while because the kernel must complete Duplicate Address Detection (DAD) before marking the address as ready. In several places we have an explicit 'sleep 2' to wait for that to complete. Fixed length delays are never a great idea, although this one is pretty solid. Still, it would be better to explicitly wait for DAD to complete in case of long delays (which might happen on slow emulated hosts, or with heavy load), and to speed the tests up if DAD completes quicker. Replace the fixed sleeps with a loop waiting for DAD to complete. We do this by looping waiting for all tentative addresses to disappear. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-18 20:27:47 +02:00
David Gibson	f9d677bff6	arp: Fix a handful of small warts This fixes a number of harmless but slightly ugly warts in the ARP resolution code: * Use in4addr_any to represent 0.0.0.0 rather than hand constructing an example. * When comparing am->sip against 0.0.0.0 use sizeof(am->sip) instead of sizeof(am->tip) (same value, but makes more logical sense) * Described the guest's assigned address as such, rather than as "our address" - that's not usually what we mean by "our address" these days * Remove "we might have the same IP address" comment which I can't make sense of in context (possibly it's relating to the statement below, which already has its own comment?) Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-18 20:27:04 +02:00
Stefano Brivio	2d7f734c45	tcp: Send "empty" handshake ACK before first data segment Starting from commit `9178a9e346` ("tcp: Always send an ACK segment once the handshake is completed"), we always send an ACK segment, without any payload, to complete the three-way handshake while establishing a connection started from a socket. We queue that segment after checking if we already have data to send to the tap, which means that its sequence number is higher than any segment with data we're sending in the same iteration, if any data is available on the socket. However, in tcp_defer_handler(), we first flush "flags" buffers, that is, we send out segments without any data first, and then segments with data, which means that our "empty" ACK is sent before the ACK segment with data (if any), which has a lower sequence number. This appears to be harmless as the guest or container will generally reorder segments, but it looks rather weird and we can't exclude it's actually causing problems. Queue the empty ACK first, so that it gets a lower sequence number, before checking for any data from the socket. Reported-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-10-15 20:34:26 +02:00
Stefano Brivio	7612cb80fe	test: Pass TRACE from run_term() into ./run from_term Just like we do for PCAP, DEBUG and KERNEL. Otherwise, running tests with TRACE=1 will not actually enable tracing output. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-10-10 05:25:19 +02:00
Stefano Brivio	b40880c157	test/lib/term: Always use printf for messages with escape sequences ...instead of echo: otherwise, bash won't handle escape sequences we use to colour messages (and 'echo -e' is not specified by POSIX). Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-10-10 05:25:00 +02:00
David Gibson	ff63ac922a	conf: Add --dns-host option to configure host side nameserver When redirecting DNS queries with the --dns-forward option, passt/pasta needs a host side nameserver to redirect the queries to. This is controlled by the c->ip[46].dns_host variables. This is set to the first first nameserver listed in the host's /etc/resolv.conf, and there isn't currently a way to override it from the command line. Prior to `0b25cac9` ("conf: Treat --dns addresses as guest visible addresses") it was possible to alter this with the -D/--dns option. However, doing so was confusing and had some nonsensical edge cases because -D generally takes guest side addresses, rather than host side addresses. Add a new --dns-host option to restore this functionality in a more sensible way. Link: https://bugs.passt.top/show_bug.cgi?id=102 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-04 19:04:29 +02:00
David Gibson	9d66df9a9a	conf: Add command line switch to enable IP_FREEBIND socket option In a couple of recent reports, we've seen that it can be useful for pasta to forward ports from addresses which are not currently configured on the host, but might be in future. That can be done with the sysctl net.ipv4.ip_nonlocal_bind, but that does require CAP_NET_ADMIN to set in the first place. We can allow the same thing on a per-socket basis with the IP_FREEBIND (or IPV6_FREEBIND) socket option. Add a --freebind command line argument to enable this socket option on all listening sockets. Link: https://bugs.passt.top/show_bug.cgi?id=101 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-04 19:04:29 +02:00
Laurent Vivier	151dbe0d3d	udp: Update UDP checksum using an iovec array As for tcp_update_check_tcp4()/tcp_update_check_tcp6(), change csum_udp4() and csum_udp6() to use an iovec array. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-04 14:51:13 +02:00
Laurent Vivier	3d484aa370	tcp: Update TCP checksum using an iovec array TCP header and payload are supposed to be in the same buffer, and tcp_update_check_tcp4()/tcp_update_check_tcp6() compute the checksum from the base address of the header using the length of the IP payload. In the future (for vhost-user) we need to dispatch the TCP header and the TCP payload through several buffers. To be able to manage that, we provide an iovec array that points to the data of the TCP frame. We provide also an offset to be able to provide an array that contains the TCP frame embedded in an lower level frame, and this offset points to the TCP header inside the iovec array. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-04 14:51:10 +02:00
Laurent Vivier	e6548c6437	checksum: Add an offset argument in csum_iov() The offset allows any headers that are not part of the data to checksum to be skipped. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-04 14:51:08 +02:00
Laurent Vivier	fd8334b25d	pcap: Add an offset argument in pcap_iov() The offset is passed directly to pcap_frame() and allows any headers that are not part of the frame to capture to be skipped. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-04 14:51:02 +02:00
Laurent Vivier	72e7d3024b	tcp: Use tcp_payload_t rather than tcphdr As tcp_update_check_tcp4() and tcp_update_check_tcp6() compute the checksum using the TCP header and the TCP payload, it is clearer to use a pointer to tcp_payload_t that includes tcphdr and payload rather than a pointer to tcphdr (and guessing TCP header is followed by the payload). Move tcp_payload_t and tcp_flags_t to tcp_internal.h. (They will be used also by vhost-user). Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-10-04 14:50:46 +02:00
Stefano Brivio	def8acdcd8	test: Kernel binary can now be passed via the KERNEL environmental variable This is quite useful at least for myself as I'm usually running tests using a guest kernel that's not the same as the one on the host. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-10-02 14:50:34 +02:00
David Gibson	b55013b1a7	inany: Add inany_pton() helper We already have an inany_ntop() function to format inany addresses into text. Add inany_pton() to parse them from text, and use it in conf_ports(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2024-09-25 19:03:17 +02:00
David Gibson	cbde4192ee	tcp, udp: Make {tcp,udp}_sock_init() take an inany address tcp_sock_init() and udp_sock_init() take an address to bind to as an address family and void * pair. Use an inany instead. Formerly AF_UNSPEC was used to indicate that we want to listen on both 0.0.0.0 and ::, now use a NULL inany to indicate that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2024-09-25 19:03:16 +02:00
David Gibson	b8d4fac6a2	util, pif: Replace sock_l4() with pif_sock_l4() The sock_l4() function is very convenient for creating sockets bound to a given address, but its interface has some problems. Most importantly, the address and port alone aren't enough in some cases. For link-local addresses (at least) we also need the pif in order to properly construct a socket adddress. This case doesn't yet arise, but it might cause us trouble in future. Additionally, sock_l4() can take AF_UNSPEC with the special meaning that it should attempt to create a "dual stack" socket which will respond to both IPv4 and IPv6 traffic. This only makes sense if there is no specific address given. We verify this at runtime, but it would be nicer if we could enforce it structurally. For sockets associated specifically with a single flow we already replaced sock_l4() with flowside_sock_l4() which avoids those problems. Now, replace all the remaining users with a new pif_sock_l4() which also takes an explicit pif. The new function takes the address as an inany *, with NULL indicating the dual stack case. This does add some complexity in some of the callers, however future planned cleanups should make this go away again. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2024-09-25 19:03:15 +02:00
David Gibson	204e77cd11	udp: Don't attempt to get dual-stack sockets in nonsensical cases To save some kernel memory we try to use "dual stack" sockets (that listen to both IPv4 and IPv6 traffic) when possible. However udp_sock_init() attempts to do this in some cases that can't work. Specifically we can only do this when listening on any address. That's never true for the ns (splicing) case, because we always listen on loopback. For the !ns case and AF_UNSPEC case, addr should always be NULL, but add an assert to verify. This is harmless: if addr is non-NULL, sock_l4() will just fail and we'll fall back to the other path. But, it's messy and makes some upcoming changes harder, so avoid attempting this in cases we know can't work. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2024-09-25 19:03:15 +02:00
Laurent Vivier	8f8c4d27eb	tcp: Allow checksum to be disabled We can need not to set TCP checksum. Add a parameter to tcp_fill_headers4() and tcp_fill_headers6() to disable it. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-18 17:15:28 +02:00
Laurent Vivier	4fe5f4e813	udp: Allow checksum to be disabled We can need not to set the UDP checksum. Add a parameter to udp_update_hdr4() and udp_update_hdr6() to disable it. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-18 17:15:20 +02:00
David Gibson	d836d9e345	util: Remove possible quadratic behaviour from write_remainder() write_remainder() steps through the buffers in an IO vector writing out everything past a certain byte offset. However, on each iteration it rescans the buffer from the beginning to find out where we're up to. With an unfortunate set of write sizes this could lead to quadratic behaviour. In an even less likely set of circumstances (total vector length > maximum size_t) the 'skip' variable could overflow. This is one factor in a longstanding Coverity error we've seen (although I still can't figure out the remainder of its complaint). Rework write_remainder() to always work out our new position in the vector relative to our old/current position, rather than starting from the beginning each time. As a bonus this seems to fix the Coverity error. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Markus Armbruster <armbru@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-18 17:15:03 +02:00
David Gibson	bfc294b90d	util: Add helper to write() all of a buffer write(2) might not write all the data it is given. Add a write_all_buf() helper to keep calling it until all the given data is written, or we get an error. Currently we use write_remainder() to do this operation in pcap_frame(). That's a little awkward since it requires constructing an iovec, and future changes we want to make to write_remainder() will be easier in terms of this single buffer helper. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-18 17:14:59 +02:00
David Gibson	bb41901c71	tcp: Make tcp_update_seqack_wnd()s force_seq parameter explicitly boolean This parameter is already treated as a boolean internally. Make it a 'bool' type for clarity. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-18 17:14:55 +02:00
David Gibson	265b2099c7	tcp: Simplify ifdef logic in tcp_update_seqack_wnd() This function has a block conditional on !snd_wnd_cap shortly before an snd_wnd_cap is statically false). Therefore, simplify this down to a single conditional with an else branch. While we're there, fix some improperly indented closing braces. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-18 17:14:50 +02:00
David Gibson	4aff6f9392	tcp: Clean up tcpi_snd_wnd probing When available, we want to retrieve our socket peer's advertised window and forward that to the guest. That information has been available from the kernel via the TCP_INFO getsockopt() since kernel commit 8f7baad7f035. Currently our probing for this is a bit odd. The HAS_SND_WND define determines if our headers include the tcp_snd_wnd field, but that doesn't necessarily mean the running kernel supports it. Currently we start by assuming it's _not_ available, but mark it as available if we ever see a non-zero value in the field. This is a bit hit and miss in two ways: * Zero is perfectly possible window the peer could report, so we can get false negatives * We're reading TCP_INFO into a local variable, which might not be zero initialised, so if the kernel _doesn't_ write it it could have non-zero garbage, giving us false positives. We can use a more direct way of probing for this: getsockopt() reports the length of the information retreived. So, check whether that's long enough to include the field. This lets us probe the availability of the field once and for all during initialisation. That in turn allows ctx to become a const pointer to tcp_prepare_flags() which cascades through many other functions. We also move the flag for the probe result from the ctx structure to a global, to match peek_offset_cap. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-18 17:14:47 +02:00
David Gibson	7d8804beb8	tcp: Make some extra functions private tcp_send_flag() and tcp_probe_peek_offset_cap() are not used outside tcp.c, and have no prototype in a header. Make them static. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-18 17:14:33 +02:00
David Gibson	5ff5d55291	tcp: Avoid overlapping memcpy() in DUP_ACK handling When handling the DUP_ACK flag, we copy all the buffers making up the ack frame. However, all our frames share the same buffer for the Ethernet header (tcp4_eth_src or tcp6_eth_src), so copying the TCP_IOV_ETH will result in a (perfectly) overlapping memcpy(). This seems to have been harmless so far, but overlapping ranges to memcpy() is undefined behaviour, so we really should avoid it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-12 09:13:59 +02:00
David Gibson	1f414ed8f0	tcp: Remove redundant initialisation of iov[TCP_IOV_ETH].iov_base This initialisation for IPv4 flags buffers is redundant with the very next line which sets both iov_base and iov_len. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-12 09:13:46 +02:00
Stefano Brivio	6b38f07239	apparmor: Allow read access to /proc/sys/net/ipv4/ip_local_port_range ...for both passt and pasta: use passt's abstraction for this. Fixes: `eedc81b6ef` ("fwd, conf: Probe host's ephemeral ports") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 15:34:06 +02:00
Stefano Brivio	116bc8266d	selinux: Allow read access to /proc/sys/net/ipv4/ip_local_port_range Since commit `eedc81b6ef` ("fwd, conf: Probe host's ephemeral ports"), we might need to read from /proc/sys/net/ipv4/ip_local_port_range in both passt and pasta. While pasta was already allowed to open and write /proc/sys/net entries, read access was missing in SELinux's type enforcement: add that. In passt, instead, this is the first time we need to access an entry there: add everything we need. Fixes: `eedc81b6ef` ("fwd, conf: Probe host's ephemeral ports") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 15:34:06 +02:00
David Gibson	a33ecafbd9	tap: Don't risk truncating frames on full buffer in tap_pasta_input() tap_pasta_input() keeps reading frames from the tap device until the buffer is full. However, this has an ugly edge case, when we get close to buffer full, we will provide just the remaining space as a read() buffer. If this is shorter than the next frame to read, the tap device will truncate the frame and discard the remainder. Adjust the code to make sure we always have room for a maximum size frame. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 13:56:46 +02:00
David Gibson	d2a1dc744b	tap: Restructure in tap_pasta_input() tap_pasta_input() has a rather confusing structure, using two gotos. Remove these by restructuring the function to have the main loop condition based on filling our buffer space, with errors or running out of data treated as the exception, rather than the other way around. This allows us to handle the EINTR which triggered the 'restart' goto with a continue. The outer 'redo' was triggered if we completely filled our buffer, to flush it and do another pass. This one is unnecessary since we don't (yet) use EPOLLET on the tap device: if there's still more data we'll get another event and re-enter the loop. Along the way handle a couple of extra edge cases: - Check for EWOULDBLOCK as well as EAGAIN for the benefit of any future ports where those might not have the same value - Detect EOF on the tap device and exit in that case Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 13:56:43 +02:00
David Gibson	11e29054fe	tap: Improve handling of EINTR in tap_passt_input() When tap_passt_input() gets an error from recv() it (correctly) does not print any error message for EINTR, EAGAIN or EWOULDBLOCK. However in all three cases it returns from the function. That makes sense for EAGAIN and EWOULDBLOCK, since we then want to wait for the next EPOLLIN event before trying again. For EINTR, however, it makes more sense to retry immediately - as it stands we're likely to get a renewer EPOLLIN event immediately in that case, since we're using level triggered signalling. So, handle EINTR separately by immediately retrying until we succeed or get a different type of error. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 13:56:41 +02:00
David Gibson	49fc4e0414	tap: Split out handling of EPOLLIN events Currently, tap_handler_pas{st,ta}() check for EPOLLRDHUP, EPOLLHUP and EPOLLERR events, then assume anything left is EPOLLIN. We have some future cases that may want to also handle EPOLLOUT, so in preparation explicitly handle EPOLLIN, moving the logic to a subfunction. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 13:56:37 +02:00
Stefano Brivio	63513e54f3	util: Fix order of operands and carry of one second in timespec_diff_us() If the nanoseconds of the minuend timestamp are less than the nanoseconds of the subtrahend timestamp, we need to carry one second in the subtraction. I subtracted this second from the minuend, but didn't actually carry it in the subtraction of nanoseconds, and logged timestamps would jump back whenever we switched to the first branch of timespec_diff_us() from the second one. Most likely, the reason why I didn't carry the second is that I instinctively thought that swapping the operands would have the same effect. But it doesn't, in general: that only happens with arithmetic in modulo powers of 2. Undo the swap as well. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-09-06 13:01:34 +02:00
David Gibson	748ef4cd6e	cppcheck: Work around some cppcheck 2.15.0 redundantInitialization warnings cppcheck-2.15.0 has apparently broadened when it throws a warning about redundant initialization to include some cases where we have an initializer for some fields, but then set other fields in the function body. This is arguably a false positive: although we are technically overwriting the zero-initialization the compiler supplies for fields not explicitly initialized, this sort of construct makes sense when there are some fields we know at the top of the function where the initializer is, but others that require more complex calculation. That said, in the two places this shows up, it's pretty easy to work around. The results are arguably slightly clearer than what we had, since they move the parts of the initialization closer together. So do that rather than having ugly suppressions or dealing with the tedious process of reporting a cppcheck false positive. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 12:54:20 +02:00
Stefano Brivio	afedc2412e	tcp: Use EPOLLET for any state of not established connections Currently, for not established connections, we monitor sockets with edge-triggered events (EPOLLET) if we are in the TAP_SYN_RCVD state (outbound connection being established) but not in the TAP_SYN_ACK_SENT case of it (socket is connected, and we sent SYN,ACK to the container/guest). While debugging https://bugs.passt.top/show_bug.cgi?id=94, I spotted another possibility for a short EPOLLRDHUP storm (10 seconds), which doesn't seem to happen in actual use cases, but I could reproduce it: start a connection from a container, while dropping (using netfilter) ACK segments coming out of the container itself. On the server side, outside the container, accept the connection and shutdown the writing side of it immediately. At this point, we're in the TAP_SYN_ACK_SENT case (not just a mere TAP_SYN_RCVD state), we get EPOLLRDHUP from the socket, but we don't have any reasonable way to handle it other than waiting for the tap side to complete the three-way handshake. So we'll just keep getting this EPOLLRDHUP until the SYN_TIMEOUT kicks in. Always enable EPOLLET when EPOLLRDHUP is the only epoll event we subscribe to: in this case, getting multiple EPOLLRDHUP reports is totally useless. In the only remaining non-established state, SOCK_ACCEPTED, for inbound connections, we're anyway discarding EPOLLRDHUP events until we established the conection, because we don't know what to do with them until we get an answer from the tap side, so it's safe to enable EPOLLET also in that case. Link: https://bugs.passt.top/show_bug.cgi?id=94 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 12:54:16 +02:00
David Gibson	aff5a49b0e	udp: Handle more error conditions in udp_sock_errs() udp_sock_errs() reads out everything in the socket error queue. However we've seen some cases[0] where an EPOLLERR event is active, but there isn't anything in the queue. One possibility is that the error is reported instead by the SO_ERROR sockopt. Check for that case and report it as best we can. If we still get an EPOLLERR without visible error, we have no way to clear the error state, so treat it as an unrecoverable error. [0] https://github.com/containers/podman/issues/23686#issuecomment-2324945010 Link: https://bugs.passt.top/show_bug.cgi?id=95 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 12:53:38 +02:00
David Gibson	bd99f02a64	udp: Treat errors getting errors as unrecoverable We can get network errors, usually transient, reported via the socket error queue. However, at least theoretically, we could get errors trying to read the queue itself. Since we have no idea how to clear an error condition in that case, treat it as unrecoverable. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 12:53:35 +02:00
David Gibson	bd092ca421	udp: Split socket error handling out from udp_sock_recv() Currently udp_sock_recv() both attempts to clear socket errors and read a batch of datagrams for forwarding. That made sense initially, since both listening and reply sockets need to do this. However, we have certain error cases which will add additional complexity to the error processing. Furthermore, if we ever wanted to more thoroughly handle errors received here - e.g. by synthesising ICMP messages on the tap device - it will likely require different handling for the listening and reply socket cases. So, split handling of error events into its own udp_sock_errs() function. While we're there, allow it to report "unrecoverable errors". We don't have any of these so far, but some cases we're working on might require it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 12:53:33 +02:00
David Gibson	88bfa3801e	flow: Helpers to log details of a flow The details of a flow - endpoints, interfaces etc. - can be pretty important for debugging. We log this on flow state transitions, but it can also be useful to log this when we report specific conditions. Add some helper functions and macros to make it easy to do that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 12:53:30 +02:00
David Gibson	1166401c2f	udp: Allow UDP flows to be prematurely closed Unlike TCP, UDP has no in-band signalling for the end of a flow. So the only way we remove flows is on a timer if they have no activity for 180s. However, we've started to investigate some error conditions in which we want to prematurely abort / abandon a UDP flow. We can call udp_flow_close(), which will make the flow inert (sockets closed, no epoll events, can't be looked up in hash). However it will still wait 3 minutes to clear away the stale entry. Clean this up by adding an explicit 'closed' flag which will cause a flow to be more promptly cleaned up. We also publish udp_flow_close() so it can be called from other places to abort UDP flows(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 12:53:24 +02:00
David Gibson	7ad9f9bd2b	flow: Fix incorrect hash probe in flowside_lookup() Our flow hash table uses linear probing in which we step backwards through clusters of adjacent hash entries when we have near collisions. Usually that's implemented by flow_hash_probe(). However, due to some details we need a second implementation in flowside_lookup(). An embarrassing oversight in rebasing from earlier versions has mean that version is incorrect, trying to step forward through clusters rather than backward. In situations with the right sorts of has near-collisions this can lead to us not associating an ACK from the tap device with the right flow, leaving it in a not-quite-established state. If the remote peer does a shutdown() at the right time, this can lead to a storm of EPOLLRDHUP events causing high CPU load. Fixes: `acca4235c4` ("flow, tcp: Generalise TCP hash table to general flow hash table") Link: https://bugs.passt.top/show_bug.cgi?id=94 Suggested-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-09-06 12:52:31 +02:00
Stefano Brivio	0ea60e5a77	log: Don't prefix log file messages with time and severity if they're continuations In `fecb1b65b1` ("log: Don't prefix message with timestamp on --debug if it's a continuation"), I fixed this for --debug on standard error, but not for log files: if messages are continuations, they shouldn't be prefixed by timestamp and severity. Otherwise, we'll print stuff like this: 0.0028: ERROR: Receive error on guest connection, reset0.0028: ERROR: : Bad file descriptor Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-09-06 12:52:21 +02:00
Michal Privoznik	38363964fc	Makefile: Enable _FORTIFY_SOURCE iff needed On some systems source fortification is enabled whenever code optimization is enabled (e.g. with -O2). Since code fortification is explicitly enabled too (with possibly different value than the system wants, there are three levels [1]), distros are required to patch our Makefile, e.g. [2]. Detect whether fortification is not already enabled and enable it explicitly only if really needed. 1: https://www.gnu.org/software/libc/manual/html_node/Source-Fortification.html 2: `edfeb8763a` Signed-off-by: Michal Privoznik <mprivozn@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-29 22:26:21 +02:00
David Gibson	eedc81b6ef	fwd, conf: Probe host's ephemeral ports When we forward "all" ports (-t all or -u all), or use an exclude-only range, we don't actually forward all ports - that wouln't leave local ports to use for outgoing connections. Rather we forward all non-ephemeral ports - those that won't be used for outgoing connections or datagrams. Currently we assume the range of ephemeral ports is that recommended by RFC 6335, 49152-65535. However, that's not the range used by default on Linux, 32768-60999 but configurable with the net.ipv4.ip_local_port_range sysctl. We can't really know what range the guest will consider ephemeral, but if it differs too much from the host it's likely to cause problems we can't avoid anyway. So, using the host's ephemeral range is a better guess than using the RFC 6335 range. Therefore, add logic to probe the host's ephemeral range, falling back to the RFC 6335 range if that fails. This has the bonus advantage of reducing the number of ports bound by -t all -u all on most Linux machines thereby reducing kernel memory usage. Specifically this reduces kernel memory usage with -t all -u all from ~380MiB to ~289MiB. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-29 22:26:08 +02:00
David Gibson	4a41dc58d6	conf, fwd: Don't attempt to forward port 0 When using -t all, -u all or exclude-only ranges, we'll attempt to forward all non-ephemeral port numbers, including port 0. However, this won't work as intended: bind() treats a zero port not as literal port 0, but as "pick a port for me". Because of the special meaning of port 0, we mostly outright exclude it in our handling. Do the same for setting up forwards, not attempting to forward for port 0. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-08-29 22:26:05 +02:00

1 2 3 4 5 ...

1810 Commits