passt

mirror of https://passt.top/passt synced 2025-01-21 19:55:17 +00:00

Author	SHA1	Message	Date
David Gibson	08ea3cc581	tcp: Pass TCP header and payload separately to tcp_fill_headers[46]() At the moment these take separate pointers to the tap specific and IP headers, but expect the TCP header and payload as a single tcp_payload_t. As well as being slightly inconsistent, this involves some slightly iffy pointer shenanigans when called on the flags path with a tcp_flags_t instead of a tcp_payload_t. More importantly, it's inconvenient for the upcoming vhost-user case, where the TCP header and payload might not be contiguous. Furthermore, the payload itself might not be contiguous. So, pass the TCP header as its own pointer, and the TCP payload as an IO vector. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-28 14:03:16 +01:00
David Gibson	2ee07697c4	tcp: Pass TCP header and payload separately to tcp_update_check_tcp[46]() Currently these expect both the TCP header and payload in a single IOV, and go to some trouble to locate the checksum field within it. In the current caller we already know where the TCP header is, so we might as well just pass it in. This will need to work a bit differently for vhost-user, but that code already needs to locate the TCP header for other reasons, so again we can just pass it in. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-28 14:03:16 +01:00
David Gibson	67151090bc	iov, checksum: Replace csum_iov() with csum_iov_tail() We usually want to checksum only the tail part of a frame, excluding at least some headers. csum_iov() does that for a frame represented as an IO vector, not actually summing the entire IO vector. We now have struct iov_tail to explicitly represent this construct, so replace csum_iov() with csum_iov_tail() taking that representation rather than 3 parameters. We propagate the same change to csum_udp4() and csum_udp6() which take similar parameters. This slightly simplifies the code, and will allow some further simplifications as struct iov_tail is more widely used. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-28 14:03:16 +01:00
David Gibson	f931103171	iov: iov tail helpers In the vhost-user code we have a number of places where we need to locate a particular header within the guest-supplied IO vector. We need to work out which buffer the header is in, and verify that it's contiguous and aligned as we need. At the moment this is open-coded, but introduce a helper to make this more straightforward. We add a new datatype 'struct iov_tail' representing an IO vector from which we've logically consumed some number of headers. The IOV_PULL_HEADER macro consumes a new header from the vector, returning a pointer and updating the iov_tail. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-28 14:03:16 +01:00
Stefano Brivio	804a7ce94a	tcp_vu: Change 'dlen' to ssize_t in tcp_vu_data_from_sock() ...to quickly suppress a false positive from Coverity, which assumes that iov_size is 0 and 'dlen' might overflow as a result (with hdrlen being 66). An ASSERT() in tcp_vu_sock_recv() already guarantees that iov_size(iov, buf_cnt) here is anyway greater than 'hdrlen'. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Laurent Vivier <lvivier@redhat.com>	2024-11-27 16:49:21 +01:00
Laurent Vivier	00cc2303fd	Fix build on 32bit target Fix the following errors when built with CFLAGS="-m32 -U__AVX2__": packet.c:57:23: warning: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 5 has type ‘size_t’ {aka ‘unsigned int’} [-Wformat=] 57 \| trace("packet offset plus length %lu from size %lu, " 58 \| "%s:%i", start - p->buf + len + offset, \| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ \| \| \| size_t {aka unsigned int} packet.c:57:23: warning: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 6 has type ‘size_t’ {aka ‘unsigned int’} [-Wformat=] 57 \| trace("packet offset plus length %lu from size %lu, " \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 58 \| "%s:%i", start - p->buf + len + offset, 59 \| p->buf_size, func, line); \| ~~~~~~~~~~~ \| \| \| size_t {aka unsigned int} vhost_user.c:139:32: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] 139 \| return (void *)(qemu_addr - r->qva + r->mmap_addr + \| ^ vhost_user.c:439:32: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] 439 \| munmap((void *)r->mmap_addr, r->size + r->mmap_offset); \| ^ vhost_user.c:900:32: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] 900 \| munmap((void *)r->mmap_addr, r->size + r->mmap_offset); \| ^ virtio.c:111:32: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] 111 \| return (void *)(guest_addr - r->gpa + r->mmap_addr + \| ^ vu_common.c:37:27: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] 37 \| char *m = (char *)dev_region->mmap_addr; \| ^ Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-27 16:49:21 +01:00
Laurent Vivier	6fae899cbb	virtio: check if avail ring is configured If the connection to the vhost-user front end is closed during transfers, virtio rings are deconfigured and not available anymore, but we can try to access them to process queued data. This can trigger a SIGSEGV as we try to access unavailable memory. To fix that, check that vq->vring.avail is sane before accessing the vring. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-27 16:49:21 +01:00
David Gibson	7e131e920c	tcp: Move tcp_l2_buf_fill_headers() to tcp_buf.c This function only has callers in tcp_buf.c. More importantly, it's inherently tied to the "buf" path, because it uses internal knowledge of how we lay out the various headers across our locally allocated buffers. Therefore, move it to tcp_buf.c. Slightly reformat the prototypes while we're at it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-27 16:49:21 +01:00
Stefano Brivio	676bf5488e	test: Add tests for passt in vhost-user mode Run functional and performance tests for vhost-user mode as well. For functional tests, we add passt_vu and passt_vu_in_ns as symbolic links to their non-vhost-user counterparts, as no differences are intended but we want to distinguish them in test logs. For performance tests, instead, we add separate perf/passt_vu_tcp and perf/passt_vu_udp files, as we need longer test duration, as well as higher UDP sending bandwidths and larger TCP windows, to actually get the highest throughput vhost-user mode offers. For valgrind tests, vhost-user mode needs two extra system calls: statx and readlink. Add them as EXTRA_SYSCALLS for the valgrind target. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-11-27 16:49:21 +01:00
Laurent Vivier	28997fcb29	vhost-user: add vhost-user Add virtio and vhost-user functions to connect with QEMU. $ ./passt --vhost-user and # qemu-system-x86_64 ... -m 4G \ -object memory-backend-memfd,id=memfd0,share=on,size=4G \ -numa node,memdev=memfd0 \ -chardev socket,id=chr0,path=/tmp/passt_1.socket \ -netdev vhost-user,id=netdev0,chardev=chr0 \ -device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \ ... Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: as suggested by lvivier, include <netinet/if_ether.h> before including <linux/if_ether.h> as C libraries such as musl define __UAPI_DEF_ETHHDR in <netinet/if_ether.h> if they already have a definition of struct ethhdr] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-27 16:47:32 +01:00
Laurent Vivier	b2e62f7e85	passt: rename tap_sock_init() to tap_backend_init() Extract pool storage initialization loop to tap_sock_update_pool(), extract QEMU hints to tap_backend_show_hints(). Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-27 16:12:27 +01:00
Laurent Vivier	b7c292b758	tcp: Export headers functions Export tcp_fill_headers[4\|6]() and tcp_update_check_tcp[4\|6](). They'll be needed by vhost-user. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-27 16:12:24 +01:00
Laurent Vivier	5a8b33c667	udp: Prepare udp.c to be shared with vhost-user Export udp_payload_t, udp_update_hdr4(), udp_update_hdr6() and udp_sock_errs(). Rename udp_listen_sock_handler() to udp_buf_listen_sock_handler() and udp_reply_sock_handler to udp_buf_reply_sock_handler(). Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-27 16:12:21 +01:00
Laurent Vivier	31117b27c6	vhost-user: introduce vhost-user API Add vhost_user.c and vhost_user.h that define the functions needed to implement vhost-user backend. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-27 16:12:19 +01:00
Laurent Vivier	7d1cd4dbf5	vhost-user: introduce virtio API Add virtio.c and virtio.h that define the functions needed to manage virtqueues. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-27 16:11:36 +01:00
Laurent Vivier	dd143e3890	packet: replace struct desc by struct iovec To be able to manage buffers inside a shared memory provided by a VM via a vhost-user interface, we cannot rely on the fact that buffers are located in a pre-defined memory area and use a base address and a 32bit offset to address them. We need a 64bit address, so replace struct desc by struct iovec and update range checking. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-27 16:11:18 +01:00
Stefano Brivio	c0fbc7ef2a	dhcp: Honour broadcast flag (RFC 2131, 4.1) It's widely considered a legacy option nowadays, and I haven't seen clients setting it since Windows 95, but it's convenient for a minimal DHCP client not using raw IP sockets such as what I'm playing with for muvm. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> 2024_11_27.c0fbc7e	2024-11-27 05:37:28 +01:00
Stefano Brivio	9da2038485	dhcp: Introduce support for Rapid Commit (option 80, RFC 4039) I'm trying to speed up and simplify IP address acquisition in muvm. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-11-27 05:37:28 +01:00
Stefano Brivio	d6e9e2486f	dhcp: Use -1 as "missing option" length instead of 0 We want to add support for option 80 (Rapid Commit, RFC 4039), whose length is 0. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-11-27 05:37:28 +01:00
Stefano Brivio	14b84a7f07	treewide: Introduce 'local mode' for disconnected setups There are setups where no host interface is available or configured at all, intentionally or not, temporarily or not, but users expect (Podman) containers to run in any case as they did with slirp4netns, and we're now getting reports that we broke such setups at a rather alarming rate. To this end, if we don't find any usable host interface, instead of exiting: - for IPv4, use 169.254.2.1 as guest/container address and 169.254.2.2 as default gateway - for IPv6, don't assign any address (forcibly disable DHCPv6), and use the first link-local address we observe to represent the guest/container. Advertise fe80::1 as default gateway - use 'tap0' as default interface name for pasta Change ifi4 and ifi6 in struct ctx to int and accept a special -1 value meaning that no host interface was selected, but the IP family is enabled. The fact that the kernel uses unsigned int values for those is not an issue as 1. one can't create so many interfaces anyway and 2. we otherwise handle those values transparently. Fix a botched conditional in conf_print() to actually skip printing DHCPv6 information if DHCPv6 is disabled (and skip printing NDP information if NDP is disabled). Link: https://github.com/containers/podman/issues/24614 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-27 05:16:38 +01:00
David Gibson	c6e6106413	test: Improve logic for waiting for SLAAC & DAD to complete in NDP tests Since 9a0e544f05bf the NDP tests attempt to explicitly wait for DAD to complete, rather than just having a hard coded sleep. However, the conditions we use are a bit sloppy and allow for a number of possible cases where it might not work correctly. Stefano seems to be hitting one of these (though I'm not sure which) with some later patches. - We wait for lack of a tentative address, so if the first check occurs before we have even a tentative address it will bypass the delay - It's not entirely clear if the permanent address will always appear as soon as the tentative address disappears - We weren't filtering on interface - We were doing the filtering with ip-address options rather than in jq. However, in at least some circumstances this seems to result in an empty .addr_info field, rather than omitting it entirely, which could cause us to get the wrong result So, instead, explicitly wait for the address we need to be present: an RA provided address on the external interface. While we're here we remove the requirement that it have global scope: the "kernel_ra" check is already sufficient to make sure this address comes from an NDP RA, not something else. If it's not the global scope address we expect, better to check it and fail, rather than keep waiting. Fixes: 9a0e544f05bf ("test: Improve test for NDP assigned prefix") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-26 08:30:18 +01:00
Stefano Brivio	cda7f160f0	ndp: Don't send first periodic router advertisement right after guest connects This is very visible with muvm, but it also happens with QEMU: we're sending the first unsolicited router advertisement milliseconds after the guest connects. That's usually pointless because, when the hypervisor connects, the guest is typically not ready yet to process anything of that sort: it's still booting. And if we happen to send it late enough (still milliseconds), with muvm, while the message is discarded, it sometimes (slightly) delays the response to the first solicited router advertisement, which is the one we need to have coming fast. Skip sending the unsolicited advertisement on the first timer run, just calculate the next delay. Keep it simple by observing that we're probably not trying to reach the 1970s with IPv6. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-11-26 08:30:18 +01:00
Stefano Brivio	2bf8ffcf07	test/perf: Select a single IPv6 namespace address in pasta tests By dropping the filter on prefix length, commit 910f4f910301 ("test: Don't require 64-bit prefixes in perf tests") broke tests on setups where two global unicast IPv6 addresses are available, which is the typical case when the "host" is a VM running under passt with addresses from SLAAC and DHCPv6, because two addresses will be returned. Pick the first one instead. We don't really care about the prefix length, any of these addresses will work. Fixes: 910f4f910301 ("test: Don't require 64-bit prefixes in perf tests") Link: https://archives.passt.top/passt-dev/20241119214344.6b4a5b3a@elisabeth/ Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-11-26 08:30:18 +01:00
Stefano Brivio	6819b2e102	conf, passt.1: Update --mac-addr default in usage() and man page Fixes: 90e83d50a9bd ("Don't take "our" MAC address from the host") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-11-26 08:30:18 +01:00
Stefano Brivio	b61be8468a	passt.1: Fix "default" note about --map-guest-addr It's not true that there's no mapping by default: there's no mapping in the --map-guest-addr sense, by default, but in that case the default --map-host-loopback behaviour prevails. While at it, fix a typo. Fixes: 57b7bd2a48a1 ("fwd, conf: Allow NAT of the guest's assigned address") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-11-26 08:30:18 +01:00
Stefano Brivio	238c69f9af	tcp: Acknowledge keep-alive segments, ignore them for the rest RFC 9293, 3.8.4 says: Implementers MAY include "keep-alives" in their TCP implementations (MAY-5), although this practice is not universally accepted. Some TCP implementations, however, have included a keep-alive mechanism. To confirm that an idle connection is still active, these implementations send a probe segment designed to elicit a response from the TCP peer. Such a segment generally contains SEG.SEQ = SND.NXT-1 and may or may not contain one garbage octet of data. If keep-alives are included, the application MUST be able to turn them on or off for each TCP connection (MUST-24), and they MUST default to off (MUST-25). but currently, tcp_data_from_tap() is not aware of this and will schedule a fast re-transmit on the second keep-alive (because it's also a duplicate ACK), ignoring the fact that the sequence number was rewound to SND.NXT-1. ACK these keep-alive segments, reset the activity timeout, and ignore them for the rest. At some point, we could think of implementing an approximation of keep-alive segments on outbound sockets, for example by setting TCP_KEEPIDLE to 1, and a large TCP_KEEPINTVL, so that we send a single keep-alive segment at approximately the same time, and never reset the connection. That's beyond the scope of this fix, though. Reported-by: Tim Besard <tim.besard@gmail.com> Link: https://github.com/containers/podman/discussions/24572 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> 2024_11_21.238c69f	2024-11-21 06:52:36 +01:00
Stefano Brivio	af464c4ffb	tcp: Reset ACK_TO_TAP_DUE flag whenever an ACK isn't needed anymore We enter the timer handler with the ACK_TO_TAP_DUE flag, call tcp_prepare_flags() with ACK_IF_NEEDED, and realise that we acknowledged everything meanwhile, so we return early, but we also need to reset that flag to avoid unnecessarily scheduling the timer over and over again until more pending data appears. I'm not sure if this fixes any real issue, but I've spotted this in several logs reported by users, including one where we have some unexpected bursts of high CPU load during TCP transfers at low rates, from https://github.com/containers/podman/issues/23686. Link: https://github.com/containers/podman/discussions/24572 Link: https://github.com/containers/podman/issues/23686 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-11-21 06:51:25 +01:00
David Gibson	5ae21841ac	ndp: Don't send unsolicited RAs if NDP is disabled We recently added support for sending unsolicited NDP Router Advertisement packets. While we (correctly) disable this if the --no-ra option is given we incorrectly still send them if --no-ndp is set. Fix the oversight. Fixes: 6e1e44293ef9 ("ndp: Send unsolicited Router Advertisements") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-19 21:10:42 +01:00
Stefano Brivio	bf9492747d	ndp: Don't send unsolicited router advertisement if we can't, yet ndp_timer() is called right away on the first epoll_wait() cycle, when the communication channel to the guest isn't ready yet: 1.0038: NDP: sending unsolicited RA, next in 264s 1.0038: tap: failed to send 1 frames of 1 check that it's up before sending it. This effectively delays the first gratuitous router advertisement, which is probably a good idea given that we expect the guest to send a router solicitation right away. Fixes: 6e1e44293ef9 ("ndp: Send unsolicited Router Advertisements") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-11-19 21:10:14 +01:00
Stefano Brivio	5e24466677	selinux: Use auth_read_passwd() interface for all our getpwnam() needs If passt or pasta are started as root, we need to read the passwd file (be it /etc/passwd or whatever sssd provides) to find out UID and GID of 'nobody' so that we can switch to it. Instead of a bunch of allow rules for passwd_file_t and sssd macros, use the more convenient auth_read_passwd() interface which should cover our usage of getpwnam(). The existing rules weren't actually enough: # strace -e openat passt -f [...] Started as root, will change to nobody. openat(AT_FDCWD, "/etc/nsswitch.conf", O_RDONLY\|O_CLOEXEC) = 4 openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY\|O_CLOEXEC) = 4 openat(AT_FDCWD, "/lib64/libnss_sss.so.2", O_RDONLY\|O_CLOEXEC) = 4 openat(AT_FDCWD, "/var/lib/sss/mc/passwd", O_RDONLY\|O_CLOEXEC) = -1 EACCES (Permission denied) openat(AT_FDCWD, "/var/lib/sss/mc/passwd", O_RDONLY\|O_CLOEXEC) = -1 EACCES (Permission denied) openat(AT_FDCWD, "/etc/passwd", O_RDONLY\|O_CLOEXEC) = 4 with corresponding SELinux warnings logged in audit.log. Reported-by: Minxi Hou <mhou@redhat.com> Analysed-by: Miloš Malik <mmalik@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-19 21:10:14 +01:00
David Gibson	6e1e44293e	ndp: Send unsolicited Router Advertisements Currently, our NDP implementation only sends Router Advertisements (RA) when it receives a Router Solicitation (RS) from the guest. However, RFC 4861 requires that we periodically send unsolicited RAs. Linux as a guest also requires this: it will send an RS when a link first comes up, but the route it gets from this will have a finite lifetime (we set this to 65535s, the maximum allowed, around 18 hours). When that expires the guest will not send a new RS, but instead expects the route to have been renewed (if still valid) by an unsolicited RA. Implement sending unsolicited RAs on a partially randomised timer, as required by RFC 4861. The RFC also specifies that solicited RAs should also be delayed, or even omitted, if the next unsolicited RA is soon enough. For now we don't do that, always sending an immediate RA in response to an RS. We can get away with this because in our use cases we expect to just have passt itself and the guest on the link, rather than a large broadcast domain. Link: https://github.com/kubevirt/kubevirt/issues/13191 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-14 19:00:40 +01:00
David Gibson	b39760cc7d	passt: Seed libc's pseudo random number generator We have an upcoming case where we need pseudo-random numbers to scatter timings, but we don't need cryptographically strong random numbers. libc's built in random() is fine for this purpose, but we should seed it. Extend secret_init() - the only current user of random numbers - to do this as well as generating the SipHash secret. Using /dev/random for a PRNG seed is probably overkill, but it's simple and we only do it once, so we might as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-14 19:00:38 +01:00
David Gibson	71d5deed5e	util: Add general low-level random bytes helper Currently secret_init() open codes getting good quality random bytes from the OS, either via getrandom(2) or reading /dev/random. We're going to add at least one more place that needs random data in future, so make a general helper for getting random bytes. While we're there, fix a number of minor bugs: - getrandom() can theoretically return a "short read", so handle that case - getrandom() as well as read can return a transient EINTR - We would attempt to read data from /dev/random if we failed to open it (open() returns -1), but not if we opened it as fd 0 (unlikely, but ok) - More specific error reporting Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-14 19:00:36 +01:00
David Gibson	a60703e899	ndp: Make route lifetime a #define Currently we open-code the lifetime of the route we advertise via NDP to be 65535s (the maximum). Change it to a #define. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-14 19:00:34 +01:00
David Gibson	36c070e6e3	ndp: Use struct assignment in preference to memcpy() for IPv6 addresses There are a number of places we can simply assign IPv6 addresses about, rather than the current mildly ugly memcpy(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-14 19:00:31 +01:00
David Gibson	cbc83e14df	ndp: Split out helpers for sending specific NDP message types Currently the large ndp() function responds to all NDP messages we handle, both parsing the message as necessary and sending the response. Split out the code to construct and send specific message types into ndp_na() (to send NA messages) and ndp_ra() (to send RA messages). As well as breaking up an excessively large function, this is a first step to being able to send unsolicited NDP messages. While we're there, remove a slightly ugly goto. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-14 19:00:29 +01:00
David Gibson	4e47167035	ndp: Add ndp_send() helper ndp() has a conditional on message type generating the reply message, then a tiny amount of common code, then another conditional to send the reply with slightly different parameters. We can make this a bit neater by making a helper function for sending the reply, and call it from each of the different message type paths. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-14 19:00:28 +01:00
David Gibson	71f228d04b	ndp: Remove redundant update to addr_seen ndp() updates addr_seen or addr_ll_seen based on the source address of the received packet. This is redundant since tap6_handler() has already updated addr_seen for any type of packet, not just NDP. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-14 19:00:13 +01:00
David Gibson	0588163b1f	cppcheck: Don't check the system headers We pass -I options to cppcheck so that it will find the system headers. Then we need to pass a bunch more options to suppress the zillions of cppcheck errors found in those headers. It turns out, however, that it's not recommended to give the system headers to cppcheck anyway. Instead it has built-in knowledge of the ANSI libc and uses that as the basis of its checks. We do need to suppress missingIncludeSystem warnings instead though. Not bothering with the system headers makes the cppcheck runtime go from ~37s to ~14s on my machine, which is a pretty nice win. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-08 08:26:21 +01:00
David Gibson	14dd70e2b3	linux_dep: Fix CLOSE_RANGE_UNSHARE availability handling If CLOSE_RANGE_UNSHARE isn't defined, we define a fallback version of close_range() which is a (successful) no-op. This is broken in several ways: * It doesn't actually fix compile if using old kernel headers, because the caller of close_range() still directly uses CLOSE_RANGE_UNSHARE unprotected by ifdefs * Even if it did fix the compile, it means inconsistent behaviour between a compile time failure to find the value (we silently don't close files) and a runtime failure (we die with an error from close_range()) * Silently not closing the files we intend to close for security reasons is probably not a good idea in any case We don't want to simply error if close_range() or CLOSE_RANGE_UNSHARE isn't available, because that would require running on kernel >= 5.9. On the other hand there's not really any other way to flush all possible fds leaked by the parent (close() in a loop takes over a minute). So in this case print a warning and carry on. As a bonus, this fixes a cppcheck error I see with some different options I'm looking to apply in future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-08 08:26:17 +01:00
David Gibson	d64f257243	linux_dep: Move close_range() conditional handling to linux_dep.h util.h has some #ifdefs and weak definitions to handle compatibility with various kernel versions. Move this to linux_dep.h which handles several other similar cases. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-08 08:26:15 +01:00
David Gibson	b84cd05098	log: Only check for FALLOC_FL_COLLAPSE_RANGE availability at runtime log.c has several #ifdefs on FALLOC_FL_COLLAPSE_RANGE that won't attempt to use it if not defined. But even if the value is defined at compile time, it might not be available in the runtime kernel, so we need to check for errors from a fallocate() call and fall back to other methods. Simplify this to only need the runtime check by using linux_dep.h to define FALLOC_FL_COLLAPSE_RANGE if it's not in the kernel headers. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-08 08:25:58 +01:00
Stefano Brivio	58fa5508bd	tap, tcp, util: Add some missing SOCK_CLOEXEC flags I have no idea why, but these are reported by clang-tidy (19.2.1) on Alpine (x86) only: /home/sbrivio/passt/tap.c:1139:38: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors] 1139 \| int fd = socket(AF_UNIX, SOCK_STREAM, 0); \| ^ \| \| SOCK_CLOEXEC /home/sbrivio/passt/tap.c:1158:51: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors] 1158 \| ex = socket(AF_UNIX, SOCK_STREAM \| SOCK_NONBLOCK, 0); \| ^ \| \| SOCK_CLOEXEC /home/sbrivio/passt/tcp.c:1413:44: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors] 1413 \| s = socket(af, SOCK_STREAM \| SOCK_NONBLOCK, IPPROTO_TCP); \| ^ \| \| SOCK_CLOEXEC /home/sbrivio/passt/util.c:188:38: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors] 188 \| if ((s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP)) < 0) { \| ^ \| \| SOCK_CLOEXEC Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-11-08 08:24:58 +01:00
Stefano Brivio	71869e2912	passt: Use NOLINT clang-tidy block instead of NOLINTNEXTLINE For some reason, this is only reported by clang-tidy 19.1.2 on Alpine: /home/sbrivio/passt/passt.c:314:53: error: conditional operator with identical true and false expressions [bugprone-branch-clone,-warnings-as-errors] 314 \| nfds = epoll_wait(c.epollfd, events, EPOLL_EVENTS, TIMER_INTERVAL); \| ^ We do have a suppression, but not on the line preceding it, because we also need a cppcheck suppression there. Use NOLINTBEGIN/NOLINTEND for the clang-tidy suppression. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-11-08 08:24:52 +01:00
Stefano Brivio	d4f09c9b96	util: Define small and big thresholds for socket buffers as unsigned long long On 32-bit architectures, clang-tidy reports: /home/pi/passt/tcp.c:728:11: error: performing an implicit widening conversion to type 'uint64_t' (aka 'unsigned long long') of a multiplication performed in type 'unsigned long' [bugprone-implicit-widening-of-multiplication-result,-warnings-as-errors] 728 \| if (v >= SNDBUF_BIG) \| ^ /home/pi/passt/util.h:158:22: note: expanded from macro 'SNDBUF_BIG' 158 \| #define SNDBUF_BIG (4UL * 1024 * 1024) \| ^ /home/pi/passt/tcp.c:728:11: note: make conversion explicit to silence this warning 728 \| if (v >= SNDBUF_BIG) \| ^ /home/pi/passt/util.h:158:22: note: expanded from macro 'SNDBUF_BIG' 158 \| #define SNDBUF_BIG (4UL * 1024 * 1024) \| ^~~~~~~~~~~~~~~~~ /home/pi/passt/tcp.c:728:11: note: perform multiplication in a wider type 728 \| if (v >= SNDBUF_BIG) \| ^ /home/pi/passt/util.h:158:22: note: expanded from macro 'SNDBUF_BIG' 158 \| #define SNDBUF_BIG (4UL * 1024 * 1024) \| ^~~~~~~~~~ /home/pi/passt/tcp.c:730:15: error: performing an implicit widening conversion to type 'uint64_t' (aka 'unsigned long long') of a multiplication performed in type 'unsigned long' [bugprone-implicit-widening-of-multiplication-result,-warnings-as-errors] 730 \| else if (v > SNDBUF_SMALL) \| ^ /home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL' 159 \| #define SNDBUF_SMALL (128UL * 1024) \| ^ /home/pi/passt/tcp.c:730:15: note: make conversion explicit to silence this warning 730 \| else if (v > SNDBUF_SMALL) \| ^ /home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL' 159 \| #define SNDBUF_SMALL (128UL * 1024) \| ^~~~~~~~~~~~ /home/pi/passt/tcp.c:730:15: note: perform multiplication in a wider type 730 \| else if (v > SNDBUF_SMALL) \| ^ /home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL' 159 \| #define SNDBUF_SMALL (128UL * 1024) \| ^~~~~ /home/pi/passt/tcp.c:731:17: error: performing an implicit widening conversion to type 'uint64_t' (aka 'unsigned long long') of a multiplication performed in type 'unsigned long' [bugprone-implicit-widening-of-multiplication-result,-warnings-as-errors] 731 \| v -= v * (v - SNDBUF_SMALL) / (SNDBUF_BIG - SNDBUF_SMALL) / 2; \| ^ /home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL' 159 \| #define SNDBUF_SMALL (128UL * 1024) \| ^ /home/pi/passt/tcp.c:731:17: note: make conversion explicit to silence this warning 731 \| v -= v * (v - SNDBUF_SMALL) / (SNDBUF_BIG - SNDBUF_SMALL) / 2; \| ^ /home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL' 159 \| #define SNDBUF_SMALL (128UL * 1024) \| ^~~~~~~~~~~~ /home/pi/passt/tcp.c:731:17: note: perform multiplication in a wider type 731 \| v -= v * (v - SNDBUF_SMALL) / (SNDBUF_BIG - SNDBUF_SMALL) / 2; \| ^ /home/pi/passt/util.h:159:24: note: expanded from macro 'SNDBUF_SMALL' 159 \| #define SNDBUF_SMALL (128UL * 1024) \| ^~~~~ because, wherever we use those thresholds, we define the other term of comparison as uint64_t. Define the thresholds as unsigned long long as well, to make sure we match types. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-11-08 08:24:45 +01:00
Stefano Brivio	87940f9aa7	tap: Cast TAP_BUF_BYTES - ETH_MAX_MTU to ssize_t, not TAP_BUF_BYTES Given that we're comparing against 'n', which is signed, we cast TAP_BUF_BYTES to ssize_t so that the maximum buffer usage, calculated as the difference between TAP_BUF_BYTES and ETH_MAX_MTU, will also be signed. This doesn't necessarily happen on 32-bit architectures, though. On armhf and i686, clang-tidy 18.1.8 and 19.1.2 report: /home/pi/passt/tap.c:1087:16: error: comparison of integers of different signs: 'ssize_t' (aka 'int') and 'unsigned int' [clang-diagnostic-sign-compare,-warnings-as-errors] 1087 \| for (n = 0; n <= (ssize_t)TAP_BUF_BYTES - ETH_MAX_MTU; n += len) { \| ~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ cast the whole difference to ssize_t, as we know it's going to be positive anyway, instead of relying on that side effect. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-11-08 08:24:45 +01:00
Stefano Brivio	1feb90fe62	dhcpv6: Turn some option headers pointers to const cppcheck 2.14.2 on Alpine reports: dhcpv6.c:431:32: style: Variable 'client_id' can be declared as pointer to const [constVariablePointer] struct opt_hdr ia, bad_ia, *client_id; ^ It's not only 'client_id': we can declare 'ia' as const pointer too. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-11-08 08:24:41 +01:00
Stefano Brivio	5f5e814cfc	dhcpv6: Use for loop instead of goto to avoid false positive cppcheck warning cppcheck 2.16.0 reports: dhcpv6.c:334:14: style: The comparison 'ia_type == 3' is always true. [knownConditionTrueFalse] if (ia_type == OPT_IA_NA) { ^ dhcpv6.c:306:12: note: 'ia_type' is assigned value '3' here. ia_type = OPT_IA_NA; ^ dhcpv6.c:334:14: note: The comparison 'ia_type == 3' is always true. if (ia_type == OPT_IA_NA) { ^ this is not really the case as we set ia_type to OPT_IA_TA and then jump back. Anyway, there's no particular reason to use a goto here: add a trivial foreach() macro to go through elements of an array and use it instead. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-11-08 08:24:11 +01:00
Jon Maloy	78da088f7b	tcp: unify payload and flags l2 frames array In order to reduce static memory and code footprint, we merge the array for l2 flag frames into the one for payload frames. This change also ensures that no flag message will be sent out over the l2 media bypassing already queued payload messages. Performance measurements with iperf3, where we force all traffic via the tap queue, show no significant difference: Dual traffic both directions simultaneously, with patch: ======================================================== host->ns: -------- [ ID] Interval Transfer Bitrate Retr [ 5] 0.00-100.00 sec 36.3 GBytes 3.12 Gbits/sec 4759 sender [ 5] 0.00-100.04 sec 36.3 GBytes 3.11 Gbits/sec receiver ns->host: --------- [ ID] Interval Transfer Bitrate [ 5] 0.00-100.00 sec 321 GBytes 27.6 Gbits/sec receiver Dual traffic both directions simultaneously, without patch: ============================================================ host->ns: -------- [ ID] Interval Transfer Bitrate Retr [ 5] 0.00-100.00 sec 35.0 GBytes 3.01 Gbits/sec 6001 sender [ 5] 0.00-100.04 sec 34.8 GBytes 2.99 Gbits/sec receiver ns->host -------- [ ID] Interval Transfer Bitrate [ 5] 0.00-100.00 sec 345 GBytes 29.6 Gbits/sec receiver Single connection, with patch: ============================== host->ns: --------- [ ID] Interval Transfer Bitrate Retr [ 5] 0.00-100.00 sec 138 GBytes 11.8 Gbits/sec 922 sender [ 5] 0.00-100.04 sec 138 GBytes 11.8 Gbits/sec receiver ns->host: ----------- [ ID] Interval Transfer Bitrate [ 5] 0.00-100.00 sec 430 GBytes 36.9 Gbits/sec receiver Single connection, without patch: ================================= host->ns: ------------ [ ID] Interval Transfer Bitrate Retr [ 5] 0.00-100.00 sec 139 GBytes 11.9 Gbits/sec 900 sender [ 5] 0.00-100.04 sec 139 GBytes 11.9 Gbits/sec receiver ns->host: --------- [ ID] Interval Transfer Bitrate [ 5] 0.00-100.00 sec 440 GBytes 37.8 Gbits/sec receiver Signed-off-by: Jon Maloy <jmaloy@redhat.com> Reviewed-by: David Gibson 
<david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-07 12:47:41 +01:00
David Gibson	9a0e544f05	test: Improve test for NDP assigned prefix In the NDP tests we search explicitly for a guest address with prefix length 64. AFAICT this is an attempt to specifically find the SLAAC assigned address, rather than something assigned by other means. We can do that more explicitly by checking for .protocol == "kernel_ra", however. The SLAAC prefixes we assign will always be 64-bit, that's hard-coded into our NDP implementation. RFC4862 doesn't really allow anything else since the interface identifiers for an Ethernet-like link are 64 bits. Let's actually verify that, rather than just assuming it, by extracting the prefix length assigned in the guest and checking it as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-11-07 12:47:37 +01:00
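
Commit 5f5e814cfc above replaces a goto-based retry with "a trivial foreach() macro to go through elements of an array". The log does not show the macro itself, so the following is only a sketch of what such a macro could look like. The macro body, ARRAY_SIZE(), the ia_types array and count_ia_types() are assumptions made for illustration, and the option codes follow the '3' reported by cppcheck for OPT_IA_NA plus the RFC 8415 value for OPT_IA_TA.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the DHCPv6 option codes (RFC 8415 values) */
#define OPT_IA_NA	3
#define OPT_IA_TA	4

#define ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))

/* Assumed shape of the macro: 'array' must be a real array, not a pointer,
 * so that ARRAY_SIZE() can compute the number of elements.
 */
#define foreach(item, array)						\
	for ((item) = (array);						\
	     (size_t)((item) - (array)) < ARRAY_SIZE(array);		\
	     (item)++)

/* Visit each IA option type once, instead of setting a variable and
 * jumping back with goto. Returns the number of types visited.
 */
int count_ia_types(void)
{
	static const int ia_types[] = { OPT_IA_NA, OPT_IA_TA };
	const int *ia_type;
	int visited = 0;

	foreach(ia_type, ia_types) {
		assert(*ia_type == OPT_IA_NA || *ia_type == OPT_IA_TA);
		visited++;
	}

	return visited;
}
```

With a loop like this, a static analyser sees both values flow through the same comparison, which is presumably why the knownConditionTrueFalse report no longer triggers.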


1802 Commits
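
Commit d4f09c9b96 above defines the socket buffer thresholds as unsigned long long so that they match the uint64_t operand they are compared and multiplied with. A minimal sketch of that idea follows. The two macros and the else-if branch mirror the lines quoted in the clang-tidy output, with the suffix changed to ULL. The function name sndbuf_scale() and the halving in the first branch are assumptions, since the log does not show that part of tcp.c.

```c
#include <assert.h>
#include <stdint.h>

/* Thresholds with an unsigned long long suffix: the multiplication is
 * carried out in a type at least 64 bits wide on 32-bit targets as well.
 */
#define SNDBUF_BIG	(4ULL * 1024 * 1024)
#define SNDBUF_SMALL	(128ULL * 1024)

/* Guaranteed by the C standard, spelled out here to document the intent */
static_assert(sizeof(SNDBUF_BIG) >= sizeof(uint64_t),
	      "threshold must be at least as wide as the value it is compared to");

/* Hypothetical helper: scale down a reported send buffer size 'v' */
uint64_t sndbuf_scale(uint64_t v)
{
	if (v >= SNDBUF_BIG)
		v /= 2;		/* assumption: not shown in the log */
	else if (v > SNDBUF_SMALL)
		v -= v * (v - SNDBUF_SMALL) / (SNDBUF_BIG - SNDBUF_SMALL) / 2;

	return v;
}
```

With 4UL and 128UL instead, the products would be computed as 32-bit unsigned long on armhf or i686 and only then widened, which is the implicit conversion clang-tidy flags.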