passt

mirror of https://passt.top/passt synced 2025-02-24 20:02:20 +00:00

Author	SHA1	Message	Date
David Gibson	e56c8038fc	tcp: More type safety for tcp_flow_migrate_target_ext() tcp_flow_migrate_target_ext() takes a raw union flow , although it is TCP specific, and requires a FLOW_TYPE_TCP entry. Our usual convention is that such functions should take a struct tcp_tap_conn instead. Convert it to do so. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-18 13:32:52 +01:00
David Gibson	5a07eb3ccc	tcp_vu: head_cnt need not be global head_cnt is a global variable which tracks how many entries in head[] are currently used. The fact that it's global obscures the fact that the lifetime over which it has a meaningful value is quite short: a single call to tcp_vu_data_from_sock(). Make it a local to tcp_vu_data_from_sock() to make that lifetime clearer. We keep the head[] array global for now - although technically it has the same valid lifetime - because it's large enough we might not want to put it on the stack. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-18 11:28:37 +01:00
David Gibson	6b4065153c	tap: Remove unused ETH_HDR_INIT() macro The uses of this macro were removed in d4598e1d18ac ("udp: Use the same buffer for the L2 header for all frames"). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-18 08:43:18 +01:00
David Gibson	354bc0bab1	packet: Don't pass start and offset separately to packet_check_range() Fundamentally what packet_check_range() does is to check whether a given memory range is within the allowed / expected memory set aside for packets from a particular pool. That range could represent a whole packet (from packet_add_do()) or part of a packet (from packet_get_do()), but it doesn't really matter which. However, we pass the start of the range as two parameters: @start which is the start of the packet, and @offset which is the offset within the packet of the range we're interested in. We never use these separately, only as (start + offset). Simplify the interface of packet_check_range() and vu_packet_check_range() to directly take the start of the relevant range. This will allow some additional future improvements. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-18 08:43:12 +01:00
David Gibson	0a51060f7a	packet: Use flexible array member in struct pool Currently we have a dummy pkt[1] array, which we alias with an array of a different size via various macros. However, we already require C11 which includes flexible array members, so we can do better. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-18 08:43:04 +01:00
Enrique Llorente	bcc4908c2b	dhcp: Remove option 255 length byte Option 255 (end of options) does not need a length byte. This change removes it, freeing one extra byte for other dynamic options. Signed-off-by: Enrique Llorente <ellorent@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-18 08:42:35 +01:00
Stefano Brivio	a1e48a02ff	test: Add migration tests PCAP=1 ./run migrate/bidirectional gives an overview of how the whole thing is working. Add 12 tests in total, checking basic functionality with and without flows in both directions, with and without sockets in half-closed states (both inbound and outbound), migration behaviour under traffic flood, under traffic flood with > 253 flows, and strict checking of sequences under flood with ramp patterns in both directions. These tests need preparation and teardown for each case, as we need to restore the source guest in its own context and pane before we can test again. Eventually, we could consider alternating source and target so that we don't need to restart from scratch every time, but that's beyond the scope of this initial test implementation. Trick: './run migrate/*' runs all the tests with preparation and teardown steps. Co-authored-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> 2025_02_17.a1e48a0	2025-02-17 08:29:36 +01:00
Stefano Brivio	89ecf2fd40	migrate: Migrate TCP flows This implements flow preparation on the source, transfer of data with a format roughly inspired by struct tcp_tap_conn, plus a specific structure for parameters that don't fit in the flow table, and flow insertion on the target, with all the appropriate window options, window scaling, MSS, etc. Contents of pending queues are transferred as well. The target side is rather convoluted because we first need to create sockets and switch them to repair mode, before we can apply options that are not stored in the flow table. This also means that, if we're testing this on the same machine, in the same namespace, we need to close the listening socket on the source before we can start moving data. Further, we need to connect() the socket on the target before we can restore data queues, but we can't do that (again, on the same machine) as long as the matching source socket is open, which implies an arbitrary limit on queue sizes we can transfer, because we can only dump pending queues on the source as long as the socket is open, of course. Co-authored-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Tested-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-17 08:29:03 +01:00
Stefano Brivio	3e903bbb1f	repair, passt-repair: Build and warning fixes for musl Checked against musl 1.2.5. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2025-02-17 08:28:48 +01:00
Stefano Brivio	01b6a164d9	tcp_splice: A typo three years ago and SO_RCVLOWAT is gone In commit e5eefe77435a ("tcp: Refactor to use events instead of states, split out spliced implementation"), this: if (!bitmap_isset(rcvlowat_set, conn - ts) && readlen > (long)c->tcp.pipe_size / 10) { (note the !) became: if (conn->flags & lowat_set_flag && readlen > (long)c->tcp.pipe_size / 10) { in the new tcp_splice_sock_handler(). We want to check, there, if we should set SO_RCVLOWAT, only if we haven't set it already. But, instead, we're checking if it's already set before we set it, so we'll never set it, of course. Fix the check and re-enable the functionality, which should give us improved CPU utilisation in non-interactive cases where we are not transferring at full pipe capacity. Fixes: e5eefe77435a ("tcp: Refactor to use events instead of states, split out spliced implementation") Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-17 08:28:45 +01:00
Stefano Brivio	667caa09c6	tcp_splice: Don't wake up on input data if we can't write it anywhere If we set the OUT_WAIT_* flag (waiting on EPOLLOUT) for a side of a given flow, it means that we're blocked, waiting for the receiver to actually receive data, with a full pipe. In that case, if we keep EPOLLIN set for the socket on the other side (our receiving side), we'll get into a loop such as: 41.0230: pasta: epoll event on connected spliced TCP socket 108 (events: 0x00000001) 41.0230: Flow 1 (TCP connection (spliced)): -1 from read-side call 41.0230: Flow 1 (TCP connection (spliced)): -1 from write-side call (passed 8192) 41.0230: Flow 1 (TCP connection (spliced)): event at tcp_splice_sock_handler:577 41.0230: pasta: epoll event on connected spliced TCP socket 108 (events: 0x00000001) 41.0230: Flow 1 (TCP connection (spliced)): -1 from read-side call 41.0230: Flow 1 (TCP connection (spliced)): -1 from write-side call (passed 8192) 41.0230: Flow 1 (TCP connection (spliced)): event at tcp_splice_sock_handler:577 leading to 100% CPU usage, of course. Drop EPOLLIN on our receiving side as long as we're waiting for output readiness on the other side. Link: https://github.com/containers/podman/issues/23686#issuecomment-2661036584 Link: https://www.reddit.com/r/podman/comments/1iph50j/pasta_high_cpu_on_podman_rootless_container/ Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-17 08:27:30 +01:00
David Gibson	7c33b12086	vhost_user: Clear ring address on GET_VRING_BASE GET_VRING_BASE stops the queue, clearing the call and kick fds. However, we don't clear vring.avail. That means that if vu_queue_notify() is called it won't realise the queue isn't ready and will die with an EBADFD. We get this during migration, because for some reason, qemu reconfigures the vhost-user device when a migration is triggered. There's a window between the GET_VRING_BASE and re-establishing the call fd where the notify function can be called, causing a crash. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-15 05:34:21 +01:00
Stefano Brivio	71249ef3f9	tcp, tcp_splice: Don't set SO_SNDBUF and SO_RCVBUF to maximum values I added this a long long time ago because it dramatically improved throughput back then: with rmem_max and wmem_max >= 4 MiB, we would force send and receive buffer sizes for TCP sockets to the maximum allowed value. This effectively disables TCP auto-tuning, which would otherwise allow us to exceed those limits, as crazy as it might sound. But in any case, it made sense. Now that we have zero (internal) copies on every path, plus vhost-user support, it turns out that these settings are entirely obsolete. I get substantially the same throughput in every test we perform, even with very short durations (one second). The settings are not just useless: they actually cause us quite some trouble on guest state migration, because they lead to huge queues that need to be moved as well. Drop those settings. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-14 12:02:55 +01:00
Stefano Brivio	30f1e082c3	tcp: Keep updating window and checking for socket data after FIN from guest Once we get a FIN segment from the container/guest, we enter something resembling CLOSE_WAIT (from the perspective of the peer), but that doesn't mean that we should stop processing window updates from the guest and checking for socket data if the guest acknowledges something. If we don't do that, we can very easily run into a situation where we send a burst of data to the tap, get a zero window update, along with a FIN segment, because the flow is meant to be unidirectional, and now the connection will be stuck forever, because we'll ignore updates. Reproducer, server: $ pasta --config-net -t 9999 -- sh -c 'echo DONE | socat TCP-LISTEN:9997,shut-down STDIO' and client: $ ./test/rampstream send 50000 | socat -u STDIN TCP:$LOCAL_ADDR:9997 2025/02/13 09:14:45 socat[2997126] E write(5, 0x55f5dbf47000, 8192): Broken pipe While at it, update the message string for the third passive close state (which we see in this case): it's CLOSE_WAIT, not LAST_ACK. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-14 10:04:39 +01:00
Stefano Brivio	98d474c895	contrib/selinux: Enable mapping guest memory for libvirt guests This doesn't actually belong to passt's own policy: we should export an interface and libvirt's policy should use it, because passt's policy shouldn't be aware of svirt_image_t at all. However, libvirt doesn't maintain its own policy, which makes policy updates rather involved. Add this workaround to ensure --vhost-user is working in combination with libvirt, as it might take ages before we can get the proper rule in libvirt's policy. Reported-by: Laine Stump <laine@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-14 10:04:39 +01:00
Stefano Brivio	9a84df4c3f	selinux: Add rules needed to run tests ...other than being convenient, they might be reasonably representative of typical stand-alone usage. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-13 00:42:52 +01:00
David Gibson	a301158456	rampstream: Add utility to test for corruption of data streams Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-12 19:48:17 +01:00
Stefano Brivio	6f122f0171	tcp: Get bound address for connected inbound sockets too So that we can bind inbound sockets to specific addresses, like we already do for outbound sockets. While at it, change the error message in tcp_conn_from_tap() to match this one. Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-12 19:48:00 +01:00
Stefano Brivio	f3fe795ff5	vhost_user: Make source quit after reporting migration state This will close all the sockets we currently have open in repair mode, and completes our migration tasks as source. If the hypervisor wants to have us back at this point, somebody needs to restart us. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-12 19:47:51 +01:00
Stefano Brivio	b899141ad5	Add interfaces and configuration bits for passt-repair In vhost-user mode, by default, create a second UNIX domain socket accepting connections from passt-repair, with the usual listener socket. When we need to set or clear TCP_REPAIR on sockets, we'll send them via SCM_RIGHTS to passt-repair, who sets the socket option values we ask for. To that end, introduce batched functions to request TCP_REPAIR settings on sockets, so that we don't have to send a single message for each socket, on migration. When needed, repair_flush() will send the message and check for the reply. Co-authored-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-12 19:47:28 +01:00
David Gibson	155cd0c41e	migrate: Migrate guest observed addresses Most of the information in struct ctx doesn't need to be migrated. Either it's strictly back end information which is allowed to differ between the two ends, or it must already be configured identically on the two ends. There are a few exceptions though. In particular passt learns several addresses of the guest by observing what it sends out. If we lose this information across migration we might get away with it, but if there are active flows we might misdirect some packets before re-learning the guest address. Avoid this by migrating the guest's observed addresses. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Coding style stuff, comments, etc.] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-12 19:47:17 +01:00
Stefano Brivio	5911e08c0f	migrate: Skeleton of live migration logic Introduce facilities for guest migration on top of vhost-user infrastructure. Add migration facilities based on top of the current vhost-user infrastructure, moving vu_migrate() and related functions to migrate.c. Versioned migration stages define function pointers to be called on source or target, or data sections that need to be transferred. The migration header consists of a magic number, a version number for the encoding, and a "compat_version" which represents the oldest version which is compatible with the current one. We don't use it yet, but that allows for the future possibility of backwards compatible protocol extensions. Co-authored-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-12 19:47:07 +01:00
Stefano Brivio	836fe215e0	passt-repair: Fix off-by-one in check for number of file descriptors Actually, 254 is too many, but 253 isn't. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-12 19:46:46 +01:00
Laurent Vivier	def7de4690	tcp_vu: Fix off-by-one in header count array adjustment head_cnt represents the number of frames we're going to forward to the guest in tcp_vu_sock_recv(), each of which could require multiple buffers ("elements"). We initialise it with as many frames as we can find space for in vu buffers, and we then need to adjust it down to the number of frames we actually (partially) filled. We adjust it down based on the number of individual buffers used by the data from recvmsg(). At this point 'i' is one greater than that number of buffers, so we need to discard all (unused) frames with a buffer index >= i, instead of > i. Reported-by: David Gibson <david@gibson.dropbear.id.au> [david: Contributed actual commit message] Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-12 19:44:25 +01:00
Stefano Brivio	90f91fe726	tcp: Implement conservative zero-window probe on ACK timeout This probably doesn't cover all the cases where we should send a zero-window probe, but it's rather unobtrusive and obvious, so start from here, also because I just observed this case (without the fix from the previous patch, to take into account window information from keep-alive segments). If we hit the ACK timeout, and try re-sending data from the socket, if the window is zero, we'll just fail again, go back to the timer, and so on, until we hit the maximum number of re-transmissions and reset the connection. Don't do that: forcibly try to send something by implementing the equivalent of a zero-window probe in this case. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2025-02-12 19:43:55 +01:00
Stefano Brivio	472e2e930f	tcp: Don't discard window information on keep-alive segments It looks like a detail, but it's critical if we're dealing with somebody, such as near-future self, using TCP_REPAIR to migrate TCP connections in the guest or container. The last packet sent from the 'source' process/guest/container typically reports a small window, or zero, because the guest/container hadn't been draining it for a while. The next packet, appearing as the target sets TCP_REPAIR_OFF on the migrated socket, is a keep-alive (also called "window probe" in CRIU or TCP_REPAIR-related code), and it comes with an updated window value, reflecting the pre-migration "regular" value. If we ignore it, it might take a while/forever before we realise we can actually restart sending. Fixes: 238c69f9af45 ("tcp: Acknowledge keep-alive segments, ignore them for the rest") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2025-02-12 19:34:15 +01:00
Enrique Llorente	31e8109a86	dhcp, dhcpv6: Add hostname and client fqdn ops Both DHCPv4 and DHCPv6 have the capability to pass the hostname to clients: DHCPv4 uses option 12 (hostname) while DHCPv6 uses option 39 (client fqdn). Some virt deployments, like kubevirt, expect the VirtualMachine name as the guest hostname. This change adds the following arguments: - -H --hostname NAME to configure the hostname DHCPv4 option(12) - --fqdn NAME to configure client fqdn option for both DHCPv4(81) and DHCPv6(39) Signed-off-by: Enrique Llorente <ellorent@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-10 18:30:24 +01:00
Stefano Brivio	a3d142a6f6	conf: Don't map DNS traffic to host, if host gateway is a resolver This should be a relatively common case and I'm a bit surprised it's been broken since I added the "gateway mapping" functionality, but it doesn't happen with Podman, and not with systemd-resolved or similar local proxies, and also not with servers where typically the gateway is just a router and not a DNS resolver. That could be the reason why nobody noticed until now. By default, we'll map the address of the default gateway, in containers and guests, to represent "the host", so that we have a well-defined way to reach the host. Say: 0.0029: NAT to host 127.0.0.1: 192.168.100.1 But if the host gateway is also a DNS resolver: 0.0029: DNS: 0.0029: 192.168.100.1 then we'll send DNS queries directed to it to the host instead: 0.0372: Flow 0 (INI): TAP [192.168.100.157]:41892 -> [192.168.100.1]:53 => ? 0.0372: Flow 0 (TGT): INI -> TGT 0.0373: Flow 0 (TGT): TAP [192.168.100.157]:41892 -> [192.168.100.1]:53 => HOST [0.0.0.0]:41892 -> [127.0.0.1]:53 0.0373: Flow 0 (UDP flow): TGT -> TYPED 0.0373: Flow 0 (UDP flow): TAP [192.168.100.157]:41892 -> [192.168.100.1]:53 => HOST [0.0.0.0]:41892 -> [127.0.0.1]:53 0.0373: Flow 0 (UDP flow): Side 0 hash table insert: bucket: 31049 0.0374: Flow 0 (UDP flow): TYPED -> ACTIVE 0.0374: Flow 0 (UDP flow): TAP [192.168.100.157]:41892 -> [192.168.100.1]:53 => HOST [0.0.0.0]:41892 -> [127.0.0.1]:53 which doesn't quite work, of course: 0.0374: pasta: epoll event on UDP reply socket 95 (events: 0x00000008) 0.0374: ICMP error on UDP socket 95: Connection refused unless the host is a resolver itself... but then we wouldn't find the address of the gateway in its /etc/resolv.conf, presumably. Fix this by making an exception for DNS traffic: if the default gateway is a resolver, match on DNS traffic going to the default gateway, and explicitly forward it to the configured resolver. Reported-by: Prafulla Giri <prafulla.giri@protonmail.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-09 08:17:06 +01:00
Stefano Brivio	864be475d9	passt-repair: Send one confirmation per command, not per socket It looks like me, myself and I couldn't agree on the "simple" protocol between passt and passt-repair. The man page and passt say it's one confirmation per command, but the passt-repair implementation had one confirmation per socket instead. This caused all sort of mysterious issues with repair mode pseudo-randomly enabled, and leading to hours of fun (mostly not mine). Oops. Switch to one confirmation per command (of course). Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2025-02-09 08:16:41 +01:00
Enrique Llorente	fe8b6a7c42	dhcp: Don't re-use request message for reply The logic composing the DHCP reply message is reusing the request message to compose it, future long options like FQDN may exceed the request message limit making it go beyond the lower bound. This change creates a new reply message with a fixed options size of 308 and fills it in with proper fields from requests adding on top the generated options, this way the reply lower bound does not depend on the request. Signed-off-by: Enrique Llorente <ellorent@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-07 10:36:10 +01:00
Stefano Brivio	b7b70ba243	passt-repair: Dodge "structurally unreachable code" warning from Coverity While main() conventionally returns int, and we need a return at the end of the function to avoid compiler warnings, turning that return into _exit() to avoid exit handlers triggers a Coverity warning. It's unreachable code anyway, so switch that single occurrence back to a plain return. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2025-02-07 10:35:46 +01:00
Stefano Brivio	0f009ea598	passt-repair: Fix calculation of payload length from cmsg_len There's no inverse function for CMSG_LEN(), so we need to loop over SCM_MAX_FD (253) possible input values. The previous calculation is clearly wrong, as not every int takes CMSG_LEN(sizeof(int)) in cmsg data. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-07 10:35:17 +01:00
Stefano Brivio	a0b7f56b3a	passt-repair: Don't use perror(), accept ECONNRESET as termination If we use glibc's perror(), we need to allow dup() and fcntl() in our seccomp profiles, which are a bit too much for this simple helper. On top of that, we would probably need a wrapper to avoid allocation for translated messages. While at it: ECONNRESET is just a close() from passt, treat it like EOF. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2025-02-07 10:34:31 +01:00
Stefano Brivio	a5cca995de	conf, passt.1: Un-deprecate --host-lo-to-ns-lo It was established behaviour, and it's now the third report about it: users ask how to achieve the same functionality, and we don't have a better answer yet. The idea behind declaring it deprecated to start with, I guess, was that we would eventually replace it by more flexible and generic configuration options, which is still planned. But there's nothing preventing us to alias this in the future to a particular configuration. So, stop scaring users off, and un-deprecate this. Link: https://archives.passt.top/passt-dev/20240925102009.62b9a0ce@elisabeth/ Link: https://github.com/rootless-containers/rootlesskit/pull/482#issuecomment-2591855705 Link: https://github.com/moby/moby/issues/48838 Link: https://github.com/containers/podman/discussions/25243 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2025-02-06 11:14:30 +01:00
David Gibson	0da87b393b	debug: Add tcpdump to mbuto.img Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-06 09:43:09 +01:00
Stefano Brivio	f66769c2de	apparmor: Workaround for unconfined libvirtd when triggered by unprivileged user If libvirtd is triggered by an unprivileged user, the virt-aa-helper mechanism doesn't work, because per-VM profiles can't be instantiated, and as a result libvirtd runs unconfined. This means passt can't start, because the passt subprofile from libvirt's profile is not loaded either. Example: $ virsh start alpine error: Failed to start domain 'alpine' error: internal error: Child process (passt --one-off --socket /run/user/1000/libvirt/qemu/run/passt/1-alpine-net0.socket --pid /run/user/1000/libvirt/qemu/run/passt/1-alpine-net0-passt.pid --tcp-ports 40922:2) unexpected fatal signal 11 Add an annoying workaround for the moment being. Much better than encouraging users to start guests as root, or to disable AppArmor altogether. Reported-by: Prafulla Giri <prafulla.giri@protonmail.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-06 09:43:09 +01:00
Stefano Brivio	593be32774	passt-repair.1: Fix indication of TCP_REPAIR constants ...perhaps I should adopt the healthy habit of actually reading headers instead of using my mental copy. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2025-02-06 09:43:00 +01:00
Stefano Brivio	9215f68a0c	passt-repair: Build fixes for musl When building against musl headers: - sizeof() needs stddef.h, as it should be; - we can't initialise a struct msghdr by simply listing fields in order, as they contain explicit padding fields. Use field names instead. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2025-02-06 09:40:54 +01:00
Paul Holzinger	a9d63f91a5	passt-repair: use _exit() over return Returning from main() does the same as calling exit(), which is not good, as glibc might try to call futex(), which will be blocked by seccomp. See the previous commit "treewide: use _exit() over exit()" for a more detailed explanation. Signed-off-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-05 15:19:19 +01:00
Paul Holzinger	d0006fa784	treewide: use _exit() over exit() In the podman CI I noticed many seccomp denials in our logs even though tests passed: comm="pasta.avx2" exe="/usr/bin/pasta.avx2" sig=31 arch=c000003e syscall=202 compat=0 ip=0x7fb3d31f69db code=0x80000000 Which is futex being called and blocked by the pasta profile. After a few tries I managed to reproduce locally with this loop in ~20 min: while :; do podman run -d --network bridge quay.io/libpod/testimage:20241011 \ sleep 100 && \ sleep 10 && \ podman rm -fa -t0 done And using a pasta version with prctl(PR_SET_DUMPABLE, 1); set I got the following stack trace: Stack trace of thread 1: #0 0x00007fc95e6de91b __lll_lock_wait_private (libc.so.6 + 0x9491b) #1 0x00007fc95e68d6de __run_exit_handlers (libc.so.6 + 0x436de) #2 0x00007fc95e68d70e exit (libc.so.6 + 0x4370e) #3 0x000055f31b78c50b n/a (n/a + 0x0) #4 0x00007fc95e68d70e exit (libc.so.6 + 0x4370e) #5 0x000055f31b78d5a2 n/a (n/a + 0x0) Pasta got killed in exit(), it seems glibc is trying to use a lock when running exit handlers even though no exit handlers are defined. Given no exit handlers are needed we can call _exit() instead. This skips exit handlers and does not flush stdio streams compared to exit() which should be fine for the use here. Based on the input from Stefano I did not change the test/doc programs or qrap as they do not use seccomp filters. Signed-off-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-05 15:19:02 +01:00
David Gibson	745c163e60	tcp: Simplify handling of getsockname() For migration we need to get the specific local address and port for connected sockets with getsockname(). We currently open code marshalling the results into the flow entry. However, we already have inany_from_sockaddr() which handles the fiddly parts of this, so use it. Also report failures, which may make debugging problems easier. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Drop re-declarations of 'sa' and 'sl'] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-04 09:02:54 +01:00
David Gibson	b4a7b5d4a6	migrate: Fix several errors with passt-repair The passt-repair helper is now merged, but alas it contains several small bugs: * close() is not in the seccomp profile, meaning it will immediately SIGSYS when you make a request of it * The generated header, seccomp_repair.h isn't listed in .gitignore or removed by "make clean" Fixes: 8c24301462c3 ("Introduce passt-repair") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-04 08:52:27 +01:00
Stefano Brivio	dcf014be88	doc: Add mock of migration source and target These test programs show the migration of a TCP connection using the passt-repair helper. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-04 01:28:04 +01:00
Stefano Brivio	52e57f9c9a	tcp: Get socket port and address using getsockname() when connecting from guest For migration only: we need to store 'oport', our socket-side port, as we establish a connection from the guest, so that we can bind the same oport as source port in the migration target. Similar for 'oaddr': this is needed in case the migration target has additional network interfaces, and we need to make sure our socket is bound to the equivalent interface as it was on the source. Use getsockname() to fetch them. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-04 01:28:04 +01:00
Stefano Brivio	8c24301462	Introduce passt-repair A privileged helper to set/clear TCP_REPAIR on sockets on behalf of passt. Not used yet. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-04 01:28:04 +01:00
Stefano Brivio	e894d9ae82	vhost_user: Turn some vhost-user message reports to trace() Having every vhost-user message printed as part of debug output makes debugging anything else a bit complicated. Change per-packet debug() messages in vu_kick_cb() and vu_send_single() to trace() [dgibson: switch different messages to trace()] Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-04 01:28:04 +01:00
Stefano Brivio	e25a93032f	util: Add read_remainder() and read_all_buf() These are symmetric to write_remainder() and write_all_buf() and almost a copy and paste of them, with the most notable differences being reversed reads/writes and a couple of better-safe-than-sorry asserts to keep Coverity happy. I'll use them in the next patch. At least for the moment, they're going to be used for vhost-user mode only, so I'm not unconditionally enabling readv() in the seccomp profile: the caller has to ensure it's there. [dgibson: make read_remainder() take const pointer to iovec] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-04 01:28:04 +01:00
Stefano Brivio	71fa736277	tcp_splice, udp_flow: fcntl64() support on PPC64 depends on glibc version I explicitly added fcntl64() to the list of allowed system calls for PPC64 a while ago, and now it turns out it's not available in recent Debian builds. The warning from seccomp.sh is harmless because we unconditionally try to enable fcntl() anyway, but take care of it anyway. Link: https://buildd.debian.org/status/fetch.php?pkg=passt&arch=ppc64&ver=0.0%7Egit20250121.4f2c8e7-1&stamp=1737477147&raw=0 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2025-02-03 22:43:35 +01:00
Stefano Brivio	b75ad159e8	vhost_user: On 32-bit ARM, mmap() is not available, mmap2() is used instead Link: https://buildd.debian.org/status/fetch.php?pkg=passt&arch=armel&ver=0.0%7Egit20250121.4f2c8e7-1&stamp=1737477467&raw=0 Link: https://buildd.debian.org/status/fetch.php?pkg=passt&arch=armhf&ver=0.0%7Egit20250121.4f2c8e7-1&stamp=1737477421&raw=0 Fixes: 31117b27c6c9 ("vhost-user: introduce vhost-user API") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2025-02-03 22:42:28 +01:00
Stefano Brivio	722d347c19	tcp: Don't reset outbound connection on SYN retries Reported by somebody on IRC: if the server has considerable latency, it might happen that the client retries sending SYN segments for the same flow while we're still in a TAP_SYN_RCVD, non-ESTABLISHED state. In that case, we shouldn't go with the blanket assumption that we need to reset the connection on any unexpected segment: RFC 9293 explicitly mentions this case in Figure 8: Recovery from Old Duplicate SYN, section 3.5. It doesn't make sense for us to set a specific sequence number, socket-side, but we should definitely wait and see. Ignoring the duplicate SYN segment should also be compatible with section 3.10.7.3. SYN-SENT STATE, which mentions updating sequences socket-side (which we can't do anyway), but certainly not resetting the connection. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2025-02-03 22:42:13 +01:00
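
Commit 0f009ea598 in the log above notes that CMSG_LEN() has no inverse function, so the number of file descriptors in an SCM_RIGHTS control message has to be found by looping over SCM_MAX_FD (253) candidate counts. The following is a minimal sketch of that idea, not the code from passt-repair: the helper name fd_count_from_cmsg_len() is hypothetical, and SCM_MAX_FD is defined locally on the assumption that userspace headers don't export it.

```c
#include <stddef.h>
#include <sys/socket.h>

/* Assumption: mirrors the kernel's per-message descriptor limit, which is
 * not exported to userspace, so it is defined here for illustration.
 */
#define SCM_MAX_FD 253

/* Hypothetical helper (not from passt-repair): given the cmsg_len field of
 * a received SCM_RIGHTS control message, return how many file descriptors
 * it carries, or -1 if the length matches no valid count.
 */
int fd_count_from_cmsg_len(size_t cmsg_len)
{
	int n;

	/* CMSG_LEN() includes the aligned header size, so try each count */
	for (n = 1; n <= SCM_MAX_FD; n++) {
		if (cmsg_len == CMSG_LEN(sizeof(int) * n))
			return n;
	}

	return -1;
}
```

A receiver would presumably call this with cmsg->cmsg_len after checking cmsg_level and cmsg_type, and reject the message on -1. This matches the off-by-one fix in 836fe215e0: a length corresponding to 254 descriptors is rejected, while 253 is accepted.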


1898 Commits