passt

mirror of https://passt.top/passt synced 2025-02-24 03:42:23 +00:00

Author	SHA1	Message	Date
Stefano Brivio	836fe215e0	passt-repair: Fix off-by-one in check for number of file descriptors Actually, 254 is too many, but 253 isn't. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-12 19:46:46 +01:00
Laurent Vivier	def7de4690	tcp_vu: Fix off-by-one in header count array adjustment head_cnt represents the number of frames we're going to forward to the guest in tcp_vu_sock_recv(), each of which could require multiple buffers ("elements"). We initialise it with as many frames as we can find space for in vu buffers, and we then need to adjust it down to the number of frames we actually (partially) filled. We adjust it down based on number of individual buffers used by the data from recvmsg(). At this point 'i' is one greater than that number of buffers, so we need to discard all (unused) frames with a buffer index >= i, instead of > i. Reported-by: David Gibson <david@gibson.dropbear.id.au> [david: Contributed actual commit message] Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-12 19:44:25 +01:00
Stefano Brivio	90f91fe726	tcp: Implement conservative zero-window probe on ACK timeout This probably doesn't cover all the cases where we should send a zero-window probe, but it's rather unobtrusive and obvious, so start from here, also because I just observed this case (without the fix from the previous patch, to take into account window information from keep-alive segments). If we hit the ACK timeout, and try re-sending data from the socket, if the window is zero, we'll just fail again, go back to the timer, and so on, until we hit the maximum number of re-transmissions and reset the connection. Don't do that: forcibly try to send something by implementing the equivalent of a zero-window probe in this case. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2025-02-12 19:43:55 +01:00
Stefano Brivio	472e2e930f	tcp: Don't discard window information on keep-alive segments It looks like a detail, but it's critical if we're dealing with somebody, such as near-future self, using TCP_REPAIR to migrate TCP connections in the guest or container. The last packet sent from the 'source' process/guest/container typically reports a small window, or zero, because the guest/container hadn't been draining it for a while. The next packet, appearing as the target sets TCP_REPAIR_OFF on the migrated socket, is a keep-alive (also called "window probe" in CRIU or TCP_REPAIR-related code), and it comes with an updated window value, reflecting the pre-migration "regular" value. If we ignore it, it might take a while/forever before we realise we can actually restart sending. Fixes: 238c69f9af45 ("tcp: Acknowledge keep-alive segments, ignore them for the rest") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2025-02-12 19:34:15 +01:00
Enrique Llorente	31e8109a86	dhcp, dhcpv6: Add hostname and client fqdn ops Both DHCPv4 and DHCPv6 have the capability to pass the hostname to clients: DHCPv4 uses option 12 (hostname) while DHCPv6 uses option 39 (client fqdn). For some virt deployments like kubevirt, the guest hostname is expected to be the VirtualMachine name. This change adds the following arguments: - -H --hostname NAME to configure the hostname DHCPv4 option(12) - --fqdn NAME to configure client fqdn option for both DHCPv4(81) and DHCPv6(39) Signed-off-by: Enrique Llorente <ellorent@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-10 18:30:24 +01:00
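As an illustration of the option layout mentioned in 31e8109a86, here is a minimal sketch of how a DHCPv4 hostname option (code 12) could be appended as code, length, value. The helper name dhcp_put_hostname() and its buffer handling are assumptions made for this example, not passt's actual code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DHCP_OPT_HOSTNAME 12	/* option code named in the commit message */

/* Hypothetical helper, not passt's code: write a hostname option as
 * code, length, value. Returns bytes written, or -1 if it doesn't fit.
 */
static int dhcp_put_hostname(uint8_t *opts, size_t size, const char *name)
{
	size_t len = strlen(name);

	/* The length field is a single byte, so the value is capped at 255 */
	if (len == 0 || len > 255 || size < len + 2)
		return -1;

	opts[0] = DHCP_OPT_HOSTNAME;
	opts[1] = (uint8_t)len;
	memcpy(opts + 2, name, len);	/* value is not NUL-terminated */

	return (int)(len + 2);
}
```

A caller would advance its write offset by the returned value before appending the next option.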
Stefano Brivio	a3d142a6f6	conf: Don't map DNS traffic to host, if host gateway is a resolver This should be a relatively common case and I'm a bit surprised it's been broken since I added the "gateway mapping" functionality, but it doesn't happen with Podman, and not with systemd-resolved or similar local proxies, and also not with servers where typically the gateway is just a router and not a DNS resolver. That could be the reason why nobody noticed until now. By default, we'll map the address of the default gateway, in containers and guests, to represent "the host", so that we have a well-defined way to reach the host. Say: 0.0029: NAT to host 127.0.0.1: 192.168.100.1 But if the host gateway is also a DNS resolver: 0.0029: DNS: 0.0029: 192.168.100.1 then we'll send DNS queries directed to it to the host instead: 0.0372: Flow 0 (INI): TAP [192.168.100.157]:41892 -> [192.168.100.1]:53 => ? 0.0372: Flow 0 (TGT): INI -> TGT 0.0373: Flow 0 (TGT): TAP [192.168.100.157]:41892 -> [192.168.100.1]:53 => HOST [0.0.0.0]:41892 -> [127.0.0.1]:53 0.0373: Flow 0 (UDP flow): TGT -> TYPED 0.0373: Flow 0 (UDP flow): TAP [192.168.100.157]:41892 -> [192.168.100.1]:53 => HOST [0.0.0.0]:41892 -> [127.0.0.1]:53 0.0373: Flow 0 (UDP flow): Side 0 hash table insert: bucket: 31049 0.0374: Flow 0 (UDP flow): TYPED -> ACTIVE 0.0374: Flow 0 (UDP flow): TAP [192.168.100.157]:41892 -> [192.168.100.1]:53 => HOST [0.0.0.0]:41892 -> [127.0.0.1]:53 which doesn't quite work, of course: 0.0374: pasta: epoll event on UDP reply socket 95 (events: 0x00000008) 0.0374: ICMP error on UDP socket 95: Connection refused unless the host is a resolver itself... but then we wouldn't find the address of the gateway in its /etc/resolv.conf, presumably. Fix this by making an exception for DNS traffic: if the default gateway is a resolver, match on DNS traffic going to the default gateway, and explicitly forward it to the configured resolver. Reported-by: Prafulla Giri <prafulla.giri@protonmail.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-09 08:17:06 +01:00
Stefano Brivio	864be475d9	passt-repair: Send one confirmation per command, not per socket It looks like me, myself and I couldn't agree on the "simple" protocol between passt and passt-repair. The man page and passt say it's one confirmation per command, but the passt-repair implementation had one confirmation per socket instead. This caused all sorts of mysterious issues with repair mode pseudo-randomly enabled, leading to hours of fun (mostly not mine). Oops. Switch to one confirmation per command (of course). Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2025-02-09 08:16:41 +01:00
Enrique Llorente	fe8b6a7c42	dhcp: Don't re-use request message for reply The logic composing the DHCP reply message reuses the request message to compose it; future long options like FQDN may exceed the request message limit, making it go beyond the lower bound. This change creates a new reply message with a fixed options size of 308 and fills it in with the proper fields from the request, adding the generated options on top. This way, the reply lower bound does not depend on the request. Signed-off-by: Enrique Llorente <ellorent@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-07 10:36:10 +01:00
Stefano Brivio	b7b70ba243	passt-repair: Dodge "structurally unreachable code" warning from Coverity While main() conventionally returns int, and we need a return at the end of the function to avoid compiler warnings, turning that return into _exit() to avoid exit handlers triggers a Coverity warning. It's unreachable code anyway, so switch that single occurrence back to a plain return. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2025-02-07 10:35:46 +01:00
Stefano Brivio	0f009ea598	passt-repair: Fix calculation of payload length from cmsg_len There's no inverse function for CMSG_LEN(), so we need to loop over SCM_MAX_FD (253) possible input values. The previous calculation is clearly wrong, as not every int takes CMSG_LEN(sizeof(int)) in cmsg data. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-07 10:35:17 +01:00
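The loop described in 0f009ea598 might look roughly like the sketch below: since CMSG_LEN() has no inverse, each candidate descriptor count is tried in turn. The function name fd_count_from_cmsg_len() is an assumption for illustration, and SCM_MAX_FD is defined locally with the value quoted in the commit message because userspace headers don't normally export it.

```c
#include <stddef.h>
#include <sys/socket.h>

#ifndef SCM_MAX_FD
#define SCM_MAX_FD 253	/* value quoted in the commit message */
#endif

/* Hypothetical helper, not passt-repair's code: return the number of
 * file descriptors carried by a control message of length cmsg_len,
 * or -1 if no count between 1 and SCM_MAX_FD matches.
 */
static int fd_count_from_cmsg_len(size_t cmsg_len)
{
	int n;

	for (n = 1; n <= SCM_MAX_FD; n++) {
		/* CMSG_LEN() adds the header size, so compare, don't divide */
		if (cmsg_len == CMSG_LEN(sizeof(int) * n))
			return n;
	}

	return -1;
}
```

Comparing against CMSG_LEN() for each count avoids assuming that every descriptor occupies CMSG_LEN(sizeof(int)) bytes, which is the mistake the commit message points out.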
Stefano Brivio	a0b7f56b3a	passt-repair: Don't use perror(), accept ECONNRESET as termination If we use glibc's perror(), we need to allow dup() and fcntl() in our seccomp profiles, which are a bit too much for this simple helper. On top of that, we would probably need a wrapper to avoid allocation for translated messages. While at it: ECONNRESET is just a close() from passt, treat it like EOF. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2025-02-07 10:34:31 +01:00
Stefano Brivio	a5cca995de	conf, passt.1: Un-deprecate --host-lo-to-ns-lo It was established behaviour, and it's now the third report about it: users ask how to achieve the same functionality, and we don't have a better answer yet. The idea behind declaring it deprecated to start with, I guess, was that we would eventually replace it by more flexible and generic configuration options, which is still planned. But there's nothing preventing us from aliasing this in the future to a particular configuration. So, stop scaring users off, and un-deprecate this. Link: https://archives.passt.top/passt-dev/20240925102009.62b9a0ce@elisabeth/ Link: https://github.com/rootless-containers/rootlesskit/pull/482#issuecomment-2591855705 Link: https://github.com/moby/moby/issues/48838 Link: https://github.com/containers/podman/discussions/25243 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2025-02-06 11:14:30 +01:00
David Gibson	0da87b393b	debug: Add tcpdump to mbuto.img Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-06 09:43:09 +01:00
Stefano Brivio	f66769c2de	apparmor: Workaround for unconfined libvirtd when triggered by unprivileged user If libvirtd is triggered by an unprivileged user, the virt-aa-helper mechanism doesn't work, because per-VM profiles can't be instantiated, and as a result libvirtd runs unconfined. This means passt can't start, because the passt subprofile from libvirt's profile is not loaded either. Example: $ virsh start alpine error: Failed to start domain 'alpine' error: internal error: Child process (passt --one-off --socket /run/user/1000/libvirt/qemu/run/passt/1-alpine-net0.socket --pid /run/user/1000/libvirt/qemu/run/passt/1-alpine-net0-passt.pid --tcp-ports 40922:2) unexpected fatal signal 11 Add an annoying workaround for the moment being. Much better than encouraging users to start guests as root, or to disable AppArmor altogether. Reported-by: Prafulla Giri <prafulla.giri@protonmail.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-06 09:43:09 +01:00
Stefano Brivio	593be32774	passt-repair.1: Fix indication of TCP_REPAIR constants ...perhaps I should adopt the healthy habit of actually reading headers instead of using my mental copy. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2025-02-06 09:43:00 +01:00
Stefano Brivio	9215f68a0c	passt-repair: Build fixes for musl When building against musl headers: - sizeof() needs stddef.h, as it should be; - we can't initialise a struct msghdr by simply listing fields in order, as they contain explicit padding fields. Use field names instead. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2025-02-06 09:40:54 +01:00
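The second musl issue from 9215f68a0c can be illustrated with designated initialisers, which don't depend on the field order or on padding members of struct msghdr. This is a minimal sketch: make_msghdr() is a hypothetical helper, not taken from passt-repair.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Hypothetical helper: build a msghdr for a single iovec plus ancillary
 * data. Fields not named here, including any libc-specific padding, are
 * zero-initialised.
 */
static struct msghdr make_msghdr(struct iovec *iov, void *cmsg_buf,
				 size_t cmsg_len)
{
	struct msghdr msg = {
		.msg_iov	= iov,
		.msg_iovlen	= 1,
		.msg_control	= cmsg_buf,
		.msg_controllen	= cmsg_len,
	};

	return msg;
}
```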
Paul Holzinger	a9d63f91a5	passt-repair: use _exit() over return When returning from main it does the same as calling exit() which is not good as glibc might try to call futex() which will be blocked by seccomp. See the previous commit "treewide: use _exit() over exit()" for a more detailed explanation. Signed-off-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-05 15:19:19 +01:00
Paul Holzinger	d0006fa784	treewide: use _exit() over exit() In the podman CI I noticed many seccomp denials in our logs even though tests passed: comm="pasta.avx2" exe="/usr/bin/pasta.avx2" sig=31 arch=c000003e syscall=202 compat=0 ip=0x7fb3d31f69db code=0x80000000 Which is futex being called and blocked by the pasta profile. After a few tries I managed to reproduce locally with this loop in ~20 min: while :; do podman run -d --network bridge quay.io/libpod/testimage:20241011 \ sleep 100 && \ sleep 10 && \ podman rm -fa -t0 done And using a pasta version with prctl(PR_SET_DUMPABLE, 1); set I got the following stack trace: Stack trace of thread 1: #0 0x00007fc95e6de91b __lll_lock_wait_private (libc.so.6 + 0x9491b) #1 0x00007fc95e68d6de __run_exit_handlers (libc.so.6 + 0x436de) #2 0x00007fc95e68d70e exit (libc.so.6 + 0x4370e) #3 0x000055f31b78c50b n/a (n/a + 0x0) #4 0x00007fc95e68d70e exit (libc.so.6 + 0x4370e) #5 0x000055f31b78d5a2 n/a (n/a + 0x0) Pasta got killed in exit(), it seems glibc is trying to use a lock when running exit handlers even though no exit handlers are defined. Given no exit handlers are needed we can call _exit() instead. This skips exit handlers and does not flush stdio streams compared to exit() which should be fine for the use here. Based on the input from Stefano I did not change the test/doc programs or qrap as they do not use seccomp filters. Signed-off-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-05 15:19:02 +01:00
David Gibson	745c163e60	tcp: Simplify handling of getsockname() For migration we need to get the specific local address and port for connected sockets with getsockname(). We currently open code marshalling the results into the flow entry. However, we already have inany_from_sockaddr() which handles the fiddly parts of this, so use it. Also report failures, which may make debugging problems easier. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Drop re-declarations of 'sa' and 'sl'] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-04 09:02:54 +01:00
David Gibson	b4a7b5d4a6	migrate: Fix several errors with passt-repair The passt-repair helper is now merged, but alas it contains several small bugs: * close() is not in the seccomp profile, meaning it will immediately SIGSYS when you make a request of it * The generated header, seccomp_repair.h isn't listed in .gitignore or removed by "make clean" Fixes: 8c24301462c3 ("Introduce passt-repair") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-04 08:52:27 +01:00
Stefano Brivio	dcf014be88	doc: Add mock of migration source and target These test programs show the migration of a TCP connection using the passt-repair helper. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-04 01:28:04 +01:00
Stefano Brivio	52e57f9c9a	tcp: Get socket port and address using getsockname() when connecting from guest For migration only: we need to store 'oport', our socket-side port, as we establish a connection from the guest, so that we can bind the same oport as source port in the migration target. Similar for 'oaddr': this is needed in case the migration target has additional network interfaces, and we need to make sure our socket is bound to the equivalent interface as it was on the source. Use getsockname() to fetch them. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-04 01:28:04 +01:00
Stefano Brivio	8c24301462	Introduce passt-repair A privileged helper to set/clear TCP_REPAIR on sockets on behalf of passt. Not used yet. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-04 01:28:04 +01:00
Stefano Brivio	e894d9ae82	vhost_user: Turn some vhost-user message reports to trace() Having every vhost-user message printed as part of debug output makes debugging anything else a bit complicated. Change per-packet debug() messages in vu_kick_cb() and vu_send_single() to trace() [dgibson: switch different messages to trace()] Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-04 01:28:04 +01:00
Stefano Brivio	e25a93032f	util: Add read_remainder() and read_all_buf() These are symmetric to write_remainder() and write_all_buf() and almost a copy and paste of them, with the most notable differences being reversed reads/writes and a couple of better-safe-than-sorry asserts to keep Coverity happy. I'll use them in the next patch. At least for the moment, they're going to be used for vhost-user mode only, so I'm not unconditionally enabling readv() in the seccomp profile: the caller has to ensure it's there. [dgibson: make read_remainder() take const pointer to iovec] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-04 01:28:04 +01:00
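For readers unfamiliar with the pattern behind e25a93032f, a "read everything or fail" loop could be sketched as follows. This is not the actual read_all_buf() from util.c: the name read_all_sketch(), the EINTR handling and the choice of ENODATA on early EOF are assumptions for this example.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical sketch: read exactly len bytes from fd into buf.
 * Returns 0 on success, -1 with errno set on error or early EOF.
 */
static int read_all_sketch(int fd, void *buf, size_t len)
{
	char *p = buf;
	size_t left = len;

	while (left) {
		ssize_t rc = read(fd, p, left);

		if (rc < 0) {
			if (errno == EINTR)	/* interrupted, just retry */
				continue;
			return -1;
		}

		if (rc == 0) {			/* EOF before len bytes */
			errno = ENODATA;
			return -1;
		}

		p += rc;
		left -= rc;
	}

	return 0;
}
```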
Stefano Brivio	71fa736277	tcp_splice, udp_flow: fcntl64() support on PPC64 depends on glibc version I explicitly added fcntl64() to the list of allowed system calls for PPC64 a while ago, and now it turns out it's not available in recent Debian builds. The warning from seccomp.sh is harmless because we unconditionally try to enable fcntl() anyway, but take care of it anyway. Link: https://buildd.debian.org/status/fetch.php?pkg=passt&arch=ppc64&ver=0.0%7Egit20250121.4f2c8e7-1&stamp=1737477147&raw=0 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2025-02-03 22:43:35 +01:00
Stefano Brivio	b75ad159e8	vhost_user: On 32-bit ARM, mmap() is not available, mmap2() is used instead Link: https://buildd.debian.org/status/fetch.php?pkg=passt&arch=armel&ver=0.0%7Egit20250121.4f2c8e7-1&stamp=1737477467&raw=0 Link: https://buildd.debian.org/status/fetch.php?pkg=passt&arch=armhf&ver=0.0%7Egit20250121.4f2c8e7-1&stamp=1737477421&raw=0 Fixes: 31117b27c6c9 ("vhost-user: introduce vhost-user API") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2025-02-03 22:42:28 +01:00
Stefano Brivio	722d347c19	tcp: Don't reset outbound connection on SYN retries Reported by somebody on IRC: if the server has considerable latency, it might happen that the client retries sending SYN segments for the same flow while we're still in a TAP_SYN_RCVD, non-ESTABLISHED state. In that case, we shouldn't go with the blanket assumption that we need to reset the connection on any unexpected segment: RFC 9293 explicitly mentions this case in Figure 8: Recovery from Old Duplicate SYN, section 3.5. It doesn't make sense for us to set a specific sequence number, socket-side, but we should definitely wait and see. Ignoring the duplicate SYN segment should also be compatible with section 3.10.7.3. SYN-SENT STATE, which mentions updating sequences socket-side (which we can't do anyway), but certainly not reset the connection. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2025-02-03 22:42:13 +01:00
7ppKb5bW	bf2860819d	pasta.te: fix demo.sh and remove one duplicate rule On Fedora 41, without "allow pasta_t unconfined_t:dir read" /usr/bin/pasta can't open /proc/[pid]/ns, which is required by pasta_netns_quit_init(). This patch also removes one duplicate rule "allow pasta_t nsfs_t:file read;", "allow pasta_t nsfs_t:file { open read };" at line 123 is enough. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-03 07:33:14 +01:00
Stefano Brivio	dcd6d8191a	tcp: Add HOSTSIDE(x), HOSTFLOW(x) macros Those are symmetric to TAPSIDE(x)/TAPFLOW(x) and I'll use them in the next patch to extract 'oport' in order to re-bind sockets to the original socket-side local port. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2025-02-03 07:32:53 +01:00
David Gibson	0349cf637f	util: Rename and make global vu_remove_watch() vu_remove_watch() is used in vhost_user.c to remove an fd from the global epoll set. There's nothing really vhost user specific about it though, so rename, move to util.c and use it in a bunch of places outside vhost_user.c where it makes things marginally more readable. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-03 07:32:51 +01:00
David Gibson	10c4a9e1b3	tcp: Always pass NULL event with EPOLL_CTL_DEL In tcp_epoll_ctl() we pass an event pointer with EPOLL_CTL_DEL, even though it will be ignored. It's possible this was a workaround for pre-2.6.9 kernels which required a non-NULL pointer here, but we rely on the kernel accepting NULL events for EPOLL_CTL_DEL in lots of other places. Use NULL instead for simplicity and consistency. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-02-03 07:32:37 +01:00
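The convention adopted in 10c4a9e1b3 amounts to the call below. epoll_del_fd() is a hypothetical wrapper name used only for this sketch; the kernel ignores the event argument for EPOLL_CTL_DEL, and kernels from 2.6.9 onwards accept NULL there.

```c
#include <stddef.h>
#include <sys/epoll.h>

/* Hypothetical wrapper: remove fd from the epoll set referred to by
 * epollfd. Returns 0 on success, -1 with errno set on failure.
 */
static int epoll_del_fd(int epollfd, int fd)
{
	/* No event structure is needed to delete a watch */
	return epoll_ctl(epollfd, EPOLL_CTL_DEL, fd, NULL);
}
```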
Laurent Vivier	dd6a6854c7	vhost-user: Implement an empty VHOST_USER_SEND_RARP command Passt cannot manage and doesn't need to manage the broadcast of a fake RARP, but QEMU will report an error message if Passt doesn't implement it. Implement an empty SEND_RARP command to silence QEMU error message. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-01-24 21:40:05 +01:00
Stefano Brivio	d477a1fb03	netlink: Skip loopback interface while looking for a template There might be reasons to have routes on the loopback interface, for example Any-IP/AnyIP routes as implemented by Linux kernel commit ab79ad14a2d5 ("ipv6: Implement Any-IP support for IPv6."). If we use the loopback interface as a template, though, we'll pick 'lo' (typically) as interface name for our tap interface, but we'll already have an interface called 'lo' in the target namespace, and as we TUNSETIFF on it, we'll fail with EINVAL, because it's not a tap interface. Skip the loopback interface while looking for a template interface or, more accurately, skip the interface with index 1. Strictly speaking, we should fetch interface flags via RTM_GETLINK instead, and check for IFF_LOOPBACK, but interleaving that request while we're iterating over routes is unnecessarily complicated. Link: https://www.reddit.com/r/podman/comments/1i6pj7u/starting_pod_without_external_network/ Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2025-01-24 21:39:52 +01:00
Laurent Vivier	4f2c8e7913	vhost_user: Drop packet with unsupported iovec array If the iovec array cannot be managed, drop it rather than passing the second entry to tap_add_packet(). Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> 2025_01_21.4f2c8e7	2025-01-21 14:30:42 +01:00
Stefano Brivio	ec5c4d936d	tcp: Set PSH flag for last incoming packets in a batch So far we omitted setting PSH flags for inbound traffic altogether: as we ignore the nature of the data we're sending, we can't conclude that some data is more or less urgent. This works fine with Linux guests, as the Linux kernel doesn't do much with it, on input: it will generally deliver data to the application layer without delay. However, with Windows, things change: if we don't set the PSH flag on interactive inbound traffic, we can expect long delays before the data is delivered to the application. This is very visible with RDP, where packets we send on behalf of the RDP client are delivered with delays exceeding one second: $ tshark -r rdp.pcap -td -Y 'frame.number in { 33170 .. 33173 }' --disable-protocol tls 33170 0.030296 93.235.154.248 → 88.198.0.164 54 TCP 49012 → 3389 [ACK] Seq=13820 Ack=285229 Win=387968 Len=0 33171 0.985412 88.198.0.164 → 93.235.154.248 105 TCP 3389 → 49012 [PSH, ACK] Seq=285229 Ack=13820 Win=63198 Len=51 33172 0.030373 93.235.154.248 → 88.198.0.164 54 TCP 49012 → 3389 [ACK] Seq=13820 Ack=285280 Win=387968 Len=0 33173 1.383776 88.198.0.164 → 93.235.154.248 424 TCP 3389 → 49012 [PSH, ACK] Seq=285280 Ack=13820 Win=63198 Len=370 in this example (packet capture taken by passt), frame #33172 is a mouse event sent by the RDP client, and frame #33173 is the first event (display reacting to click) sent back by the server. This appears as a 1.4 s delay before we get frame #33173. If we set PSH, instead: $ tshark -r rdp_psh.pcap -td -Y 'frame.number in { 314 .. 317 }' --disable-protocol tls 314 0.002503 93.235.154.248 → 88.198.0.164 170 TCP 51066 → 3389 [PSH, ACK] Seq=7779 Ack=74047 Win=31872 Len=116 315 0.000557 88.198.0.164 → 93.235.154.248 54 TCP 3389 → 51066 [ACK] Seq=79162 Ack=7895 Win=62872 Len=0 316 0.012752 93.235.154.248 → 88.198.0.164 170 TCP 51066 → 3389 [PSH, ACK] Seq=7895 Ack=79162 Win=31872 Len=116 317 0.011927 88.198.0.164 → 93.235.154.248 107 TCP 3389 → 51066 [PSH, ACK] Seq=79162 Ack=8011 Win=62756 Len=53 here, in frame #316, our mouse event is delivered without a delay and receives a response in approximately 12 ms. Set PSH on the last segment for any batch we dequeue from the socket, that is, set it whenever we know that we might not be sending data to the same port for a while. Reported-by: NN708 Link: https://bugs.passt.top/show_bug.cgi?id=107 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-01-21 14:28:44 +01:00
Stefano Brivio	db2c91ae86	tcp: Set ACK flag on all RST segments, even for client in SYN-SENT state Somewhat curiously, RFC 9293, section 3.10.7.3, states: If the state is SYN-SENT, then [...] Second, check the RST bit: - If the RST bit is set, [...] o If the ACK was acceptable, then signal to the user "error: connection reset", drop the segment, enter CLOSED state, delete TCB, and return. Otherwise (no ACK), drop the segment and return. which matches verbatim RFC 793, pages 66-67, and is implemented as-is by tcp_rcv_synsent_state_process() in the Linux kernel, that is: /* No ACK in the segment */ if (th->rst) { /* rfc793: * "If the RST bit is set * * Otherwise (no ACK) drop the segment and return." */ goto discard_and_undo; } meaning that if a client is in SYN-SENT state, and we send a RST segment once we realise that we can't establish the outbound connection, the client will ignore our segment and will need to pointlessly wait until the connection times out instead of aborting it right away. The ACK flag on a RST, in this case, doesn't really seem to have any function, but we must set it nevertheless. The ACK sequence number is already correct because we always set it before calling tcp_prepare_flags(), whenever relevant. This leaves us with no cases where we should not set the ACK flag on non-SYN segments, so always set the ACK flag for RST segments. Note that non-SYN, non-RST segments were already covered by commit 4988e2b40631 ("tcp: Unconditionally force ACK for all !SYN, !RST packets"). Reported-by: Dirk Janssen <Dirk.Janssen@schiphol.nl> Reported-by: Roeland van de Pol <Roeland.van.de.Pol@schiphol.nl> Reported-by: Robert Floor <Robert.Floor@schiphol.nl> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-01-21 14:28:44 +01:00
Stefano Brivio	54bb972cfb	tcp: Disable Nagle's algorithm (set TCP_NODELAY) on all sockets Following up on 725acd111ba3 ("tcp_splice: Set (again) TCP_NODELAY on both sides"), David argues that, in general, we don't know what kind of TCP traffic we're dealing with, on any side or path. TCP segments might have been delivered to our socket with a PSH flag, but we don't have a way to know about it. Similarly, the guest might send us segments with PSH or URG set, but we don't know if we should generally TCP_CORK sockets and uncork on those flags, because that would assume they're running a Linux kernel (and a particular version of it) matching the kernel that delivers outbound packets for us. Given that we can't make any assumption and everything might very well be interactive traffic, disable Nagle's algorithm on all non-spliced sockets as well. After all, John Nagle himself is nowadays recommending that delayed ACKs should never be enabled together with his algorithm, but we don't have a practical way to ensure that our environment is free from delayed ACKs (TCP_QUICKACK is not really usable for this purpose): https://news.ycombinator.com/item?id=34180239 Suggested-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2025-01-21 14:28:37 +01:00
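Disabling Nagle's algorithm as described in 54bb972cfb comes down to one socket option. The sketch below shows the generic call; sock_set_nodelay() is a hypothetical helper name, and where exactly passt makes this call is not shown here.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: disable Nagle's algorithm on TCP socket s.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int sock_set_nodelay(int s)
{
	int one = 1;

	return setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}
```

With this option set, small writes are sent as soon as possible instead of being held back while earlier data is still unacknowledged.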
Stefano Brivio	8757834d14	tcp: Buffer sizes are not inherited on accept()/accept4() ...so it's pointless to set SO_RCVBUF and SO_SNDBUF on listening sockets. Call tcp_sock_set_bufsize() after accept4(), for inbound sockets. As we didn't have large buffer sizes set for inbound sockets for a long time (they are set explicitly only if the maximum size is big enough, more than the ~200 KiB default), I ran some more throughput tests for this one, and I see slightly better numbers (say, 17 gbps instead of 15 gbps guest to host without vhost-user). Fixes: 904b86ade7db ("tcp: Rework window handling, timers, add SO_RCVLOWAT and pools for sockets/pipes") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2025-01-21 14:28:14 +01:00
Laurent Vivier	c96a88d550	vhost_user: remove ASSERT() on iovec number Replace ASSERT() on the number of iovec in the element and on the first entry length by a debug() message. Signed-off-by: Laurent Vivier <lvivier@redhat.com> [sbrivio: Fix typo in failure message] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-01-20 19:51:24 +01:00
Laurent Vivier	412ed4f09f	vhost-user: Report to front-end we support VHOST_USER_PROTOCOL_F_DEVICE_STATE Report to front-end that we support device state commands: VHOST_USER_CHECK_DEVICE_STATE VHOST_USER_SET_LOG_BASE This feature is needed to transfer backend state using frontend channel. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-01-20 19:51:24 +01:00
Laurent Vivier	31d70024be	vhost-user: add VHOST_USER_SET_DEVICE_STATE_FD command Set the file descriptor to use to transfer the backend device state during migration. Signed-off-by: Laurent Vivier <lvivier@redhat.com> [sbrivio: Fixed nits and coding style here and there] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-01-20 19:51:24 +01:00
Laurent Vivier	878e163454	vhost-user: add VHOST_USER_CHECK_DEVICE_STATE command After transferring the back-end’s internal state during migration, check whether the back-end was able to successfully fully process the state. The value returned indicates success or error; 0 is success, any non-zero value is an error. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-01-20 19:51:24 +01:00
Laurent Vivier	78c73e9395	vhost-user: Report to front-end we support VHOST_USER_PROTOCOL_F_LOG_SHMFD This feature allows QEMU to be migrated. We also need to report VHOST_F_LOG_ALL. This protocol feature reports we can log the page update and implement VHOST_USER_SET_LOG_BASE and VHOST_USER_SET_LOG_FD. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-01-20 19:51:24 +01:00
Laurent Vivier	3c1d91b816	vhost-user: add VHOST_USER_SET_LOG_BASE command Sets logging shared memory space. When the back-end has VHOST_USER_PROTOCOL_F_LOG_SHMFD protocol feature, the log memory fd is provided in the ancillary data of VHOST_USER_SET_LOG_BASE message, the size and offset of shared memory area provided in the message. Signed-off-by: Laurent Vivier <lvivier@redhat.com> [sbrivio: Fix coding style in a bunch of places] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-01-20 19:51:24 +01:00
Laurent Vivier	538312af19	vhost-user: Pass vu_dev to more virtio functions vu_dev will be needed to log page update. Add the parameter to: vring_used_write() vu_queue_fill_by_index() vu_queue_fill() vring_used_idx_set() vu_queue_flush() The new parameter is unused for now. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-01-20 19:51:24 +01:00
Laurent Vivier	b04195c60f	vhost-user: add VHOST_USER_SET_LOG_FD command VHOST_USER_SET_LOG_FD is an optional message with an eventfd in ancillary data, it may be used to inform the front-end that the log has been modified. Signed-off-by: Laurent Vivier <lvivier@redhat.com> [sbrivio: Fix comment to vu_set_log_fd_exec()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-01-20 19:51:24 +01:00
Laurent Vivier	6016e04a3a	vhost-user: update protocol features and commands list vhost-user protocol specification has been updated with feature flags and commands we will need to implement migration. Signed-off-by: Laurent Vivier <lvivier@redhat.com> [sbrivio: Fix comment to union vhost_user_payload] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-01-20 19:51:24 +01:00
Stefano Brivio	a8f4fc481c	tcp: Mask EPOLLIN altogether if we're blocked waiting on an ACK from the guest There are pretty much two cases of the (misnomer) STALLED: in one case, we could send more data to the guest if it becomes available, and in another case, we can't, because we filled the window. If, in this second case, we keep EPOLLIN enabled, but never read from the socket, we get short but CPU-annoying storms of EPOLLIN events, upon which we reschedule the ACK timeout handler, never read from the socket, go back to epoll_wait(), and so on: timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0 epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1 timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0 epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1 timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0 epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1 also known as: 29.1517: Flow 2 (TCP connection): timer expires in 2.000s 29.1517: Flow 2 (TCP connection): timer expires in 2.000s 29.1517: Flow 2 (TCP connection): timer expires in 2.000s which, for some reason, becomes very visible with muvm and aria2c downloading from a server nearby in parallel chunks. That's because EPOLLIN isn't cleared if we don't read from the socket, and even with EPOLLET, epoll_wait() will repeatedly wake us up until we actually read something. In this case, we don't want to subscribe to EPOLLIN at all: all we're waiting for is an ACK segment from the guest. Differentiate this case with a new connection flag, ACK_FROM_TAP_BLOCKS, which doesn't just indicate that we're waiting for an ACK from the guest (ACK_FROM_TAP_DUE), but also that we're blocked waiting for it. 
If this flag is set before we set STALLED, EPOLLIN will be masked while we set EPOLLET because of STALLED. Whenever we clear STALLED, we also clear this flag. This is definitely not elegant, but it's a minimal fix. We can probably simplify this at a later point by having a category of connection flags directly corresponding to epoll flags, and dropping STALLED altogether, or, perhaps, always using EPOLLET (but we need a mechanism to re-check sockets for pending data if we can't temporarily write to the guest). I suspect that this might also be implied in https://github.com/containers/podman/issues/23686, hence the Link: tag. It doesn't necessarily mean I'm fixing it (I can't reproduce that). Link: https://github.com/containers/podman/issues/23686 Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-01-16 21:15:33 +01:00
Stefano Brivio	b8f573cdc2	tcp: Set EPOLLET when reading from a socket fails with EAGAIN Before SO_PEEK_OFF support was introduced by commit e63d281871ef ("tcp: leverage support of SO_PEEK_OFF socket option when available"), we would peek data from sockets using a "discard" buffer as first iovec element, so that, unless we had no pending data at all, we would always get a positive return code from recvmsg() (except for closing connections or errors). If we couldn't send more data to the guest, in the window, we would set the STALLED flag (causing the epoll descriptor to switch to edge-triggered mode), and return early from tcp_data_from_sock(). With SO_PEEK_OFF, we don't have a discard buffer, and if there's data on the socket, but nothing beyond our current peeking offset, we'll get EAGAIN instead of our current "discard" length. In that case, we return even earlier, and we don't set EPOLLET on the socket as a result. As reported by Asahi Lina, this causes event loops where the kernel is signalling socket readiness, because there's data we didn't dequeue yet (waiting for the guest to acknowledge it), but we won't actually peek anything new, and return early without setting EPOLLET. This is the original report, mentioning the originally proposed fix: -- When there is unacknowledged data in the inbound socket buffer, passt leaves the socket in the epoll instance to accept new data from the server. 
Since there is already data in the socket buffer, an epoll without EPOLLET will repeatedly fire while no data is processed, busy-looping the CPU: epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4 recvmsg(25, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4 recvmsg(25, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable) Add in the missing EPOLLET flag for this case. This brings CPU usage down from around ~80% when downloading over TCP, to ~5% (use case: passt as network transport for muvm, downloading Steam games). -- we can't set EPOLLET unconditionally though, at least right now, because we don't monitor the guest tap for EPOLLOUT in case we fail to write on that side because we filled up that buffer (and not the window of a TCP connection). Instead, rely on the observation that, once a connection is established, we only get EAGAIN on recvmsg() if we are attempting to peek data from a socket with a non-zero peeking offset: we only peek when there's pending data on a socket, and in that case, if we peek without offset, we'll always see some data. And if we peek data with a non-zero offset and get EAGAIN, that means that we're either waiting for more data to arrive on the socket (which would cause further wake-ups, even with EPOLLET), or we're waiting for the guest to acknowledge some of it, which would anyway cause a wake-up. 
In that case, it's safe to set STALLED and, in turn, EPOLLET on the socket, which fixes the EPOLLIN event loop. While we're establishing a connection from the socket side, though, we'll call, once, tcp_{buf,vu}_data_from_sock() to see if we got any data while we were waiting for SYN, ACK from the guest. See the comment at the end of tcp_conn_from_sock_finish(). And if there's no data queued on the socket as we check, we'll also get EAGAIN, even if our peeking offset is zero. For this reason, we need to additionally check that 'already_sent' is not zero, meaning, explicitly, that our peeking offset is not zero. Reported-by: Asahi Lina <lina@asahilina.net> Fixes: e63d281871ef ("tcp: leverage support of SO_PEEK_OFF socket option when available") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2025-01-16 21:15:33 +01:00

1876 Commits