passt

mirror of https://passt.top/passt synced 2024-12-22 13:45:32 +00:00

Author	SHA1	Message	Date
David Gibson	f6d8dc2355	pif: Pass originating pif to tap handler functions For now, packets passed to the various *_tap_handler() functions always come from the single "tap" interface. We want to allow the possibility to broaden that in future. As preparation for that, have the code in tap.c pass the pif id of the originating interface to each of those handler functions. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-11-07 09:53:45 +01:00
Stefano Brivio	a469fc393f	tcp, tap: Don't increase tap-side sequence counter for dropped frames ...so that we'll retry sending them, instead of more-or-less silently dropping them. This happens quite frequently if our sending buffer on the UNIX domain socket is heavily constrained (for instance, by the 208 KiB default memory limit). It might be argued that dropping frames is part of the expected TCP flow: we don't dequeue those from the socket anyway, so we'll eventually retransmit them. But we don't need the receiver to tell us (by the way of duplicate or missing ACKs) that we couldn't send them: we already know as sendmsg() reports that. This seems to considerably increase throughput stability and throughput itself for TCP connections with default wmem_max values. Unfortunately, the 16 bits left as padding in the frame descriptors we use internally aren't enough to uniquely identify for which connection we should update sequence numbers: create a parallel array of pointers to sequence numbers and L4 lengths, of TCP_FRAMES_MEM size, and go through it after calling sendmsg(). Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-10-04 23:39:58 +02:00
David Gibson	6471c7d01b	cppcheck: Make many pointers const Newer versions of cppcheck (as of 2.12.0, at least) added a warning for pointers which could be declared to point at const data, but aren't. Based on that, make many pointers throughout the codebase const. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-10-04 23:23:35 +02:00
David Gibson	7b56117dae	udp, tap: Correctly advance through packets in udp_tap_handler() In both tap4_handler() and tap6_handler(), once we've sorted incoming l3 packets into "sequences", we then step through all the packets in each DUP sequence calling udp_tap_handler(). Or so it appears. In fact, udp_tap_handler() doesn't take an index and always starts with packet 0 of the sequence, even if called repeatedly. It appears to be written with the idea that the struct pool is a queue, from which it consumes packets as it processes them, but that's not how the pool data structure works. Correct this by adding an index parameter to udp_tap_handler() and altering the loops in tap.c to step through the pool properly. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-09-08 09:16:04 +02:00
David Gibson	043a70b885	tcp, tap: Correctly advance through packets in tcp_tap_handler() In both tap4_handler() and tap6_handler(), once we've sorted incoming l3 packets into "sequences", we then step through all the packets in each TCP sequence calling tcp_tap_handler(). Or so it appears. In fact, tcp_tap_handler() doesn't take an index and always looks at packet 0 of the sequence, except when it calls tcp_data_from_tap() to process data packets. It appears to be written with the idea that the struct pool is a queue, from which it consumes packets as it processes them, but that's not how the pool data structure works - they are more like an array of packets. We only get away with this, because setup packets for TCP tend to come in separate batches (because we need to reply in between) and so we only get a bunch of packets for the same connection together when they're data packets (tcp_data_from_tap() has its own loop through packets). Correct this by adding an index parameter to tcp_tap_handler() and altering the loops in tap.c to step through the pool properly. Link: https://bugs.passt.top/show_bug.cgi?id=68 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-09-08 09:15:46 +02:00
Stas Sergeev	d8c4f23ecd	tap: fix uses of l3_len in tap4_handler() l3_len was calculated from the ethernet frame size, and it was assumed to be equal to the length stored in an IP packet. But if the ethernet frame is padded, then l3_len calculated that way can only be used as a bound check to validate the length stored in an IP header. It should not be used for calculating the l4_len. This patch makes sure the small padded ethernet frames are properly processed, by trusting the length stored in an IP header. Link: https://bugs.passt.top/show_bug.cgi?id=73 Signed-off-by: Stas Sergeev <stsp2@yandex.ru> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-09-07 11:24:14 +02:00
David Gibson	955dd3251c	tcp, udp: Don't pre-fill IPv4 destination address in headers Because packets sent on the tap interface will always be going to the guest/namespace, we more-or-less know what address they'll be going to. So we pre-fill this destination address in our header buffers for IPv4. We can't do the same for IPv6 because we could need either the global or link-local address for the guest. In future we're going to want more flexibility for the destination address, so this pre-filling will get in the way. Change the flow so we always fill in the IPv4 destination address for each packet, rather than prefilling it from proto_update_l2_buf(). In fact for TCP we already redundantly filled the destination for each packet anyway. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-22 12:15:33 +02:00
David Gibson	cee4a2da48	tap: Pass source address to protocol handler functions The tap code passes the IPv4 or IPv6 destination address of packets it receives to the protocol specific code. Currently that protocol code doesn't use the source address, but we want it to in future. So, in preparation, pass the IPv4/IPv6 source address of tap packets to those functions as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-22 12:15:21 +02:00
David Gibson	673bde1f21	tap: Don't clobber source address in tap6_handler() In tap6_handler() saddr is initialized to the IPv6 source address from the incoming packet. However part way through, but before organizing the packet into a "sequence" we set it unconditionally to the guest's assigned address. We don't do anything equivalent for IPv4. This doesn't make a lot of sense: if the guest is using a different source address it makes sense to consider these different sequences of packets and we shouldn't try to combine them together. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-22 12:15:17 +02:00
Stefano Brivio	5f1fcfffe4	tap: Fix format specifier in tap4_is_fragment() warning Spotted by Coverity, relatively harmless. Fixes: `e01759e2fa` ("tap: Explicitly drop IPv4 fragments, and give a warning") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-08-16 10:46:07 +02:00
David Gibson	ae5f6c8e1b	epoll: Use different epoll types for passt and pasta tap fds Currently we have a single epoll event type for the "tap" fd, which could be either a handle on a /dev/net/tun device (pasta) or a connected Unix socket (passt). However for the two modes we call different handler functions. Simplify this a little by using different epoll types and dispatching directly to the correct handler function. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:30:20 +02:00
David Gibson	eda4f1997e	epoll: Split listening Unix domain socket into its own type tap_handler() actually handles events on three different types of object: the /dev/tap character device (pasta), a connected Unix domain socket (passt) or a listening Unix domain socket (passt). The last, in particular, really has no handling in common with the others, so split it into its own epoll type and directly dispatch to the relevant handler from the top level. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:30:17 +02:00
David Gibson	6a6735ece4	epoll: Always use epoll_ref for the epoll data variable epoll_ref contains a variety of information useful when handling epoll events on our sockets, and we place it in the epoll_event data field returned by epoll. However, for a few other things we use the 'fd' field in the standard union of types for that data field. This actually introduces a bug which is vanishingly unlikely to hit in practice, but very nasty if it ever did: theoretically if we had a very large file descriptor number for fd_tap or fd_tap_listen it could overflow into bits that overlap with the 'proto' field in epoll_ref. With some very bad luck this could mean that we mistakenly think an event on a regular socket is an event on fd_tap or fd_tap_listen. More practically, using different (but overlapping) fields of the epoll_data means we can't unify dispatch for the various different objects in the epoll. Therefore use the same epoll_ref as the data for the tap fds and the netns quit fd, adding new fd type values to describe them. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:29:53 +02:00
David Gibson	e26282b67d	tap: Fold reset handling into tap_handler_passt() We call tap_sock_reset() if tap_handler_passt() fails, or if we get an error event on the socket. Fold that logic into tap_handler() passt itself which simplifies the caller. It also makes it clearer that we had a redundant EPOLL_CTL_DEL and close() in one of the reset paths, so fix that too. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:29:49 +02:00
David Gibson	0d870c5da6	tap: Fold reset handling into tap_handler_pasta() If tap_handler_pasta() fails, we reset the connection. But in the case of pasta the "reset" is just a fatal error. Fold the die() calls directly into tap_handler_pasta() for simplicity. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:29:46 +02:00
David Gibson	548e05f76a	tap: Clean up behaviour for errors on listening Unix socket We call tap_sock_unix_new() to handle a new connection to the qemu socket if we get an EPOLLIN event on c->fd_tap_listen. If we get any other event on the fd, we'll fall through to the "tap reset" path. But that won't do anything relevant to the listening socket, it will just close the already connected socket. Furthermore, the only other event we're subscribed to for the listening socket is EPOLLRDHUP, which doesn't apply to a non connected socket. Remove EPOLLRDHUP from the subscribed events. We don't need to explicitly add EPOLLERR, because errors are always reported. There's no obvious case that would cause an error on a listening socket anyway, and it's not obvious how we'd recover, treat it as a fatal error if it ever does happen. Finally, fold all this handling into the tap_sock_unix_new() function, there's no real reason to split it between there and tap_handler(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:29:44 +02:00
David Gibson	28877b0fcd	tap: Clean up tap reset path In tap_handler() if we get an error on the tap device or socket, we use tap_sock_init() to re-initialise it. However, what we actually need for this reset case has remarkably little in common with the case where we're initialising for the first time: * Re-initialising the packet pools is unnecessary * The case of a passed in fd (--fd) isn't relevant * We don't even call this for pasta mode * We will never re-call tap_sock_unix_init() because we never clear fd_tap_listen In fact the only thing we do in tap_sock_init() relevant to the reset case is to remove the fd from the epoll and close it... which isn't used in the first initialisation case. So make a new tap_sock_reset() function just for this case, and simplify tap_sock_init() slightly as being used only for the first time case. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:29:40 +02:00
Laurent Vivier	b2bea0047d	tap: fix seq->p.count limit The number of items in pool_l4_t is defined to UIO_MAXIOV, not TAP_SEQS. TAP_SEQS is the number of the sequences. Fix the value used to compare seq->p.count with. Fixes: `37c228ada8` ("tap, tcp, udp, icmp: Cut down on some oversized buffers") Signed-off-by: Laurent Vivier <lvivier@redhat.com> [sbrivio: s/messages/sequences/ in commit message, extend initialisation of packets in pool to UIO_MAXIOV items] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-13 17:24:56 +02:00
David Gibson	0cf7bf31f6	tap: Remove unnecessary global tun_ns_fd tap_ns_tun(), which runs in an ephemeral thread puts the fd it opens into the global variable tun_ns_fd to communicate it back to the main thread in tap_sock_tun_init(). However, the only thing tap_sock_tun_init() does with it is copies it to c->fd_tap and everything else uses it from there. tap_ns_tun() already has access to the context structure, so we might as well store the value directly in there rather than having a global as an intermediate. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-04 01:18:11 +02:00
David Gibson	7bc9b66fc2	tap: More detailed error reporting in tap_ns_tun() There are several possible failure points in tap_ns_tun(), but if anything goes wrong, we just set tun_ns_fd to -1 resulting in the same error message. Add more detailed error reporting to the various failure points. At the same time, we know this is only called from tap_sock_tun_init() which will terminate pasta if we fail, so we can simplify things a little because we don't need to close() the fd on the failure paths. Link: https://bugs.passt.top/show_bug.cgi?id=69 Link: https://github.com/containers/podman/issues/19428 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-04 01:18:06 +02:00
David Gibson	6920adda0d	util: Make ns_enter() a void function and report setns() errors ns_enter() returns an integer... but it's always zero. If we actually fail the function doesn't return. Therefore it makes more sense for this to be a function returning void, and we can remove the cases where we pointlessly checked its return value. In addition ns_enter() is usually called from an ephemeral thread created by NS_CALL(). That means that the exit(EXIT_FAILURE) there usually won't be reported (since NS_CALL() doesn't wait() for the thread). So, use die() instead to print out some information in the unlikely event that our setns() here does fail. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-08-04 01:18:02 +02:00
David Gibson	e01759e2fa	tap: Explicitly drop IPv4 fragments, and give a warning We don't handle defragmentation of IP packets coming from the tap side, and we're unlikely to any time soon (with our large MTU, it's not useful for practical use cases). Currently, however, we simply ignore the fragmentation flags and treat fragments as though they were whole IP packets. This isn't ideal and can lead to rather cryptic behaviour if we do receive IP fragments. Change the code to explicitly drop fragmented packets, and print a rate limited warning if we do encounter them. Link: https://bugs.passt.top/show_bug.cgi?id=62 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-07-07 19:07:12 +02:00
Stefano Brivio	d072ac2434	tap: With pasta, don't reset on tap errors, handle write failures Since commit `0515adceaa` ("passt, pasta: Namespace-based sandboxing, defer seccomp policy application"), it makes no sense to close and reopen the tap device on error: we don't have access to /dev/net/tun after the initial setup phase. If we hit ENOBUFS while writing (as reported: in one case because the kernel actually ran out of memory, with another case under investigation), or ENOSPC, we're supposed to drop whatever data we were trying to send: there's no room for it. Handle EINTR just like we handled EAGAIN/EWOULDBLOCK: there's no particular reason why sending the same data should fail again. Anything else I can think of would be an unrecoverable error: exit with failure then. While at it, drop a useless cast on the write() call: it takes a const void * anyway. Reported-by: Gianluca Stivan <me@yawnt.com> Reported-by: Chris Kuhn <kuhnchris@kuhnchris.eu> Fixes: `0515adceaa` ("passt, pasta: Namespace-based sandboxing, defer seccomp policy application") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-06-23 10:15:10 +02:00
David Gibson	25f1d1a84f	tap: Don't update ip6.addr_seen to :: When we receive packets from the tap side, we update the addr_seen fields to reflect the last known address of the guest or ns. For ip4.addr_seen we, sensibly, only update if the address we've just seen isn't 0 (0.0.0.0). This case can occur during early DHCP transactions. We have no equivalent case for IPv6. We're less likely to hit this, because DHCPv6 uses link-local addresses, however we can see an source address of :: with certain multicast operations. This can bite us if we try to make an incoming connection very early after starting pasta with --config-net: we may have only seen some of those multicast packets, updated addr_seen to :: and not had any "real" packets to update it to a global address. I've seen this with some of the avocado test conversions. In any case, it can never make sense to update addr_seen to ::, so explicitly exclude that case. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-05-17 18:50:34 +02:00
Stefano Brivio	ca2749e1bd	passt: Relicense to GPL 2.0, or any later version In practical terms, passt doesn't benefit from the additional protection offered by the AGPL over the GPL, because it's not suitable to be executed over a computer network. Further, restricting the distribution under the version 3 of the GPL wouldn't provide any practical advantage either, as long as the passt codebase is concerned, and might cause unnecessary compatibility dilemmas. Change licensing terms to the GNU General Public License Version 2, or any later version, with written permission from all current and past contributors, namely: myself, David Gibson, Laine Stump, Andrea Bolognani, Paul Holzinger, Richard W.M. Jones, Chris Kuhn, Florian Weimer, Giuseppe Scrivano, Stefan Hajnoczi, and Vasiliy Ulyanov. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-04-06 18:00:33 +02:00
David Gibson	34ade90957	Work around weird false positives with cppcheck-2.9.1 Commit `89e38f55` "treewide: Fix header includes to build with musl" added extra #includes to work with musl. Unfortunately with the cppcheck version I'm using (cppcheck-2.9-1.fc37.x86_64 in Fedora 37) this causes weird false positives: specifically cppcheck seems to hit a #error in <bits/unistd.h> complaining about including it directly instead of via <unistd.h> (which is not something we're doing). I have no idea why that would be happening; but I'm guessing it has to be a bug in the cpp implementation in that cppcheck version. In any case, it's possible to work around this by moving the include of <unistd.h> before the include of <signal.h>. So, do that. Fixes: `89e38f5540` ("treewide: Fix header includes to build with musl") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-03-21 16:38:06 +01:00
Chris Kuhn	89e38f5540	treewide: Fix header includes to build with musl Roughly inspired from a patch by Chris Kuhn: fix up includes so that we can build against musl: glibc is more lenient as headers generally include a larger amount of other headers. Compared to the original patch, I only included what was needed directly in C files, instead of adding blanket includes in local header files. It's a bit more involved, but more consistent with the current (not ideal) situation. Reported-by: Chris Kuhn <kuhnchris+github@kuhnchris.eu> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-03-09 03:44:21 +01:00
Stefano Brivio	d2df763232	log, conf, tap: Define die() as err() plus exit(), drop cppcheck workarounds If we define die() as a variadic macro, passing __VA_ARGS__ to err(), and calling exit() outside err() itself, we can drop the workarounds introduced in commit `36f0199f6e` ("conf, tap: Silence two false positive invalidFunctionArg from cppcheck"). Suggested-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-02-27 18:55:57 +01:00
Stefano Brivio	36f0199f6e	conf, tap: Silence two false positive invalidFunctionArg from cppcheck The newly introduced die() calls exit(), but cppcheck doesn't see it and warns about possibly invalid arguments used after the check which triggers die(). Add return statements to silence the warnings. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-02-16 19:19:23 +01:00
David Gibson	42bfd212b1	tap: Eliminate goto from tap_handler() The goto here really doesn't improve clarity or brevity at all. Use a clearer construct. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-02-16 18:56:31 +01:00
David Gibson	b62ed9ca0e	tap: Don't pcap frames that didn't get sent In tap_send_frames() we send a number of frames to the tap device, then also write them to the pcap capture file (if configured). However the tap send can partially fail (short write()s or similar), meaning that some of the requested frames weren't actually sent, but we still write those frames to the capture file. We do give a debug message in this case, but it's misleading to add frames that we know weren't sent to the capture file. Rework to avoid this. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-02-16 18:56:20 +01:00
Stefano Brivio	7564b58a7f	tap: Use single counter for iov elements in tap_send_frames_pasta() David points out that using multiple counters to go over the iov array, namely 'i' and 'iov', makes mistakes easier. We can't just use 'iov', unless we reserve an element with zero iov_len at the end, which isn't really justified. Simply use 'i' to iterate over the array. Link: https://archives.passt.top/passt-dev/Y+mfenvLn3VJ7Dg5@yekko/ Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2023-02-16 17:33:54 +01:00
Laine Stump	c9af6f92db	convert all remaining err() followed by exit() to die() This actually leaves us with 0 uses of err(), but someone could want to use it in the future, so we may as well leave it around. Signed-off-by: Laine Stump <laine@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-02-16 17:32:27 +01:00
Stefano Brivio	ac153595c0	tap: Send frames after the first one in tap_send_frames_pasta() ...instead of repeatedly sending out the first one in iov. Fixes: `e21ee41ac3` ("tcp: Combine two parts of pasta tap send path together") Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-02-12 14:34:38 +01:00
David Gibson	2d553b587a	tap: Improve handling of partial frame sends In passt mode, when writing frames to the qemu socket, we might get a short send. If we ignored this and carried on, the qemu socket would get out of sync, because the bytes we actually sent wouldn't correspond to the length header we already sent. tap_send_frames_passt() handles that by doing a a blocking send to complete the message, but it has a few flaws: * We only attempt to resend once: although it's unlikely in practice, nothing prevents the blocking send() from also being short * We print a debug error if send() returns non-zero.. but send() returns the number of bytes sent, so we actually want it to return the length of the remaining data. Correct those flaws and also be a bit more thorough about reporting problems here. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:55:01 +01:00
David Gibson	0fb7b2b908	tap: Use different io vector bases depending on tap type Currently tap_send_frames() expects the frames it is given to include the vnet_len field, even in pasta mode which doesn't use it (although it need not be initialized in that case). To match, tap_iov_base() and tap_iov_len() construct the frame in that way. This will inconvenience future changes, so alter things to set the buffers to include just the frame needed by the tap backend type. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:54:57 +01:00
David Gibson	4b3d38a069	tap: Add "tap headers" abstraction Currently both the TCP and UDP code need to deal in various places with the details of the L2 headers, and also the tap-specific "vnet_len" header. This makes abstracting the tap interface to new backends (e.g. vhost-user or tun) more difficult. To improve this abstraction, create a new 'tap_hdr' structure which represents both L2 (always Ethernet at the moment, but might be vary in future) and any additional tap specific headers (such as the qemu socket's vnet_len field). Provide helper functions and macros to initialize, update and use it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:54:52 +01:00
David Gibson	6d011c1faa	tap, tcp: Move tap send path to tap.c The functions which do the final steps of sending TCP packets on through the tap interface - tcp_l2_buf_flush*() - no longer have anything that's actually specific to TCP in them, other than comments and names. Move them all to tap.c. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2023-01-23 18:54:40 +01:00
Richard W.M. Jones	190169c544	passt, tap: Process data on the socket before HUP/ERR events In the case where the client writes a packet and then closes the socket, because we receive EPOLLIN\|EPOLLRDHUP together we have a choice of whether to close the socket immediately, or read the packet and then close the socket. Choose the latter. This should improve fuzzing coverage and arguably is a better choice even for regular use since dropping packets on close is bad. See-also: https://archives.passt.top/passt-dev/20221117171805.3746f53a@elisabeth/ Signed-off-by: Richard W.M. Jones <rjones@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-11-25 01:40:57 +01:00
Richard W.M. Jones	6b4e68383c	passt, tap: Add --fd option This passes a fully connected stream socket to passt. Signed-off-by: Richard W.M. Jones <rjones@redhat.com> [sbrivio: reuse fd_tap instead of adding a new descriptor, imply --one-off on --fd, add to optstring and usage()] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-11-25 01:40:47 +01:00
David Gibson	698e4fd761	style: Minor corrections to function comments Some style issues and a typo. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-11-25 01:34:26 +01:00
Stefano Brivio	25dab96205	tap: Revert recently added checks in tap_handler_passt() This reverts commit `198f87835d` ("tap: Return -EIO from tap_handler_passt() on inconsistent packet stream") and commit `510dace86c` ("tap: Keep stream consistent if qemu length descriptor spans two recv() calls"). I can hit occasional failures in perf/passt_tcp tests where we seem to be getting excess data at the end of a recv(), and for some reason I couldn't figure out yet, if we just ignore it, subsequent recv() calls from qemu return correct data. If we close the connection, qemu can't talk to us anymore, of course. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-11-16 15:11:22 +01:00
Stefano Brivio	b27d6d121c	arp, tap, util: Don't use perror() after seccomp filter is installed If stderr is closed, after we fork to background, glibc's implementation of perror() will try to re-open it by calling dup(), upon which the seccomp filter causes the process to terminate, because dup() is not included in the list of allowed syscalls. Replace perror() calls that might happen after isolation_postfork(). We could probably replace all of them, but early ones need a bit more attention as we have to check whether log.c functions work in early stages. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2022-11-16 15:11:13 +01:00
Stefano Brivio	198f87835d	tap: Return -EIO from tap_handler_passt() on inconsistent packet stream While it's important to fail in that case, it makes little sense to fail quietly: it's better to tell qemu explicitly that something went wrong and that we won't recover, by closing the socket. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-11-10 11:17:50 +01:00
Stefano Brivio	510dace86c	tap: Keep stream consistent if qemu length descriptor spans two recv() calls I got all paranoid after triggering a divide-by-zero general protection fault in passt with a qemu version without the virtio_net TX hang fix, while flooding UDP. I start thinking this was actually coming from some random changes I was playing with, but before reaching this conclusion I reviewed once more the relatively short path in tap_handler_passt() before we start using packet_*() functions, and found this. Never observed in practice, but artificially reproduced with changes in qemu's socket interface: if we don't receive from qemu a complete length descriptor in one recv() call, or if we receive a partial one at the end of one call, we currently disregard the rest, which would make the stream inconsistent. Nothing really bad happens, except that from that point on we would disregard all the packets we get until, if ever, we get the stream back in sync by chance. Force reading a complete packet length descriptor with a blocking recv(), if needed -- not just a complete packet later. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-11-10 11:17:50 +01:00
Stefano Brivio	11efaefa1e	passt, qrap, README: Update notes and documentation for AF_UNIX support in qemu We can't get rid of qrap quite yet, but at least we should start telling users it's not going to be needed anymore starting from qemu 7.2. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-11-04 12:04:32 +01:00
David Gibson	f7653a1446	Use endian-safer typing in struct tap4_l4_t We recently converted to using struct in_addr rather than bare in_addr_t or uint32_t to represent IPv4 addresses in network order. This makes it harder forget to apply the correct endian conversions. We omitted the IPv4 addresses stored in struct tap4_l4_t, however. Convert those as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-11-04 12:04:26 +01:00
David Gibson	7c7b68dbe0	Use typing to reduce chances of IPv4 endianness errors We recently corrected some errors handling the endianness of IPv4 addresses. These are very easy errors to make since although we mostly store them in network endianness, we sometimes need to manipulate them in host endianness. To reduce the chances of making such mistakes again, change to always using a (struct in_addr) instead of a bare in_addr_t or uint32_t to store network endian addresses. This makes it harder to accidentally do arithmetic or comparisons on such addresses as if they were host endian. We introduce a number of IN4_IS_ADDR_*() helpers to make it easier to directly work with struct in_addr values. This has the additional benefit of making the IPv4 and IPv6 paths more visually similar. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-11-04 12:04:24 +01:00
Stefano Brivio	2d4468ebb7	tap: Support for detection of existing sockets on ramfs On ramfs, connecting to a non-existent UNIX domain socket yields EACCESS, instead of ENOENT. This is visible if we use passt directly on rootfs (a ramfs instance) from an initramfs image. It's probably wrong for ramfs to return EACCES, but given the simplicity of the filesystem, I doubt we should try to fix it there at the possible cost of added complexity. Also, this whole beauty should go away once qrap-less usage is established, so just accept EACCES as indication that a conflicting socket does not, in fact, exist. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2022-11-04 12:01:09 +01:00
Stefano Brivio	947d756747	tap: Trace received (outbound) ICMP packets in debug mode, too This only worked for ICMPv6: ICMP packets have no TCP-style header, so they are handled as a special case before packet sequences are formed, and the call to tap_packet_debug() was missing. Fixes: `bb70811183` ("treewide: Packet abstraction with mandatory boundary checks") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2022-10-27 00:18:16 +02:00
David Gibson	c6845f60a0	dhcp: Use tap_udp4_send() helper in dhcp() The IPv4 specific dhcp() manually constructs L2 and IP headers to send its DHCP reply packet, unlike its IPv6 equivalent in dhcpv6.c which uses the tap_udp6_send() helper. Now that we've broaded the parameters to tap_udp4_send() we can use it in dhcp() to avoid some duplicated logic. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-19 03:35:00 +02:00
David Gibson	2dbc622f54	tap: Split tap_ip4_send() into UDP and ICMP variants tap_ip4_send() has special case logic to compute the checksums for UDP and ICMP packets, which is a mild layering violation. By using a suitable helper we can split it into tap_udp4_send() and tap_icmp4_send() functions without greatly increasing the code size, this removing that layering violation. We make some small changes to the interface while there. In both cases we make the destination IPv4 address a parameter, which will be useful later. For the UDP variant we make it take just the UDP payload, and it will generate the UDP header. For the ICMP variant we pass in the ICMP header as before. The inconsistency is because that's what seems to be the more natural way to invoke the function in the callers in each case. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-19 03:34:56 +02:00
David Gibson	cb1edae3b5	ndp: Remove unneeded eh_source parameter ndp() takes a parameter giving the ethernet source address of the packet it is to respond to, which it uses to determine the destination address to send the reply packet to. This is not necessary, because the address will always be the guest's MAC address. Even if the guest has just changed MAC address, then either tap_handler_passt() or tap_handler_pasta() - which are the only call paths leading to ndp() will have updated c->mac_guest with the new value. So, remove the parameter, and just use c->mac_guest, making it more consistent with other paths where we construct packets to send inwards. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-19 03:34:51 +02:00
David Gibson	9d8dd8b6f4	tap: Split tap_ip6_send() into UDP and ICMP variants tap_ip6_send() has special case logic to compute the checksums for UDP and ICMP packets, which is a mild layering violation. By using a suitable helper we can split it into tap_udp6_send() and tap_icmp6_send() functions without greatly increasing the code size, this removing that layering violation. We make some small changes to the interface while there. In both cases we make the destination IPv6 address a parameter, which will be useful later. For the UDP variant we make it take just the UDP payload, and it will generate the UDP header. For the ICMP variant we pass in the ICMP header as before. The inconsistency is because that's what seems to be the more natural way to invoke the function in the callers in each case. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-19 03:34:48 +02:00
David Gibson	f616ca231e	Split tap_ip_send() into IPv4 and IPv6 specific functions The IPv4 and IPv6 paths in tap_ip_send() have very little in common, and it turns out that every caller (statically) knows if it is using IPv4 or IPv6. So split into separate tap_ip4_send() and tap_ip6_send() functions. Use a new tap_l2_hdr() function for the very small common part. While we're there, make some minor cleanups: - We were double writing some fields in the IPv6 header, so that it temporary matched the pseudo-header for checksum calculation. With recent checksum reworks, this isn't neccessary any more. - We don't use any IPv4 header options, so use some sizeof() constructs instead of some open coded values for header length. - The comment used to say that the flow label was for TCP over IPv6, but in fact the only thing we used it for was DHCPv6 over UDP traffic Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-19 03:34:45 +02:00
David Gibson	fb5d1c5d7d	tap: Remove unhelpeful vnet_pre optimization from tap_send() Callers of tap_send() can optionally use a small optimization by adding extra space for the 4 byte length header used on the qemu socket interface. tap_ip_send() is currently the only user of this, but this is used only for "slow path" ICMP and DHCP packets, so there's not a lot of value to the optimization. Worse, having the two paths here complicates the interface and makes future cleanups difficult, so just remove it. I have some plans to bring back the optimization in a more general way in future, but for now it's just in the way. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-19 03:34:43 +02:00
David Gibson	f72b63e92f	Remove support for TCP packets from tap_ip_send() tap_ip_send() is never used for TCP packets, we're unlikely to use it for that in future, and the handling of TCP packets makes other cleanups unnecessarily awkward. Remove it. This is the only user of csum_tcp4(), so we can remove that as well. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-19 03:34:40 +02:00
David Gibson	a2eb2d310a	Add helpers for normal inbound packet destination addresses tap_ip_send() doesn't take a destination address, because it's specifically for inbound packets, and the IP addresses of the guest/namespace are already known to us. Rather than open-coding this destination address logic, make helper functions for it which will enable some later cleanups. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-19 03:34:38 +02:00
David Gibson	3d8ccb44a6	Add csum_ip4_header() helper to calculate IPv4 header checksums We calculate IPv4 header checksums in at least two places, in dhcp() and in tap_ip_send. Add a helper to handle this calculation in both places. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-19 03:34:34 +02:00
David Gibson	bd4be308fc	Add csum_udp4() helper for calculating UDP over IPv4 checksums At least two places in passt fill in UDP over IPv4 checksums, although since UDP checksums are optional with IPv4 that just amounts to storing a 0 (in tap_ip_send()) or leaving a 0 from an earlier initialization (in dhcp()). For consistency, add a helper for this "calculation". Just for the heck of it, add the option (compile time disabled for now) to calculate real UDP checksums. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-19 03:34:32 +02:00
David Gibson	6905ac75ec	Add csum_udp6() helper for calculating UDP over IPv6 checksums Add a helper for calculating UDP checksums when used over IPv6 For future flexibility, the new helper takes parameters for the fields in the IPv6 pseudo-header, so an IPv6 header or pseudo-header doesn't need to be explicitly constructed. It also allows the UDP header and payload to be in separate buffers, although we don't use this yet. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-19 03:34:29 +02:00
David Gibson	67ab617172	Add csum_icmp4() helper for calculating ICMP checksums Although tap_ip_send() is currently the only place calculating ICMP checksums, create a helper function for symmetry with ICMPv6. For future flexibility it allows the ICMPv6 header and payload to be in separate buffers. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-19 03:34:26 +02:00
David Gibson	7abd2b0d72	Add csum_icmp6() helper for calculating ICMPv6 checksums At least two places in passt calculate ICMPv6 checksums, ndp() and tap_ip_send(). Add a helper to handle this calculation in both places. For future flexibility, the new helper takes parameters for the fields in the IPv6 pseudo-header, so an IPv6 header or pseudo-header doesn't need to be explicitly constructed. It also allows the ICMPv6 header and payload to be in separate buffers, although we don't use this yet. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-19 03:34:21 +02:00
Stefano Brivio	3e2eb4337b	conf: Bind inbound ports with CAP_NET_BIND_SERVICE before isolate_user() Even if CAP_NET_BIND_SERVICE is granted, we'll lose the capability in the target user namespace as we isolate the process, which means we're unable to bind to low ports at that point. Bind inbound ports, and only those, before isolate_user(). Keep the handling of outbound ports (for pasta mode only) after the setup of the namespace, because that's where we'll bind them. To this end, initialise the netlink socket for the init namespace before isolate_user() as well, as we actually need to know the addresses of the upstream interface before binding ports, in case they're not explicitly passed by the user. As we now call nl_sock_init() twice, checking its return code from conf() twice looks a bit heavy: make it exit(), instead, as we can't do much if we don't have netlink sockets. While at it: - move the v4_only && v6_only options check just after the first option processing loop, as this is more strictly related to option parsing proper - update the man page, explaining that CAP_NET_BIND_SERVICE is not the preferred way to bind ports, because passt and pasta can be abused to allow other processes to make effective usage of it. Add a note about the recommended sysctl instead - simplify nl_sock_init_do() now that it's called once for each case Reported-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-15 02:10:36 +02:00
Stefano Brivio	2970dc257c	tap: Don't check sequence counts when adding packets to pool This is a minor optimisation possibility I spotted while trying to debug a hang in tap4_handler(): if we run out of space for packet sequences, it's fine to add packets to an existing per-sequence pool. We should check the count of packet sequences only once we realise that we actually need a new packet sequence. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-10-15 02:10:36 +02:00
Stefano Brivio	a62ed181db	conf, tap: Add option to quit once the client closes the connection This is practical to avoid explicit lifecycle management in users, e.g. libvirtd, and is trivial to implement. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2022-10-15 02:10:36 +02:00
Stefano Brivio	da152331cf	Move logging functions to a new file, log.c Logging to file is going to add some further complexity that we don't want to squeeze into util.c. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2022-10-14 17:38:25 +02:00
David Gibson	eb5e123038	cppcheck: Reduce scope of some variables Minor style improvement suggested by cppcheck. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-09-29 12:22:01 +02:00
Stefano Brivio	d30bde3181	tap: Check return value of accept4() before calling getsockopt() Reported by Coverity (CWE-119): Negative value used as argument to a function expecting a positive value (for example, size of buffer or allocation) and harmless, because getsockopt() would return -EBADF if the socket is -1, so we wouldn't print anything. Check if accept4() returns a valid socket before calling getsockopt() on it. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2022-09-22 16:54:01 +02:00
David Gibson	16f5586bb8	Make substructures for IPv4 and IPv6 specific context information The context structure contains a batch of fields specific to IPv4 and to IPv6 connectivity. Split those out into a sub-structure. This allows the conf_ip4() and conf_ip6() functions, which take the entire context but touch very little of it, to be given more specific parameters, making it clearer what it affects without stepping through the code. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2022-07-30 22:14:07 +02:00
David Gibson	5e12d23acb	Separate IPv4 and IPv6 configuration After recent changes, conf_ip() now has essentially entirely disjoint paths for IPv4 and IPv6 configuration. So, it's cleaner to split them out into different functions conf_ip4() and conf_ip6(). Splitting these out also lets us make the interface a bit nicer, having them return success or failure directly, rather than manipulating c->v4 and c->v6 to indicate success/failure of the two versions. Since these functions may also initialize the interface index for each protocol, it turns out we can then drop c->v4 and c->v6 entirely, replacing tests on those with tests on whether c->ifi4 or c->ifi6 is non-zero (since a 0 interface index is never valid). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [sbrivio: Whitespace fixes] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-07-30 22:12:50 +02:00
David Gibson	4bc883aeab	Move passt mac_guest init to be more symmetric with pasta In pasta mode, the guest's MAC address is set up in pasta_ns_cobf() called from tap_sock_tun_init(). If we have a guest MAC configured with --ns-mac-addr, this will set the given MAC on the kernel tuntap device, or if we haven't configured one it will update our record of the guest MAC to the kernel assigned one from the device. For passt, we don't initially know the guest's MAC until we receive packets from it, so we have to initially use a broadcast address. This is - oddly - set up in an entirely different place, in conf_ip() conditional on the mode. Move it to the logically matching place for passt - tap_sock_unix_init(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au>	2022-07-30 21:57:50 +02:00
Stefano Brivio	7af0bae4ce	tap: Add informational messages for UNIX domain socket connections ...namely, as connections are discarded or accepted. This was quite useful to debug an issue with libvirtd failing to start qemu (because passt refused the new connection) as a previous qemu instance was still active. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-06-18 09:06:00 +02:00
Stefano Brivio	ceddcac74a	conf, tap: False "Buffer not null terminated" positives, CWE-170 Those strings are actually guaranteed to be NULL-terminated. Reported by Coverity. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-04-07 11:44:35 +02:00
Stefano Brivio	eb3d3f367e	treewide: Argument cannot be negative, CWE-687 Actually harmless. Reported by Coverity. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-04-07 11:44:35 +02:00
Stefano Brivio	48bc843d6e	tap: Resource leak, CWE-404 Reported by Coverity. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-04-07 11:44:35 +02:00
Stefano Brivio	22ed4467a4	treewide: Unchecked return value from library, CWE-252 All instances were harmless, but it might be useful to have some debug messages here and there. Reported by Coverity. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-04-07 11:44:35 +02:00
Stefano Brivio	8fd20ad99d	tap: Re-read from tap in tap_handler_pasta() on buffer full read() will return zero if we pass a zero length, which makes no sense: instead, track explicitly that we exhausted the buffer, flush packets to handlers and redo. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-03-30 05:50:17 +02:00
Stefano Brivio	8d85b6a99e	tap: Allow ioctl() and openat() for tap_ns_tun() re-initialisation If the tun interface disappears, we'll call tap_ns_tun() after the seccomp profile is applied: add ioctl() and openat() to it. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-03-30 05:49:46 +02:00
Stefano Brivio	37c228ada8	tap, tcp, udp, icmp: Cut down on some oversized buffers The existing sizes provide no measurable differences in throughput and packet rates at this point. They were probably needed as batched implementations were not complete, but they can be decreased quite a bit now. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-03-29 15:35:38 +02:00
Stefano Brivio	62c3edd957	treewide: Fix android-cloexec-* clang-tidy warnings, re-enable checks Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-03-29 15:35:38 +02:00
Stefano Brivio	48582bf47f	treewide: Mark constant references as const Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-03-29 15:35:38 +02:00
Stefano Brivio	bb70811183	treewide: Packet abstraction with mandatory boundary checks Implement a packet abstraction providing boundary and size checks based on packet descriptors: packets stored in a buffer can be queued into a pool (without storage of its own), and data can be retrieved referring to an index in the pool, specifying offset and length. Checks ensure data is not read outside the boundaries of buffer and descriptors, and that packets added to a pool are within the buffer range with valid offset and indices. This implies a wider rework: usage of the "queueing" part of the abstraction mostly affects tap_handler_{passt,pasta}() functions and their callees, while the "fetching" part affects all the guest or tap facing implementations: TCP, UDP, ICMP, ARP, NDP, DHCP and DHCPv6 handlers. Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-03-29 15:35:38 +02:00
Stefano Brivio	be5bbb9b06	tcp: Rework timers to use timerfd instead of periodic bitmap scan With a lot of concurrent connections, the bitmap scan approach is not really sustainable. Switch to per-connection timerfd timers, set based on events and on two new flags, ACK_FROM_TAP_DUE and ACK_TO_TAP_DUE. Timers are added to the common epoll list, and implement the existing timeouts. While at it, drop the CONN_ prefix from flag names, otherwise they get quite long, and fix the logic to decide if a connection has a local, possibly unreachable endpoint: we shouldn't go through the rest of tcp_conn_from_tap() if we reset the connection due to a successful bind(2), and we'll get EACCES if the port number is low. Suggested by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-03-29 15:35:38 +02:00
Stefano Brivio	5ca555cf78	dhcpv6, tap, tcp: Use IN6_ARE_ADDR_EQUAL instead of open-coded memcmp() Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-03-28 17:11:40 +02:00
Stefano Brivio	d2e40bb8d9	conf, util, tap: Implement --trace option for extra verbose logging --debug can be a bit too noisy, especially as single packets or socket messages are logged: implement a new option, --trace, implying --debug, that enables all debug messages. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-03-25 13:21:13 +01:00
Stefano Brivio	bd7340e815	tap: Cast ETH_MAX_MTU to signed in comparisons At least gcc 8.3 and 10.2 emit a warning on armv6l and armv7l. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-02-26 23:32:50 +01:00
Stefano Brivio	5e0c75d609	passt: Drop PASST_LEGACY_NO_OPTIONS sections ...nobody uses those builds anymore. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-02-22 18:42:51 +01:00
Stefano Brivio	0515adceaa	passt, pasta: Namespace-based sandboxing, defer seccomp policy application To reach (at least) a conceptually equivalent security level as implemented by --enable-sandbox in slirp4netns, we need to create a new mount namespace and pivot_root() into a new (empty) mountpoint, so that passt and pasta can't access any filesystem resource after initialisation. While at it, also detach IPC, PID (only for passt, to prevent vulnerabilities based on the knowledge of a target PID), and UTS namespaces. With this approach, if we apply the seccomp filters right after the configuration step, the number of allowed syscalls grows further. To prevent this, defer the application of seccomp policies after the initialisation phase, before the main loop, that's where we expect bad things to happen, potentially. This way, we get back to 22 allowed syscalls for passt and 34 for pasta, on x86_64. While at it, move #syscalls notes to specific code paths wherever it conceptually makes sense. We have to open all the file handles we'll ever need before sandboxing: - the packet capture file can only be opened once, drop instance numbers from the default path and use the (pre-sandbox) PID instead - /proc/net/tcp{,v6} and /proc/net/udp{,v6}, for automatic detection of bound ports in pasta mode, are now opened only once, before sandboxing, and their handles are stored in the execution context - the UNIX domain socket for passt is also bound only once, before sandboxing: to reject clients after the first one, instead of closing the listening socket, keep it open, accept and immediately discard new connection if we already have a valid one Clarify the (unchanged) behaviour for --netns-only in the man page. To actually make passt and pasta processes run in a separate PID namespace, we need to unshare(CLONE_NEWPID) before forking to background (if configured to do so). Introduce a small daemon() implementation, __daemon(), that additionally saves the PID file before forking. While running in foreground, the process itself can't move to a new PID namespace (a process can't change the notion of its own PID): mention that in the man page. For some reason, fork() in a detached PID namespace causes SIGTERM and SIGQUIT to be ignored, even if the handler is still reported as SIG_DFL: add a signal handler that just exits. We can now drop most of the pasta_child_handler() implementation, that took care of terminating all processes running in the same namespace, if pasta started a shell: the shell itself is now the init process in that namespace, and all children will terminate once the init process exits. Issuing 'echo $$' in a detached PID namespace won't return the actual namespace PID as seen from the init namespace: adapt demo and test setup scripts to reflect that. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-02-21 13:41:13 +01:00
Stefano Brivio	292c185553	passt: Address new clang-tidy warnings from LLVM 13.0.1 clang-tidy from LLVM 13.0.1 reports some new warnings from these checkers: - altera-unroll-loops, altera-id-dependent-backward-branch: ignore for the moment being, add a TODO item - bugprone-easily-swappable-parameters: ignore, nothing to do about those - readability-function-cognitive-complexity: ignore for the moment being, add a TODO item - altera-struct-pack-align: ignore, alignment is forced in protocol headers - concurrency-mt-unsafe: ignore for the moment being, add a TODO item Fix bugprone-implicit-widening-of-multiplication-result warnings, though, that's doable and they seem to make sense. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-01-30 02:59:12 +01:00
Stefano Brivio	34e6429235	passt, tap: Daemonise once socket is ready without waiting for connection The existing behaviour is not really practical: an automated agent in charge of starting both qemu and passt would need to fork itself to start passt, because passt won't fork to background until qemu connects, and the agent needs to unblock to start qemu. Instead of waiting for a connection to daemonise, do it right away as soon as a socket is available: that can be considered an initialised state already. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-01-28 18:51:50 +01:00
Stefano Brivio	33b1bdd079	seccomp: Add a number of alternate and per-arch syscalls Depending on the C library, but not necessarily in all the functions we use, statx() might be used instead of stat(), getdents() instead of getdents64(), readlinkat() instead of readlink(), openat() instead of open(). On aarch64, it's clone() and not fork(), and dup3() instead of dup2() -- just allow the existing alternative instead of dealing with per-arch selections. Since glibc commit 9a7565403758 ("posix: Consolidate fork implementation"), we need to allow set_robust_list() for fork()/clone(), even in a single-threaded context. On some architectures, epoll_pwait() is provided instead of epoll_wait(), but never both. Same with newfstat() and fstat(), sigreturn() and rt_sigreturn(), getdents64() and getdents(), readlink() and readlinkat(), unlink() and unlinkat(), whereas pipe() might not be available, but pipe2() always is, exclusively or not. Seen on Fedora 34: newfstatat() is used on top of fstat(). syslog() is an actual system call on some glibc/arch combinations, instead of a connect()/send() implementation. On ppc64 and ppc64le, _llseek(), recv(), send() and getuid() are used. For ppc64 only: ugetrlimit() for the getrlimit() implementation, plus sigreturn() and fcntl64(). On s390x, additionally, we need to allow socketcall() (on top of socket()), and sigreturn() also for passt (not just for pasta). Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-01-26 16:30:59 +01:00
Stefano Brivio	b93c2c1713	passt: Drop <linux/ipv6.h> include, carry own ipv6hdr and opt_hdr definitions This is the only remaining Linux-specific include -- drop it to avoid clang-tidy warnings and to make code more portable. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-01-26 07:57:09 +01:00
Stefano Brivio	f6d9787d30	tap, tcp: Fix two comparisons with different signedness reported by gcc 7 For some reason, those are not reported by recent versions of gcc. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2022-01-26 07:57:09 +01:00
Stefano Brivio	627e18fa8a	passt: Add cppcheck target, test, and address resulting warnings ...mostly false positives, but a number of very relevant ones too, in tcp_get_sndbuf(), tcp_conn_from_tap(), and siphash PREAMBLE(). Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-10-21 09:41:13 +02:00
Stefano Brivio	dd942eaa48	passt: Fix build with gcc 7, use std=c99, enable some more Clang checkers Unions and structs, you all have names now. Take the chance to enable bugprone-reserved-identifier, cert-dcl37-c, and cert-dcl51-cpp checkers in clang-tidy. Provide a ffsl() weak declaration using gcc built-in. Start reordering includes, but that's not enough for the llvm-include-order checker yet. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-10-21 04:26:08 +02:00
Stefano Brivio	66d5930ec7	passt, pasta: Add seccomp support List of allowed syscalls comes from comments in the form: #syscalls <list> for syscalls needed both in passt and pasta mode, and: #syscalls:pasta <list> #syscalls:passt <list> for syscalls specifically needed in pasta or passt mode only. seccomp.sh builds a list of BPF statements from those comments, prefixed by a binary search tree to keep lookup fast. While at it, clean up a bit the Makefile using wildcards. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-10-14 13:15:46 +02:00
Stefano Brivio	675174d4ba	conf, tap: Split netlink and pasta functions, allow interface configuration Move netlink routines to their own file, and use netlink to configure or fetch all the information we need, except for the TUNSETIFF ioctl. Move pasta-specific functions to their own file as well, add parameters and calls to configure the tap interface in the namespace. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-10-14 13:15:12 +02:00
Giuseppe Scrivano	9a175cc2ce	pasta: Allow specifying paths and names of namespaces Based on a patch from Giuseppe Scrivano, this adds the ability to: - specify paths and names of target namespaces to join, instead of a PID, also for user namespaces, with --userns - request to join or create a network namespace only, without entering or creating a user namespace, with --netns-only - specify the base directory for netns mountpoints, with --nsrun-dir Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> [sbrivio: reworked logic to actually join the given namespaces when they're not created, implemented --netns-only and --nsrun-dir, updated pasta demo script and man page] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-10-07 04:05:15 +02:00
Stefano Brivio	d4d61480b6	tcp, tap: Turn tcp_probe_mem() into sock_probe_mem(), use for AF_UNIX socket too Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-10-05 20:02:03 +02:00

1 2 3 4

168 Commits