passt

mirror of https://passt.top/passt synced 2024-12-22 13:45:32 +00:00

Author	SHA1	Message	Date
Stefano Brivio	66d5930ec7	passt, pasta: Add seccomp support List of allowed syscalls comes from comments in the form: #syscalls <list> for syscalls needed both in passt and pasta mode, and: #syscalls:pasta <list> #syscalls:passt <list> for syscalls specifically needed in pasta or passt mode only. seccomp.sh builds a list of BPF statements from those comments, prefixed by a binary search tree to keep lookup fast. While at it, clean up a bit the Makefile using wildcards. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-10-14 13:15:46 +02:00
Giuseppe Scrivano	9a175cc2ce	pasta: Allow specifying paths and names of namespaces Based on a patch from Giuseppe Scrivano, this adds the ability to: - specify paths and names of target namespaces to join, instead of a PID, also for user namespaces, with --userns - request to join or create a network namespace only, without entering or creating a user namespace, with --netns-only - specify the base directory for netns mountpoints, with --nsrun-dir Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> [sbrivio: reworked logic to actually join the given namespaces when they're not created, implemented --netns-only and --nsrun-dir, updated pasta demo script and man page] Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-10-07 04:05:15 +02:00
Stefano Brivio	dd581730e5	tap: Completely de-serialise input message batches Until now, messages would be passed to protocol handlers in a single batch only if they happened to be dequeued in a row. Packets interleaved between different connections would result in multiple calls to the same protocol handler for a single connection. Instead, keep track of incoming packet descriptors, arrange them in sequences, and call protocol handlers only as we completely sorted input messages in batches. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-27 01:28:02 +02:00
Stefano Brivio	ec0bdc10b1	udp: Switch to new socket message after 32KiB instead of 64KiB For some reason, this measurably improves performance with qemu and virtio-net. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-27 01:28:02 +02:00
Stefano Brivio	c2d86b7475	udp: Decrease UDP_TAP_FRAMES to 16 Similarly to the decrease in TCP_TAP_FRAMES, this improves fairness, with a very small impact on performance. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-27 01:28:02 +02:00
Stefano Brivio	3994fc8f58	udp: Reset iov_base after sending partial message on sendmmsg() failure We set the length while processing messages, but the starting address is pre-initialised. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-09 15:40:04 +02:00
Stefano Brivio	ecf1f97564	udp: Fix comparison of seen IPv4 address for local connections c->addr4_seen is stored in network order. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-09 15:40:04 +02:00
Stefano Brivio	8e9333616a	udp: Fix retry mechanism on partial sendmmsg() Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-09 15:40:04 +02:00
Stefano Brivio	647a413794	tcp, udp: Restore usage of gateway for guest to connect to local host This went lost in a recent rework: if the guest wants to connect directly to the host, it can use the address of the default gateway. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-01 17:00:27 +02:00
Stefano Brivio	77d4efa236	udp: Handle partial failure in sendmmsg() to UNIX domain socket Similarly to the handling introduced by commit "tcp: Proper error handling for sendmmsg() to UNIX domain socket" for TCP, we need to deal with partial sendmmsg() failures for UDP as well. Here, we can lose messages, but we need to make sure that the last message is delivered completely, otherwise qemu will fail to reassemble further packets. For UDP, this is somewhat complicated by the fact that one message might include multiple datagrams, and we need to respect message boundaries: go through headers, and calculate what we need to re-send, if anything. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-01 17:00:27 +02:00
Stefano Brivio	1e49d194d0	passt, pasta: Introduce command-line options and port re-mapping Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-09-01 17:00:27 +02:00
Stefano Brivio	8af961b85b	tcp, udp: Map source address to gateway for any traffic from 127.0.0.0/8 ...instead of just 127.0.0.1. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-07-26 18:20:01 +02:00
Stefano Brivio	86b273150a	tcp, udp: Allow binding ports in init namespace to both tap and loopback Traffic with loopback source address will be forwarded to the direct loopback connection in the namespace, and the tap interface is used for the rest. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-07-26 14:10:29 +02:00
Stefano Brivio	17765f8de0	checksum: Introduce AVX2 implementation, unify helpers Provide an AVX2-based function using compiler intrinsics for TCP/IP-style checksums. The load/unpack/add idea and implementation is largely based on code from BESS (the Berkeley Extensible Software Switch) licensed as 3-Clause BSD, with a number of modifications to further decrease pipeline stalls and to minimise cache pollution. This speeds up considerably data paths from sockets to tap interfaces, decreasing overhead for checksum computation, with 16-64KiB packet buffers, from approximately 11% to 7%. The rest is just syscalls at this point. While at it, provide convenience targets in the Makefile for avx2, avx2_debug, and debug targets -- these simply add target-specific CFLAGS to the build. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-07-26 07:18:50 +02:00
Stefano Brivio	49631a38a6	tcp, udp: Split IPv4 and IPv6 bound port sets Allow to bind IPv4 and IPv6 ports to tap, namespace or init separately. Port numbers of TCP ports that are bound in a namespace are also bound for UDP for convenience (e.g. iperf3), and IPv4 ports are always bound if the corresponding IPv6 port is bound (socket might not have the IPV6_V6ONLY option set). This will also be configurable later. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-07-21 17:44:39 +02:00
Stefano Brivio	64a0ba3b27	udp: Introduce recvmmsg()/sendmmsg(), zero-copy path from socket Packets are received directly onto pre-cooked, static buffers for IPv4 (with partial checksum pre-calculation) and IPv6 frames, with pre-filled Ethernet addresses and, partially, IP headers, and sent out from the same buffers with sendmmsg(), for both passt and pasta (non-local traffic only) modes. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-07-21 12:01:04 +02:00
Stefano Brivio	33482d5bf2	passt: Add PASTA mode, major rework PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host connectivity to an otherwise disconnected, unprivileged network and user namespace, similarly to slirp4netns. Given that the implementation is largely overlapping with PASST, no separate binary is built: 'pasta' (and 'passt4netns' for clarity) both link to 'passt', and the mode of operation is selected depending on how the binary is invoked. Usage example: $ unshare -rUn # echo $$ 1871759 $ ./pasta 1871759 # From another terminal # udhcpc -i pasta0 2>/dev/null # ping -c1 pasta.pizza PING pasta.pizza (64.190.62.111) 56(84) bytes of data. 64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms --- pasta.pizza ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms # ping -c1 spaghetti.pizza PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes 64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms --- spaghetti.pizza ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms This entails a major rework, especially with regard to the storage of tracked connections and to the semantics of epoll(7) references. Indexing TCP and UDP bindings merely by socket proved to be inflexible and unsuitable to handle different connection flows: pasta also provides Layer-2 to Layer-2 socket mapping between init and a separate namespace for local connections, using a pair of splice() system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local bindings. 
For instance, building on the previous example: # ip link set dev lo up # iperf3 -s $ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4 [SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender [SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver iperf Done. epoll(7) references now include a generic part in order to demultiplex data to the relevant protocol handler, using 24 bits for the socket number, and an opaque portion reserved for usage by the single protocol handlers, in order to track sockets back to corresponding connections and bindings. A number of fixes pertaining to TCP state machine and congestion window handling are also included here. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-07-17 11:04:22 +02:00
Stefano Brivio	e07f539ae0	udp, passt: Introduce socket packet buffer, avoid getsockname() for UDP This is in preparation for scatter-gather IO on the UDP receive path: save a getsockname() syscall by setting a flag if we get the numbering of all bound sockets in a strict sequence (expected, in practice) and repurpose the tap buffer to be also a socket receive buffer, passing it down to protocol handlers. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-04-30 14:52:18 +02:00
Stefano Brivio	605af213c5	udp: Connection tracking for ephemeral, local ports, and related fixes As we support UDP forwarding for packets that are sent to local ports, we actually need some kind of connection tracking for UDP. While at it, this commit introduces a number of vaguely related fixes for issues observed while trying this out. In detail: - implement an explicit, albeit minimalistic, connection tracking for UDP, to allow usage of ephemeral ports by the guest and by the host at the same time, by binding them dynamically as needed, and to allow mapping address changes for packets with a loopback address as destination - set the guest MAC address whenever we receive a packet from tap instead of waiting for an ARP request, and set it to broadcast on start, otherwise DHCPv6 might not work if all DHCPv6 requests time out before the guest starts talking IPv4 - split context IPv6 address into address we assign, global or site address seen on tap, and link-local address seen on tap, and make sure we use the addresses we've seen as destination (link-local choice depends on source address). Similarly, for IPv4, split into address we assign and address we observe, and use the address we observe as destination - introduce a clock_gettime() syscall right after epoll_wait() wakes up, so that we can remove all the other ones and pass the current timestamp to tap and socket handlers -- this is additionally needed by UDP to time out bindings to ephemeral ports and mappings between loopback address and a local address - rename sock_l4_add() to sock_l4(), no semantic changes intended - include <arpa/inet.h> in passt.c before kernel headers so that we can use <netinet/in.h> macros to check IPv6 address types, and remove a duplicate <linux/ip.h> inclusion Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-04-29 17:15:26 +02:00
Stefano Brivio	b3b3451ae2	udp: Disable SO_ZEROCOPY again ...on a second thought, this won't really help with veth, and actually causes a significant overhead as we get EPOLLERR whenever another process is tapping on the traffic. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-04-25 10:41:55 +02:00
Stefano Brivio	38b50dba47	passt: Spare some syscalls, add some optimisations from profiling Avoid a bunch of syscalls on forwarding paths by: - storing minimum and maximum file descriptor numbers for each protocol, fall back to SO_PROTOCOL query only on overlaps - allocating a larger receive buffer -- this can result in more coalesced packets than sendmmsg() can take (UIO_MAXIOV, i.e. 1024), so make sure we don't exceed that within a single call to protocol tap handlers - nesting the handling loop in tap_handler() in the receive loop, so that we have better chances of filling our receive buffer in fewer calls - skipping the recvfrom() in the UDP handler on EPOLLERR -- there's nothing to be done in that case and while at it: - restore the 20ms timer interval for periodic (TCP) events, I accidentally changed that to 100ms in an earlier commit - attempt using SO_ZEROCOPY for UDP -- if it's not available, sendmmsg() will succeed anyway - fix the handling of the status code from sendmmsg(), if it fails, we'll try to discard the first message, hence return 1 from the UDP handler Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-04-23 22:22:37 +02:00
Stefano Brivio	6488c3e848	tcp, udp: Replace loopback source address by gateway address This is symmetric with tap operation and addressing model, and allows again to reach the guest behind the tap interface by contacting the local address. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-04-22 17:03:43 +02:00
Stefano Brivio	1f7cf04d34	passt: Introduce packet batching mechanism Receive packets in batches from AF_UNIX, check if they can be sent with a single syscall, and batch them up with sendmmsg() in case. A bit rudimentary, currently only implemented for UDP, but it seems to work. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-04-22 13:39:36 +02:00
Stefano Brivio	f435e38927	udp: Fix typo in tcp_tap_handler() documentation Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-03-17 10:57:42 +01:00
Stefano Brivio	93977868f9	udp: Use size_t for return value of recvfrom() Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-03-17 10:57:42 +01:00
Stefano Brivio	8bca388e8a	passt: Assorted fixes from "fresh eyes" review A bunch of fixes not worth single commits at this stage, notably: - make buffer, length parameter ordering consistent in ARP, DHCP, NDP handlers - strict checking of buffer, message and option length in DHCP handler (a malicious client could have easily crashed it) - set up forwarding for IPv4 and IPv6, and masquerading with nft for IPv4, from demo script - get rid of separate slow and fast timers, we don't save any overhead that way - stricter checking of buffer lengths as passed to tap handlers - proper dequeuing from qemu socket back-end: I accidentally trashed messages that were bundled up together in a single tap read operation -- the length header tells us what's the size of the next frame, but there's no apparent limit to the number of messages we get with one single receive - rework some bits of the TCP state machine, now passive and active connection closes appear to be robust -- introduce a new FIN_WAIT_1_SOCK_FIN state indicating a FIN_WAIT_1 with a FIN flag from socket - streamline TCP option parsing routine - track TCP state changes to stderr (this is temporary, proper debugging and syslogging support pending) - observe that multiplying a number by four might very well change its value, and this happens to be the case for the data offset from the TCP header as we check if it's the same as the total length to find out if it's a duplicated ACK segment - recent estimates suggest that the duration of a millisecond is closer to a million nanoseconds than a thousand of them, this trend is now reflected into the timespec_diff_ms() convenience routine Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-02-21 11:55:49 +01:00
Stefano Brivio	105b916361	passt: New design and implementation with native Layer 4 sockets This is a reimplementation, partially building on the earlier draft, that uses L4 sockets (SOCK_DGRAM, SOCK_STREAM) instead of SOCK_RAW, providing L4-L2 translation functionality without requiring any security capability. Conceptually, this follows the design presented at: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Networking.md The most significant novelty here comes from TCP and UDP translation layers. In particular, the TCP state and translation logic follows the intent of being minimalistic, without reimplementing a full TCP stack in either direction, and synchronising as much as possible the TCP dynamic and flows between guest and host kernel. Another important introduction concerns addressing, port translation and forwarding. The Layer 4 implementations now attempt to bind on all unbound ports, in order to forward connections in a transparent way. While at it: - the qemu 'tap' back-end can't be used as-is by qrap anymore, because of explicit checks now introduced in qemu to ensure that the corresponding file descriptor is actually a tap device. For this reason, qrap now operates on a 'socket' back-end type, accounting for and building the additional header reporting frame length - provide a demo script that sets up namespaces, addresses and routes, and starts the daemon. A virtual machine started in the network namespace, wrapped by qrap, will now directly interface with passt and communicate using Layer 4 sockets provided by the host kernel. Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2021-02-16 09:28:55 +01:00
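
Commit 8bca388e8a (2021-02-21) in the log above describes two unit slips: a millisecond being treated as a thousand nanoseconds rather than a million, and the TCP data offset being compared without accounting for the fact that it counts 32-bit words. The sketch below illustrates both conversions. It is not the code from the repository: the log only names `timespec_diff_ms()`, so the signature used here is an assumption, and `tcp_hdr_len()` is a hypothetical helper introduced for illustration.

```c
#include <stdint.h>
#include <time.h>

/* Assumed signature: the commit log names timespec_diff_ms() but does not
 * show its prototype. Returns a - b in whole milliseconds.
 */
long timespec_diff_ms(const struct timespec *a, const struct timespec *b)
{
	/* Work in nanoseconds first so a negative tv_nsec difference
	 * borrows correctly from the seconds part.
	 */
	long long ns = ((long long)a->tv_sec - (long long)b->tv_sec) * 1000000000LL
		     + ((long long)a->tv_nsec - (long long)b->tv_nsec);

	/* One millisecond is 1,000,000 nanoseconds, not 1,000. */
	return (long)(ns / 1000000LL);
}

/* Hypothetical helper: the TCP data offset field counts 32-bit words, so
 * the header length in bytes is four times its value.
 */
unsigned int tcp_hdr_len(uint8_t doff)
{
	return (unsigned int)doff * 4;
}
```

For example, with `a = { 2, 100000000 }` and `b = { 1, 900000000 }`, `timespec_diff_ms(&a, &b)` yields 200, and `tcp_hdr_len(5)` yields 20, the size of a TCP header without options.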


227 Commits