passt: Relicense to GPL 2.0, or any later version
In practical terms, passt doesn't benefit from the additional
protection offered by the AGPL over the GPL, because it's not
suitable to be executed over a computer network.
Further, restricting the distribution under the version 3 of the GPL
wouldn't provide any practical advantage either, as long as the passt
codebase is concerned, and might cause unnecessary compatibility
dilemmas.
Change licensing terms to the GNU General Public License Version 2,
or any later version, with written permission from all current and
past contributors, namely: myself, David Gibson, Laine Stump, Andrea
Bolognani, Paul Holzinger, Richard W.M. Jones, Chris Kuhn, Florian
Weimer, Giuseppe Scrivano, Stefan Hajnoczi, and Vasiliy Ulyanov.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-04-05 18:11:44 +00:00
|
|
|
// SPDX-License-Identifier: GPL-2.0-or-later
|
passt: New design and implementation with native Layer 4 sockets
This is a reimplementation, partially building on the earlier draft,
that uses L4 sockets (SOCK_DGRAM, SOCK_STREAM) instead of SOCK_RAW,
providing L4-L2 translation functionality without requiring any
security capability.
Conceptually, this follows the design presented at:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Networking.md
The most significant novelty here comes from TCP and UDP translation
layers. In particular, the TCP state and translation logic follows
the intent of being minimalistic, without reimplementing a full TCP
stack in either direction, and synchronising as much as possible the
TCP dynamic and flows between guest and host kernel.
Another important introduction concerns addressing, port translation
and forwarding. The Layer 4 implementations now attempt to bind on
all unbound ports, in order to forward connections in a transparent
way.
While at it:
- the qemu 'tap' back-end can't be used as-is by qrap anymore,
because of explicit checks now introduced in qemu to ensure that
the corresponding file descriptor is actually a tap device. For
this reason, qrap now operates on a 'socket' back-end type,
accounting for and building the additional header reporting
frame length
- provide a demo script that sets up namespaces, addresses and
routes, and starts the daemon. A virtual machine started in the
network namespace, wrapped by qrap, will now directly interface
with passt and communicate using Layer 4 sockets provided by the
host kernel.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-02-16 06:25:09 +00:00
|
|
|
|
2020-07-20 14:41:49 +00:00
|
|
|
/* PASST - Plug A Simple Socket Transport
|
passt: Add PASTA mode, major rework
PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host
connectivity to an otherwise disconnected, unprivileged network
and user namespace, similarly to slirp4netns. Given that the
implementation is largely overlapping with PASST, no separate binary
is built: 'pasta' (and 'passt4netns' for clarity) both link to
'passt', and the mode of operation is selected depending on how the
binary is invoked. Usage example:
$ unshare -rUn
# echo $$
1871759
$ ./pasta 1871759 # From another terminal
# udhcpc -i pasta0 2>/dev/null
# ping -c1 pasta.pizza
PING pasta.pizza (64.190.62.111) 56(84) bytes of data.
64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms
--- pasta.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms
# ping -c1 spaghetti.pizza
PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes
64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms
--- spaghetti.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms
This entails a major rework, especially with regard to the storage of
tracked connections and to the semantics of epoll(7) references.
Indexing TCP and UDP bindings merely by socket proved to be
inflexible and unsuitable to handle different connection flows: pasta
also provides Layer-2 to Layer-2 socket mapping between init and a
separate namespace for local connections, using a pair of splice()
system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local
bindings. For instance, building on the previous example:
# ip link set dev lo up
# iperf3 -s
$ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4
[SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender
[SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver
iperf Done.
epoll(7) references now include a generic part in order to
demultiplex data to the relevant protocol handler, using 24
bits for the socket number, and an opaque portion reserved for
usage by the single protocol handlers, in order to track sockets
back to corresponding connections and bindings.
A number of fixes pertaining to TCP state machine and congestion
window handling are also included here.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-07-17 06:34:53 +00:00
|
|
|
* for qemu/UNIX domain socket mode
|
|
|
|
*
|
|
|
|
* PASTA - Pack A Subtle Tap Abstraction
|
|
|
|
* for network namespace/tap device mode
|
2020-07-17 23:02:39 +00:00
|
|
|
*
|
2020-07-20 14:41:49 +00:00
|
|
|
* passt.c - Daemon implementation
|
2020-07-13 20:55:46 +00:00
|
|
|
*
|
passt: New design and implementation with native Layer 4 sockets
This is a reimplementation, partially building on the earlier draft,
that uses L4 sockets (SOCK_DGRAM, SOCK_STREAM) instead of SOCK_RAW,
providing L4-L2 translation functionality without requiring any
security capability.
Conceptually, this follows the design presented at:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Networking.md
The most significant novelty here comes from TCP and UDP translation
layers. In particular, the TCP state and translation logic follows
the intent of being minimalistic, without reimplementing a full TCP
stack in either direction, and synchronising as much as possible the
TCP dynamic and flows between guest and host kernel.
Another important introduction concerns addressing, port translation
and forwarding. The Layer 4 implementations now attempt to bind on
all unbound ports, in order to forward connections in a transparent
way.
While at it:
- the qemu 'tap' back-end can't be used as-is by qrap anymore,
because of explicit checks now introduced in qemu to ensure that
the corresponding file descriptor is actually a tap device. For
this reason, qrap now operates on a 'socket' back-end type,
accounting for and building the additional header reporting
frame length
- provide a demo script that sets up namespaces, addresses and
routes, and starts the daemon. A virtual machine started in the
network namespace, wrapped by qrap, will now directly interface
with passt and communicate using Layer 4 sockets provided by the
host kernel.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-02-16 06:25:09 +00:00
|
|
|
* Copyright (c) 2020-2021 Red Hat GmbH
|
2020-07-13 20:55:46 +00:00
|
|
|
* Author: Stefano Brivio <sbrivio@redhat.com>
|
|
|
|
*
|
passt: Add PASTA mode, major rework
PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host
connectivity to an otherwise disconnected, unprivileged network
and user namespace, similarly to slirp4netns. Given that the
implementation is largely overlapping with PASST, no separate binary
is built: 'pasta' (and 'passt4netns' for clarity) both link to
'passt', and the mode of operation is selected depending on how the
binary is invoked. Usage example:
$ unshare -rUn
# echo $$
1871759
$ ./pasta 1871759 # From another terminal
# udhcpc -i pasta0 2>/dev/null
# ping -c1 pasta.pizza
PING pasta.pizza (64.190.62.111) 56(84) bytes of data.
64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms
--- pasta.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms
# ping -c1 spaghetti.pizza
PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes
64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms
--- spaghetti.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms
This entails a major rework, especially with regard to the storage of
tracked connections and to the semantics of epoll(7) references.
Indexing TCP and UDP bindings merely by socket proved to be
inflexible and unsuitable to handle different connection flows: pasta
also provides Layer-2 to Layer-2 socket mapping between init and a
separate namespace for local connections, using a pair of splice()
system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local
bindings. For instance, building on the previous example:
# ip link set dev lo up
# iperf3 -s
$ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4
[SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender
[SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver
iperf Done.
epoll(7) references now include a generic part in order to
demultiplex data to the relevant protocol handler, using 24
bits for the socket number, and an opaque portion reserved for
usage by the single protocol handlers, in order to track sockets
back to corresponding connections and bindings.
A number of fixes pertaining to TCP state machine and congestion
window handling are also included here.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-07-17 06:34:53 +00:00
|
|
|
* Grab Ethernet frames from AF_UNIX socket (in "passt" mode) or tap device (in
|
|
|
|
* "pasta" mode), build SOCK_DGRAM/SOCK_STREAM sockets for each 5-tuple from
|
|
|
|
* TCP, UDP packets, perform connection tracking and forward them. Forward
|
|
|
|
* packets received on sockets back to the UNIX domain socket (typically, a
|
|
|
|
* socket virtio_net file descriptor from qemu) or to the tap device (typically,
|
|
|
|
* created in a separate network namespace).
|
2020-07-13 20:55:46 +00:00
|
|
|
*/
|
|
|
|
|
|
|
|
#include <sys/epoll.h>
|
2021-08-12 13:42:43 +00:00
|
|
|
#include <fcntl.h>
|
2021-09-26 21:19:40 +00:00
|
|
|
#include <sys/mman.h>
|
2021-03-18 10:36:55 +00:00
|
|
|
#include <sys/resource.h>
|
2022-09-12 12:24:07 +00:00
|
|
|
#include <stdbool.h>
|
2020-07-13 20:55:46 +00:00
|
|
|
#include <stdlib.h>
|
|
|
|
#include <unistd.h>
|
|
|
|
#include <netdb.h>
|
2023-03-08 03:00:22 +00:00
|
|
|
#include <signal.h>
|
|
|
|
#include <stdio.h>
|
2020-07-13 20:55:46 +00:00
|
|
|
#include <string.h>
|
|
|
|
#include <errno.h>
|
passt: New design and implementation with native Layer 4 sockets
This is a reimplementation, partially building on the earlier draft,
that uses L4 sockets (SOCK_DGRAM, SOCK_STREAM) instead of SOCK_RAW,
providing L4-L2 translation functionality without requiring any
security capability.
Conceptually, this follows the design presented at:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Networking.md
The most significant novelty here comes from TCP and UDP translation
layers. In particular, the TCP state and translation logic follows
the intent of being minimalistic, without reimplementing a full TCP
stack in either direction, and synchronising as much as possible the
TCP dynamic and flows between guest and host kernel.
Another important introduction concerns addressing, port translation
and forwarding. The Layer 4 implementations now attempt to bind on
all unbound ports, in order to forward connections in a transparent
way.
While at it:
- the qemu 'tap' back-end can't be used as-is by qrap anymore,
because of explicit checks now introduced in qemu to ensure that
the corresponding file descriptor is actually a tap device. For
this reason, qrap now operates on a 'socket' back-end type,
accounting for and building the additional header reporting
frame length
- provide a demo script that sets up namespaces, addresses and
routes, and starts the daemon. A virtual machine started in the
network namespace, wrapped by qrap, will now directly interface
with passt and communicate using Layer 4 sockets provided by the
host kernel.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-02-16 06:25:09 +00:00
|
|
|
#include <time.h>
|
2021-03-18 06:49:08 +00:00
|
|
|
#include <syslog.h>
|
2021-10-13 20:25:03 +00:00
|
|
|
#include <sys/prctl.h>
|
2021-10-21 02:26:08 +00:00
|
|
|
#include <netinet/if_ether.h>
|
2023-11-30 02:02:21 +00:00
|
|
|
#ifdef HAS_GETRANDOM
|
|
|
|
#include <sys/random.h>
|
|
|
|
#endif
|
2021-10-21 02:26:08 +00:00
|
|
|
|
passt: Add PASTA mode, major rework
PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host
connectivity to an otherwise disconnected, unprivileged network
and user namespace, similarly to slirp4netns. Given that the
implementation is largely overlapping with PASST, no separate binary
is built: 'pasta' (and 'passt4netns' for clarity) both link to
'passt', and the mode of operation is selected depending on how the
binary is invoked. Usage example:
$ unshare -rUn
# echo $$
1871759
$ ./pasta 1871759 # From another terminal
# udhcpc -i pasta0 2>/dev/null
# ping -c1 pasta.pizza
PING pasta.pizza (64.190.62.111) 56(84) bytes of data.
64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms
--- pasta.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms
# ping -c1 spaghetti.pizza
PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes
64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms
--- spaghetti.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms
This entails a major rework, especially with regard to the storage of
tracked connections and to the semantics of epoll(7) references.
Indexing TCP and UDP bindings merely by socket proved to be
inflexible and unsuitable to handle different connection flows: pasta
also provides Layer-2 to Layer-2 socket mapping between init and a
separate namespace for local connections, using a pair of splice()
system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local
bindings. For instance, building on the previous example:
# ip link set dev lo up
# iperf3 -s
$ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4
[SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender
[SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver
iperf Done.
epoll(7) references now include a generic part in order to
demultiplex data to the relevant protocol handler, using 24
bits for the socket number, and an opaque portion reserved for
usage by the single protocol handlers, in order to track sockets
back to corresponding connections and bindings.
A number of fixes pertaining to TCP state machine and congestion
window handling are also included here.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-07-17 06:34:53 +00:00
|
|
|
#include "util.h"
|
2020-07-20 14:41:49 +00:00
|
|
|
#include "passt.h"
|
2021-10-05 19:15:01 +00:00
|
|
|
#include "dhcp.h"
|
2021-04-13 19:59:47 +00:00
|
|
|
#include "dhcpv6.h"
|
2022-09-12 12:24:03 +00:00
|
|
|
#include "isolation.h"
|
2021-05-21 09:14:48 +00:00
|
|
|
#include "pcap.h"
|
passt: Add PASTA mode, major rework
PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host
connectivity to an otherwise disconnected, unprivileged network
and user namespace, similarly to slirp4netns. Given that the
implementation is largely overlapping with PASST, no separate binary
is built: 'pasta' (and 'passt4netns' for clarity) both link to
'passt', and the mode of operation is selected depending on how the
binary is invoked. Usage example:
$ unshare -rUn
# echo $$
1871759
$ ./pasta 1871759 # From another terminal
# udhcpc -i pasta0 2>/dev/null
# ping -c1 pasta.pizza
PING pasta.pizza (64.190.62.111) 56(84) bytes of data.
64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms
--- pasta.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms
# ping -c1 spaghetti.pizza
PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes
64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms
--- spaghetti.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms
This entails a major rework, especially with regard to the storage of
tracked connections and to the semantics of epoll(7) references.
Indexing TCP and UDP bindings merely by socket proved to be
inflexible and unsuitable to handle different connection flows: pasta
also provides Layer-2 to Layer-2 socket mapping between init and a
separate namespace for local connections, using a pair of splice()
system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local
bindings. For instance, building on the previous example:
# ip link set dev lo up
# iperf3 -s
$ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4
[SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender
[SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver
iperf Done.
epoll(7) references now include a generic part in order to
demultiplex data to the relevant protocol handler, using 24
bits for the socket number, and an opaque portion reserved for
usage by the single protocol handlers, in order to track sockets
back to corresponding connections and bindings.
A number of fixes pertaining to TCP state machine and congestion
window handling are also included here.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-07-17 06:34:53 +00:00
|
|
|
#include "tap.h"
|
2021-08-12 13:42:43 +00:00
|
|
|
#include "conf.h"
|
2021-10-11 10:01:31 +00:00
|
|
|
#include "pasta.h"
|
2022-02-28 15:18:44 +00:00
|
|
|
#include "arch.h"
|
2022-09-24 07:53:15 +00:00
|
|
|
#include "log.h"
|
2024-01-16 00:50:38 +00:00
|
|
|
#include "tcp_splice.h"
|
2020-07-17 23:02:39 +00:00
|
|
|
|
2021-09-26 21:19:40 +00:00
|
|
|
#define EPOLL_EVENTS 8
|
passt: New design and implementation with native Layer 4 sockets
This is a reimplementation, partially building on the earlier draft,
that uses L4 sockets (SOCK_DGRAM, SOCK_STREAM) instead of SOCK_RAW,
providing L4-L2 translation functionality without requiring any
security capability.
Conceptually, this follows the design presented at:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Networking.md
The most significant novelty here comes from TCP and UDP translation
layers. In particular, the TCP state and translation logic follows
the intent of being minimalistic, without reimplementing a full TCP
stack in either direction, and synchronising as much as possible the
TCP dynamic and flows between guest and host kernel.
Another important introduction concerns addressing, port translation
and forwarding. The Layer 4 implementations now attempt to bind on
all unbound ports, in order to forward connections in a transparent
way.
While at it:
- the qemu 'tap' back-end can't be used as-is by qrap anymore,
because of explicit checks now introduced in qemu to ensure that
the corresponding file descriptor is actually a tap device. For
this reason, qrap now operates on a 'socket' back-end type,
accounting for and building the additional header reporting
frame length
- provide a demo script that sets up namespaces, addresses and
routes, and starts the daemon. A virtual machine started in the
network namespace, wrapped by qrap, will now directly interface
with passt and communicate using Layer 4 sockets provided by the
host kernel.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-02-16 06:25:09 +00:00
|
|
|
|
2024-01-16 00:50:36 +00:00
|
|
|
#define TIMER_INTERVAL__ MIN(TCP_TIMER_INTERVAL, UDP_TIMER_INTERVAL)
|
|
|
|
#define TIMER_INTERVAL_ MIN(TIMER_INTERVAL__, ICMP_TIMER_INTERVAL)
|
|
|
|
#define TIMER_INTERVAL MIN(TIMER_INTERVAL_, FLOW_TIMER_INTERVAL)
|
2021-04-30 12:52:18 +00:00
|
|
|
|
2021-09-26 21:19:40 +00:00
|
|
|
char pkt_buf[PKT_BUF_BYTES] __attribute__ ((aligned(PAGE_SIZE)));
|
2020-07-13 20:55:46 +00:00
|
|
|
|
2024-01-16 00:50:37 +00:00
|
|
|
char *epoll_type_str[] = {
|
2023-08-11 05:12:27 +00:00
|
|
|
[EPOLL_TYPE_TCP] = "connected TCP socket",
|
2024-01-16 00:50:38 +00:00
|
|
|
[EPOLL_TYPE_TCP_SPLICE] = "connected spliced TCP socket",
|
2023-08-11 05:12:27 +00:00
|
|
|
[EPOLL_TYPE_TCP_LISTEN] = "listening TCP socket",
|
2023-08-11 05:12:26 +00:00
|
|
|
[EPOLL_TYPE_TCP_TIMER] = "TCP timer",
|
2023-08-11 05:12:21 +00:00
|
|
|
[EPOLL_TYPE_UDP] = "UDP socket",
|
|
|
|
[EPOLL_TYPE_ICMP] = "ICMP socket",
|
|
|
|
[EPOLL_TYPE_ICMPV6] = "ICMPv6 socket",
|
2023-08-11 05:12:22 +00:00
|
|
|
[EPOLL_TYPE_NSQUIT] = "namespace inotify",
|
2023-08-11 05:12:29 +00:00
|
|
|
[EPOLL_TYPE_TAP_PASTA] = "/dev/net/tun device",
|
|
|
|
[EPOLL_TYPE_TAP_PASST] = "connected qemu socket",
|
2023-08-11 05:12:28 +00:00
|
|
|
[EPOLL_TYPE_TAP_LISTEN] = "listening qemu socket",
|
2021-05-10 05:50:35 +00:00
|
|
|
};
|
2024-01-16 00:50:37 +00:00
|
|
|
static_assert(ARRAY_SIZE(epoll_type_str) == EPOLL_NUM_TYPES,
|
|
|
|
"epoll_type_str[] doesn't match enum epoll_type");
|
2020-07-21 08:48:24 +00:00
|
|
|
|
|
|
|
/**
|
2021-10-05 17:20:26 +00:00
|
|
|
* post_handler() - Run periodic and deferred tasks for L4 protocol handlers
|
2020-07-21 08:48:24 +00:00
|
|
|
* @c: Execution context
|
udp: Connection tracking for ephemeral, local ports, and related fixes
As we support UDP forwarding for packets that are sent to local
ports, we actually need some kind of connection tracking for UDP.
While at it, this commit introduces a number of vaguely related fixes
for issues observed while trying this out. In detail:
- implement an explicit, albeit minimalistic, connection tracking
for UDP, to allow usage of ephemeral ports by the guest and by
the host at the same time, by binding them dynamically as needed,
and to allow mapping address changes for packets with a loopback
address as destination
- set the guest MAC address whenever we receive a packet from tap
instead of waiting for an ARP request, and set it to broadcast on
start, otherwise DHCPv6 might not work if all DHCPv6 requests time
out before the guest starts talking IPv4
- split context IPv6 address into address we assign, global or site
address seen on tap, and link-local address seen on tap, and make
sure we use the addresses we've seen as destination (link-local
choice depends on source address). Similarly, for IPv4, split into
address we assign and address we observe, and use the address we
observe as destination
- introduce a clock_gettime() syscall right after epoll_wait() wakes
up, so that we can remove all the other ones and pass the current
timestamp to tap and socket handlers -- this is additionally needed
by UDP to time out bindings to ephemeral ports and mappings between
loopback address and a local address
- rename sock_l4_add() to sock_l4(), no semantic changes intended
- include <arpa/inet.h> in passt.c before kernel headers so that we
can use <netinet/in.h> macros to check IPv6 address types, and
remove a duplicate <linux/ip.h> inclusion
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-04-29 14:59:20 +00:00
|
|
|
* @now: Current timestamp
|
2020-07-21 08:48:24 +00:00
|
|
|
*/
|
2022-03-26 06:23:21 +00:00
|
|
|
static void post_handler(struct ctx *c, const struct timespec *now)
|
2020-07-21 08:48:24 +00:00
|
|
|
{
|
2021-10-05 17:20:26 +00:00
|
|
|
#define CALL_PROTO_HANDLER(c, now, lc, uc) \
|
|
|
|
do { \
|
|
|
|
extern void \
|
2022-03-18 11:18:19 +00:00
|
|
|
lc ## _defer_handler (struct ctx *c) \
|
2021-10-05 17:20:26 +00:00
|
|
|
__attribute__ ((weak)); \
|
|
|
|
\
|
|
|
|
if (!c->no_ ## lc) { \
|
|
|
|
if (lc ## _defer_handler) \
|
2022-03-18 11:18:19 +00:00
|
|
|
lc ## _defer_handler(c); \
|
2021-10-05 17:20:26 +00:00
|
|
|
\
|
|
|
|
if (timespec_diff_ms((now), &c->lc.timer_run) \
|
|
|
|
>= uc ## _TIMER_INTERVAL) { \
|
|
|
|
lc ## _timer(c, now); \
|
|
|
|
c->lc.timer_run = *now; \
|
|
|
|
} \
|
|
|
|
} \
|
|
|
|
} while (0)
|
|
|
|
|
2022-03-18 11:18:19 +00:00
|
|
|
/* NOLINTNEXTLINE(bugprone-branch-clone): intervals can be the same */
|
2021-10-05 17:20:26 +00:00
|
|
|
CALL_PROTO_HANDLER(c, now, tcp, TCP);
|
2022-03-18 11:18:19 +00:00
|
|
|
/* NOLINTNEXTLINE(bugprone-branch-clone): intervals can be the same */
|
2021-10-05 17:20:26 +00:00
|
|
|
CALL_PROTO_HANDLER(c, now, udp, UDP);
|
2022-03-18 11:18:19 +00:00
|
|
|
/* NOLINTNEXTLINE(bugprone-branch-clone): intervals can be the same */
|
2021-10-05 17:20:26 +00:00
|
|
|
CALL_PROTO_HANDLER(c, now, icmp, ICMP);
|
|
|
|
|
2024-01-16 00:50:36 +00:00
|
|
|
flow_defer_handler(c, now);
|
2021-10-05 17:20:26 +00:00
|
|
|
#undef CALL_PROTO_HANDLER
|
2020-07-21 08:48:24 +00:00
|
|
|
}
|
|
|
|
|
2023-11-30 02:02:21 +00:00
|
|
|
/**
|
|
|
|
* secret_init() - Create secret value for SipHash calculations
|
|
|
|
* @c: Execution context
|
|
|
|
*/
|
|
|
|
static void secret_init(struct ctx *c)
|
|
|
|
{
|
|
|
|
#ifndef HAS_GETRANDOM
|
|
|
|
int dev_random = open("/dev/random", O_RDONLY);
|
|
|
|
unsigned int random_read = 0;
|
|
|
|
|
|
|
|
while (dev_random && random_read < sizeof(c->hash_secret)) {
|
|
|
|
int ret = read(dev_random,
|
|
|
|
(uint8_t *)&c->hash_secret + random_read,
|
|
|
|
sizeof(c->hash_secret) - random_read);
|
|
|
|
|
|
|
|
if (ret == -1 && errno == EINTR)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (ret <= 0)
|
|
|
|
break;
|
|
|
|
|
|
|
|
random_read += ret;
|
|
|
|
}
|
|
|
|
if (dev_random >= 0)
|
|
|
|
close(dev_random);
|
|
|
|
if (random_read < sizeof(c->hash_secret)) {
|
|
|
|
#else
|
|
|
|
if (getrandom(&c->hash_secret, sizeof(c->hash_secret),
|
|
|
|
GRND_RANDOM) < 0) {
|
|
|
|
#endif /* !HAS_GETRANDOM */
|
|
|
|
perror("TCP initial sequence getrandom");
|
|
|
|
exit(EXIT_FAILURE);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2021-09-26 22:16:51 +00:00
|
|
|
/**
|
|
|
|
* timer_init() - Set initial timestamp for timer runs to current time
|
|
|
|
* @c: Execution context
|
|
|
|
* @now: Current timestamp
|
|
|
|
*/
|
2022-03-26 06:23:21 +00:00
|
|
|
static void timer_init(struct ctx *c, const struct timespec *now)
|
2021-09-26 22:16:51 +00:00
|
|
|
{
|
|
|
|
c->tcp.timer_run = c->udp.timer_run = c->icmp.timer_run = *now;
|
|
|
|
}
|
|
|
|
|
2021-07-21 10:01:04 +00:00
|
|
|
/**
|
|
|
|
* proto_update_l2_buf() - Update scatter-gather L2 buffers in protocol handlers
|
|
|
|
* @eth_d: Ethernet destination address, NULL if unchanged
|
|
|
|
* @eth_s: Ethernet source address, NULL if unchanged
|
|
|
|
*/
|
2023-08-22 05:29:57 +00:00
|
|
|
void proto_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
|
2021-07-21 10:01:04 +00:00
|
|
|
{
|
2023-08-22 05:29:57 +00:00
|
|
|
tcp_update_l2_buf(eth_d, eth_s);
|
|
|
|
udp_update_l2_buf(eth_d, eth_s);
|
2021-07-21 10:01:04 +00:00
|
|
|
}
|
|
|
|
|
passt, pasta: Namespace-based sandboxing, defer seccomp policy application
To reach (at least) a conceptually equivalent security level as
implemented by --enable-sandbox in slirp4netns, we need to create a
new mount namespace and pivot_root() into a new (empty) mountpoint, so
that passt and pasta can't access any filesystem resource after
initialisation.
While at it, also detach IPC, PID (only for passt, to prevent
vulnerabilities based on the knowledge of a target PID), and UTS
namespaces.
With this approach, if we apply the seccomp filters right after the
configuration step, the number of allowed syscalls grows further. To
prevent this, defer the application of seccomp policies after the
initialisation phase, before the main loop, that's where we expect bad
things to happen, potentially. This way, we get back to 22 allowed
syscalls for passt and 34 for pasta, on x86_64.
While at it, move #syscalls notes to specific code paths wherever it
conceptually makes sense.
We have to open all the file handles we'll ever need before
sandboxing:
- the packet capture file can only be opened once, drop instance
numbers from the default path and use the (pre-sandbox) PID instead
- /proc/net/tcp{,v6} and /proc/net/udp{,v6}, for automatic detection
of bound ports in pasta mode, are now opened only once, before
sandboxing, and their handles are stored in the execution context
- the UNIX domain socket for passt is also bound only once, before
sandboxing: to reject clients after the first one, instead of
closing the listening socket, keep it open, accept and immediately
discard new connection if we already have a valid one
Clarify the (unchanged) behaviour for --netns-only in the man page.
To actually make passt and pasta processes run in a separate PID
namespace, we need to unshare(CLONE_NEWPID) before forking to
background (if configured to do so). Introduce a small daemon()
implementation, __daemon(), that additionally saves the PID file
before forking. While running in foreground, the process itself can't
move to a new PID namespace (a process can't change the notion of its
own PID): mention that in the man page.
For some reason, fork() in a detached PID namespace causes SIGTERM
and SIGQUIT to be ignored, even if the handler is still reported as
SIG_DFL: add a signal handler that just exits.
We can now drop most of the pasta_child_handler() implementation,
that took care of terminating all processes running in the same
namespace, if pasta started a shell: the shell itself is now the
init process in that namespace, and all children will terminate
once the init process exits.
Issuing 'echo $$' in a detached PID namespace won't return the
actual namespace PID as seen from the init namespace: adapt
demo and test setup scripts to reflect that.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-02-07 20:11:37 +00:00
|
|
|
/**
|
|
|
|
* exit_handler() - Signal handler for SIGQUIT and SIGTERM
|
|
|
|
* @unused: Unused, handler deals with SIGQUIT and SIGTERM only
|
|
|
|
*
|
|
|
|
* TODO: After unsharing the PID namespace and forking, SIG_DFL for SIGTERM and
|
|
|
|
* SIGQUIT unexpectedly doesn't cause the process to terminate, figure out why.
|
2022-07-13 01:36:09 +00:00
|
|
|
*
|
|
|
|
* #syscalls exit_group
|
passt, pasta: Namespace-based sandboxing, defer seccomp policy application
To reach (at least) a conceptually equivalent security level as
implemented by --enable-sandbox in slirp4netns, we need to create a
new mount namespace and pivot_root() into a new (empty) mountpoint, so
that passt and pasta can't access any filesystem resource after
initialisation.
While at it, also detach IPC, PID (only for passt, to prevent
vulnerabilities based on the knowledge of a target PID), and UTS
namespaces.
With this approach, if we apply the seccomp filters right after the
configuration step, the number of allowed syscalls grows further. To
prevent this, defer the application of seccomp policies after the
initialisation phase, before the main loop, that's where we expect bad
things to happen, potentially. This way, we get back to 22 allowed
syscalls for passt and 34 for pasta, on x86_64.
While at it, move #syscalls notes to specific code paths wherever it
conceptually makes sense.
We have to open all the file handles we'll ever need before
sandboxing:
- the packet capture file can only be opened once, drop instance
numbers from the default path and use the (pre-sandbox) PID instead
- /proc/net/tcp{,v6} and /proc/net/udp{,v6}, for automatic detection
of bound ports in pasta mode, are now opened only once, before
sandboxing, and their handles are stored in the execution context
- the UNIX domain socket for passt is also bound only once, before
sandboxing: to reject clients after the first one, instead of
closing the listening socket, keep it open, accept and immediately
discard new connection if we already have a valid one
Clarify the (unchanged) behaviour for --netns-only in the man page.
To actually make passt and pasta processes run in a separate PID
namespace, we need to unshare(CLONE_NEWPID) before forking to
background (if configured to do so). Introduce a small daemon()
implementation, __daemon(), that additionally saves the PID file
before forking. While running in foreground, the process itself can't
move to a new PID namespace (a process can't change the notion of its
own PID): mention that in the man page.
For some reason, fork() in a detached PID namespace causes SIGTERM
and SIGQUIT to be ignored, even if the handler is still reported as
SIG_DFL: add a signal handler that just exits.
We can now drop most of the pasta_child_handler() implementation,
that took care of terminating all processes running in the same
namespace, if pasta started a shell: the shell itself is now the
init process in that namespace, and all children will terminate
once the init process exits.
Issuing 'echo $$' in a detached PID namespace won't return the
actual namespace PID as seen from the init namespace: adapt
demo and test setup scripts to reflect that.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-02-07 20:11:37 +00:00
|
|
|
*/
|
|
|
|
void exit_handler(int signal)
|
|
|
|
{
|
|
|
|
(void)signal;
|
|
|
|
|
|
|
|
exit(EXIT_SUCCESS);
|
2021-10-14 10:17:47 +00:00
|
|
|
}
|
|
|
|
|
2020-07-13 20:55:46 +00:00
|
|
|
/**
|
|
|
|
* main() - Entry point and main loop
|
|
|
|
* @argc: Argument count
|
2021-08-12 13:42:43 +00:00
|
|
|
* @argv: Options, plus optional target PID for pasta mode
|
2020-07-13 20:55:46 +00:00
|
|
|
*
|
2021-10-21 07:41:13 +00:00
|
|
|
* Return: non-zero on failure
|
2021-10-13 20:25:03 +00:00
|
|
|
*
|
passt, pasta: Namespace-based sandboxing, defer seccomp policy application
To reach (at least) a conceptually equivalent security level as
implemented by --enable-sandbox in slirp4netns, we need to create a
new mount namespace and pivot_root() into a new (empty) mountpoint, so
that passt and pasta can't access any filesystem resource after
initialisation.
While at it, also detach IPC, PID (only for passt, to prevent
vulnerabilities based on the knowledge of a target PID), and UTS
namespaces.
With this approach, if we apply the seccomp filters right after the
configuration step, the number of allowed syscalls grows further. To
prevent this, defer the application of seccomp policies after the
initialisation phase, before the main loop, that's where we expect bad
things to happen, potentially. This way, we get back to 22 allowed
syscalls for passt and 34 for pasta, on x86_64.
While at it, move #syscalls notes to specific code paths wherever it
conceptually makes sense.
We have to open all the file handles we'll ever need before
sandboxing:
- the packet capture file can only be opened once, drop instance
numbers from the default path and use the (pre-sandbox) PID instead
- /proc/net/tcp{,v6} and /proc/net/udp{,v6}, for automatic detection
of bound ports in pasta mode, are now opened only once, before
sandboxing, and their handles are stored in the execution context
- the UNIX domain socket for passt is also bound only once, before
sandboxing: to reject clients after the first one, instead of
closing the listening socket, keep it open, accept and immediately
discard new connection if we already have a valid one
Clarify the (unchanged) behaviour for --netns-only in the man page.
To actually make passt and pasta processes run in a separate PID
namespace, we need to unshare(CLONE_NEWPID) before forking to
background (if configured to do so). Introduce a small daemon()
implementation, __daemon(), that additionally saves the PID file
before forking. While running in foreground, the process itself can't
move to a new PID namespace (a process can't change the notion of its
own PID): mention that in the man page.
For some reason, fork() in a detached PID namespace causes SIGTERM
and SIGQUIT to be ignored, even if the handler is still reported as
SIG_DFL: add a signal handler that just exits.
We can now drop most of the pasta_child_handler() implementation,
that took care of terminating all processes running in the same
namespace, if pasta started a shell: the shell itself is now the
init process in that namespace, and all children will terminate
once the init process exits.
Issuing 'echo $$' in a detached PID namespace won't return the
actual namespace PID as seen from the init namespace: adapt
demo and test setup scripts to reflect that.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-02-07 20:11:37 +00:00
|
|
|
* #syscalls read write writev
|
|
|
|
* #syscalls socket bind connect getsockopt setsockopt s390x:socketcall close
|
2022-02-26 22:39:19 +00:00
|
|
|
* #syscalls recvfrom sendto shutdown
|
|
|
|
* #syscalls armv6l:recv armv7l:recv ppc64le:recv
|
|
|
|
* #syscalls armv6l:send armv7l:send ppc64le:send
|
|
|
|
* #syscalls accept4|accept listen epoll_ctl epoll_wait|epoll_pwait epoll_pwait
|
|
|
|
* #syscalls clock_gettime armv6l:clock_gettime64 armv7l:clock_gettime64
|
2020-07-13 20:55:46 +00:00
|
|
|
*/
|
|
|
|
int main(int argc, char **argv)
|
|
|
|
{
|
2022-02-18 15:12:11 +00:00
|
|
|
int nfds, i, devnull_fd = -1, pidfile_fd = -1, quit_fd;
|
2020-07-21 08:48:24 +00:00
|
|
|
struct epoll_event events[EPOLL_EVENTS];
|
2022-07-13 01:20:45 +00:00
|
|
|
char *log_name, argv0[PATH_MAX], *name;
|
2020-07-13 20:55:46 +00:00
|
|
|
struct ctx c = { 0 };
|
2021-03-18 10:36:55 +00:00
|
|
|
struct rlimit limit;
|
udp: Connection tracking for ephemeral, local ports, and related fixes
As we support UDP forwarding for packets that are sent to local
ports, we actually need some kind of connection tracking for UDP.
While at it, this commit introduces a number of vaguely related fixes
for issues observed while trying this out. In detail:
- implement an explicit, albeit minimalistic, connection tracking
for UDP, to allow usage of ephemeral ports by the guest and by
the host at the same time, by binding them dynamically as needed,
and to allow mapping address changes for packets with a loopback
address as destination
- set the guest MAC address whenever we receive a packet from tap
instead of waiting for an ARP request, and set it to broadcast on
start, otherwise DHCPv6 might not work if all DHCPv6 requests time
out before the guest starts talking IPv4
- split context IPv6 address into address we assign, global or site
address seen on tap, and link-local address seen on tap, and make
sure we use the addresses we've seen as destination (link-local
choice depends on source address). Similarly, for IPv4, split into
address we assign and address we observe, and use the address we
observe as destination
- introduce a clock_gettime() syscall right after epoll_wait() wakes
up, so that we can remove all the other ones and pass the current
timestamp to tap and socket handlers -- this is additionally needed
by UDP to time out bindings to ephemeral ports and mappings between
loopback address and a local address
- rename sock_l4_add() to sock_l4(), no semantic changes intended
- include <arpa/inet.h> in passt.c before kernel headers so that we
can use <netinet/in.h> macros to check IPv6 address types, and
remove a duplicate <linux/ip.h> inclusion
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-04-29 14:59:20 +00:00
|
|
|
struct timespec now;
|
passt, pasta: Namespace-based sandboxing, defer seccomp policy application
To reach (at least) a conceptually equivalent security level as
implemented by --enable-sandbox in slirp4netns, we need to create a
new mount namespace and pivot_root() into a new (empty) mountpoint, so
that passt and pasta can't access any filesystem resource after
initialisation.
While at it, also detach IPC, PID (only for passt, to prevent
vulnerabilities based on the knowledge of a target PID), and UTS
namespaces.
With this approach, if we apply the seccomp filters right after the
configuration step, the number of allowed syscalls grows further. To
prevent this, defer the application of seccomp policies after the
initialisation phase, before the main loop, that's where we expect bad
things to happen, potentially. This way, we get back to 22 allowed
syscalls for passt and 34 for pasta, on x86_64.
While at it, move #syscalls notes to specific code paths wherever it
conceptually makes sense.
We have to open all the file handles we'll ever need before
sandboxing:
- the packet capture file can only be opened once, drop instance
numbers from the default path and use the (pre-sandbox) PID instead
- /proc/net/tcp{,v6} and /proc/net/udp{,v6}, for automatic detection
of bound ports in pasta mode, are now opened only once, before
sandboxing, and their handles are stored in the execution context
- the UNIX domain socket for passt is also bound only once, before
sandboxing: to reject clients after the first one, instead of
closing the listening socket, keep it open, accept and immediately
discard new connection if we already have a valid one
Clarify the (unchanged) behaviour for --netns-only in the man page.
To actually make passt and pasta processes run in a separate PID
namespace, we need to unshare(CLONE_NEWPID) before forking to
background (if configured to do so). Introduce a small daemon()
implementation, __daemon(), that additionally saves the PID file
before forking. While running in foreground, the process itself can't
move to a new PID namespace (a process can't change the notion of its
own PID): mention that in the man page.
For some reason, fork() in a detached PID namespace causes SIGTERM
and SIGQUIT to be ignored, even if the handler is still reported as
SIG_DFL: add a signal handler that just exits.
We can now drop most of the pasta_child_handler() implementation,
that took care of terminating all processes running in the same
namespace, if pasta started a shell: the shell itself is now the
init process in that namespace, and all children will terminate
once the init process exits.
Issuing 'echo $$' in a detached PID namespace won't return the
actual namespace PID as seen from the init namespace: adapt
demo and test setup scripts to reflect that.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-02-07 20:11:37 +00:00
|
|
|
struct sigaction sa;
|
passt: Add PASTA mode, major rework
PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host
connectivity to an otherwise disconnected, unprivileged network
and user namespace, similarly to slirp4netns. Given that the
implementation is largely overlapping with PASST, no separate binary
is built: 'pasta' (and 'passt4netns' for clarity) both link to
'passt', and the mode of operation is selected depending on how the
binary is invoked. Usage example:
$ unshare -rUn
# echo $$
1871759
$ ./pasta 1871759 # From another terminal
# udhcpc -i pasta0 2>/dev/null
# ping -c1 pasta.pizza
PING pasta.pizza (64.190.62.111) 56(84) bytes of data.
64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms
--- pasta.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms
# ping -c1 spaghetti.pizza
PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes
64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms
--- spaghetti.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms
This entails a major rework, especially with regard to the storage of
tracked connections and to the semantics of epoll(7) references.
Indexing TCP and UDP bindings merely by socket proved to be
inflexible and unsuitable to handle different connection flows: pasta
also provides Layer-2 to Layer-2 socket mapping between init and a
separate namespace for local connections, using a pair of splice()
system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local
bindings. For instance, building on the previous example:
# ip link set dev lo up
# iperf3 -s
$ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4
[SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender
[SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver
iperf Done.
epoll(7) references now include a generic part in order to
demultiplex data to the relevant protocol handler, using 24
bits for the socket number, and an opaque portion reserved for
usage by the single protocol handlers, in order to track sockets
back to corresponding connections and bindings.
A number of fixes pertaining to TCP state machine and congestion
window handling are also included here.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-07-17 06:34:53 +00:00
|
|
|
|
2022-02-28 15:18:44 +00:00
|
|
|
arch_avx2_exec(argv);
|
|
|
|
|
2022-10-14 04:25:31 +00:00
|
|
|
isolate_initial();
|
2021-10-14 00:47:03 +00:00
|
|
|
|
2022-09-12 12:24:07 +00:00
|
|
|
c.pasta_netns_fd = c.fd_tap = c.fd_tap_listen = -1;
|
passt, pasta: Namespace-based sandboxing, defer seccomp policy application
To reach (at least) a conceptually equivalent security level as
implemented by --enable-sandbox in slirp4netns, we need to create a
new mount namespace and pivot_root() into a new (empty) mountpoint, so
that passt and pasta can't access any filesystem resource after
initialisation.
While at it, also detach IPC, PID (only for passt, to prevent
vulnerabilities based on the knowledge of a target PID), and UTS
namespaces.
With this approach, if we apply the seccomp filters right after the
configuration step, the number of allowed syscalls grows further. To
prevent this, defer the application of seccomp policies after the
initialisation phase, before the main loop, that's where we expect bad
things to happen, potentially. This way, we get back to 22 allowed
syscalls for passt and 34 for pasta, on x86_64.
While at it, move #syscalls notes to specific code paths wherever it
conceptually makes sense.
We have to open all the file handles we'll ever need before
sandboxing:
- the packet capture file can only be opened once, drop instance
numbers from the default path and use the (pre-sandbox) PID instead
- /proc/net/tcp{,v6} and /proc/net/udp{,v6}, for automatic detection
of bound ports in pasta mode, are now opened only once, before
sandboxing, and their handles are stored in the execution context
- the UNIX domain socket for passt is also bound only once, before
sandboxing: to reject clients after the first one, instead of
closing the listening socket, keep it open, accept and immediately
discard new connection if we already have a valid one
Clarify the (unchanged) behaviour for --netns-only in the man page.
To actually make passt and pasta processes run in a separate PID
namespace, we need to unshare(CLONE_NEWPID) before forking to
background (if configured to do so). Introduce a small daemon()
implementation, __daemon(), that additionally saves the PID file
before forking. While running in foreground, the process itself can't
move to a new PID namespace (a process can't change the notion of its
own PID): mention that in the man page.
For some reason, fork() in a detached PID namespace causes SIGTERM
and SIGQUIT to be ignored, even if the handler is still reported as
SIG_DFL: add a signal handler that just exits.
We can now drop most of the pasta_child_handler() implementation,
that took care of terminating all processes running in the same
namespace, if pasta started a shell: the shell itself is now the
init process in that namespace, and all children will terminate
once the init process exits.
Issuing 'echo $$' in a detached PID namespace won't return the
actual namespace PID as seen from the init namespace: adapt
demo and test setup scripts to reflect that.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-02-07 20:11:37 +00:00
|
|
|
|
|
|
|
sigemptyset(&sa.sa_mask);
|
|
|
|
sa.sa_flags = 0;
|
|
|
|
sa.sa_handler = exit_handler;
|
|
|
|
sigaction(SIGTERM, &sa, NULL);
|
|
|
|
sigaction(SIGQUIT, &sa, NULL);
|
2021-09-14 22:28:40 +00:00
|
|
|
|
2022-02-17 00:41:28 +00:00
|
|
|
if (argc < 1)
|
|
|
|
exit(EXIT_FAILURE);
|
|
|
|
|
2022-07-13 01:20:45 +00:00
|
|
|
strncpy(argv0, argv[0], PATH_MAX - 1);
|
|
|
|
name = basename(argv0);
|
|
|
|
if (strstr(name, "pasta")) {
|
2021-09-14 22:28:40 +00:00
|
|
|
sa.sa_handler = pasta_child_handler;
|
2023-04-13 17:32:13 +00:00
|
|
|
if (sigaction(SIGCHLD, &sa, NULL)) {
|
|
|
|
die("Couldn't install signal handlers: %s",
|
|
|
|
strerror(errno));
|
|
|
|
}
|
|
|
|
|
|
|
|
if (signal(SIGPIPE, SIG_IGN) == SIG_ERR) {
|
|
|
|
die("Couldn't set disposition for SIGPIPE: %s",
|
|
|
|
strerror(errno));
|
|
|
|
}
|
2021-09-14 22:28:40 +00:00
|
|
|
|
passt: Add PASTA mode, major rework
PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host
connectivity to an otherwise disconnected, unprivileged network
and user namespace, similarly to slirp4netns. Given that the
implementation is largely overlapping with PASST, no separate binary
is built: 'pasta' (and 'passt4netns' for clarity) both link to
'passt', and the mode of operation is selected depending on how the
binary is invoked. Usage example:
$ unshare -rUn
# echo $$
1871759
$ ./pasta 1871759 # From another terminal
# udhcpc -i pasta0 2>/dev/null
# ping -c1 pasta.pizza
PING pasta.pizza (64.190.62.111) 56(84) bytes of data.
64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms
--- pasta.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms
# ping -c1 spaghetti.pizza
PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes
64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms
--- spaghetti.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms
This entails a major rework, especially with regard to the storage of
tracked connections and to the semantics of epoll(7) references.
Indexing TCP and UDP bindings merely by socket proved to be
inflexible and unsuitable to handle different connection flows: pasta
also provides Layer-2 to Layer-2 socket mapping between init and a
separate namespace for local connections, using a pair of splice()
system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local
bindings. For instance, building on the previous example:
# ip link set dev lo up
# iperf3 -s
$ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4
[SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender
[SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver
iperf Done.
epoll(7) references now include a generic part in order to
demultiplex data to the relevant protocol handler, using 24
bits for the socket number, and an opaque portion reserved for
usage by the single protocol handlers, in order to track sockets
back to corresponding connections and bindings.
A number of fixes pertaining to TCP state machine and congestion
window handling are also included here.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-07-17 06:34:53 +00:00
|
|
|
c.mode = MODE_PASTA;
|
|
|
|
log_name = "pasta";
|
2022-07-13 01:20:45 +00:00
|
|
|
} else if (strstr(name, "passt")) {
|
passt: Add PASTA mode, major rework
PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host
connectivity to an otherwise disconnected, unprivileged network
and user namespace, similarly to slirp4netns. Given that the
implementation is largely overlapping with PASST, no separate binary
is built: 'pasta' (and 'passt4netns' for clarity) both link to
'passt', and the mode of operation is selected depending on how the
binary is invoked. Usage example:
$ unshare -rUn
# echo $$
1871759
$ ./pasta 1871759 # From another terminal
# udhcpc -i pasta0 2>/dev/null
# ping -c1 pasta.pizza
PING pasta.pizza (64.190.62.111) 56(84) bytes of data.
64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms
--- pasta.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms
# ping -c1 spaghetti.pizza
PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes
64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms
--- spaghetti.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms
This entails a major rework, especially with regard to the storage of
tracked connections and to the semantics of epoll(7) references.
Indexing TCP and UDP bindings merely by socket proved to be
inflexible and unsuitable to handle different connection flows: pasta
also provides Layer-2 to Layer-2 socket mapping between init and a
separate namespace for local connections, using a pair of splice()
system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local
bindings. For instance, building on the previous example:
# ip link set dev lo up
# iperf3 -s
$ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4
[SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender
[SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver
iperf Done.
epoll(7) references now include a generic part in order to
demultiplex data to the relevant protocol handler, using 24
bits for the socket number, and an opaque portion reserved for
usage by the single protocol handlers, in order to track sockets
back to corresponding connections and bindings.
A number of fixes pertaining to TCP state machine and congestion
window handling are also included here.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-07-17 06:34:53 +00:00
|
|
|
c.mode = MODE_PASST;
|
|
|
|
log_name = "passt";
|
2022-02-17 00:41:28 +00:00
|
|
|
} else {
|
|
|
|
exit(EXIT_FAILURE);
|
passt: Add PASTA mode, major rework
PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host
connectivity to an otherwise disconnected, unprivileged network
and user namespace, similarly to slirp4netns. Given that the
implementation is largely overlapping with PASST, no separate binary
is built: 'pasta' (and 'passt4netns' for clarity) both link to
'passt', and the mode of operation is selected depending on how the
binary is invoked. Usage example:
$ unshare -rUn
# echo $$
1871759
$ ./pasta 1871759 # From another terminal
# udhcpc -i pasta0 2>/dev/null
# ping -c1 pasta.pizza
PING pasta.pizza (64.190.62.111) 56(84) bytes of data.
64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms
--- pasta.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms
# ping -c1 spaghetti.pizza
PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes
64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms
--- spaghetti.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms
This entails a major rework, especially with regard to the storage of
tracked connections and to the semantics of epoll(7) references.
Indexing TCP and UDP bindings merely by socket proved to be
inflexible and unsuitable to handle different connection flows: pasta
also provides Layer-2 to Layer-2 socket mapping between init and a
separate namespace for local connections, using a pair of splice()
system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local
bindings. For instance, building on the previous example:
# ip link set dev lo up
# iperf3 -s
$ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4
[SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender
[SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver
iperf Done.
epoll(7) references now include a generic part in order to
demultiplex data to the relevant protocol handler, using 24
bits for the socket number, and an opaque portion reserved for
usage by the single protocol handlers, in order to track sockets
back to corresponding connections and bindings.
A number of fixes pertaining to TCP state machine and congestion
window handling are also included here.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-07-17 06:34:53 +00:00
|
|
|
}
|
2020-07-13 20:55:46 +00:00
|
|
|
|
2022-02-26 22:37:05 +00:00
|
|
|
madvise(pkt_buf, TAP_BUF_BYTES, MADV_HUGEPAGE);
|
2021-09-26 21:19:40 +00:00
|
|
|
|
2021-10-13 23:21:29 +00:00
|
|
|
__openlog(log_name, 0, LOG_DAEMON);
|
2021-08-12 13:42:43 +00:00
|
|
|
|
2022-08-23 06:31:51 +00:00
|
|
|
c.epollfd = epoll_create1(EPOLL_CLOEXEC);
|
2020-07-13 20:55:46 +00:00
|
|
|
if (c.epollfd == -1) {
|
|
|
|
perror("epoll_create1");
|
|
|
|
exit(EXIT_FAILURE);
|
|
|
|
}
|
|
|
|
|
2021-03-18 10:36:55 +00:00
|
|
|
if (getrlimit(RLIMIT_NOFILE, &limit)) {
|
|
|
|
perror("getrlimit");
|
|
|
|
exit(EXIT_FAILURE);
|
|
|
|
}
|
2022-03-18 23:33:46 +00:00
|
|
|
c.nofile = limit.rlim_cur = limit.rlim_max;
|
2021-03-18 10:36:55 +00:00
|
|
|
if (setrlimit(RLIMIT_NOFILE, &limit)) {
|
|
|
|
perror("setrlimit");
|
|
|
|
exit(EXIT_FAILURE);
|
|
|
|
}
|
2021-10-05 17:27:04 +00:00
|
|
|
sock_probe_mem(&c);
|
2021-03-18 10:36:55 +00:00
|
|
|
|
2022-05-01 04:36:34 +00:00
|
|
|
conf(&c, argc, argv);
|
|
|
|
trace_init(c.trace);
|
|
|
|
|
2023-03-08 02:47:45 +00:00
|
|
|
if (c.force_stderr || isatty(fileno(stdout)))
|
2022-05-01 04:36:34 +00:00
|
|
|
__openlog(log_name, LOG_PERROR, LOG_DAEMON);
|
|
|
|
|
|
|
|
quit_fd = pasta_netns_quit_init(&c);
|
|
|
|
|
2021-08-12 13:42:43 +00:00
|
|
|
tap_sock_init(&c);
|
|
|
|
|
2023-11-30 02:02:21 +00:00
|
|
|
secret_init(&c);
|
|
|
|
|
2021-09-26 22:16:51 +00:00
|
|
|
clock_gettime(CLOCK_MONOTONIC, &now);
|
|
|
|
|
2024-01-16 00:50:43 +00:00
|
|
|
flow_init();
|
|
|
|
|
2022-05-01 04:36:34 +00:00
|
|
|
if ((!c.no_udp && udp_init(&c)) || (!c.no_tcp && tcp_init(&c)))
|
2021-03-20 20:12:19 +00:00
|
|
|
exit(EXIT_FAILURE);
|
|
|
|
|
2022-10-26 15:55:53 +00:00
|
|
|
if (!c.no_icmp)
|
|
|
|
icmp_init();
|
|
|
|
|
2023-08-22 05:29:57 +00:00
|
|
|
proto_update_l2_buf(c.mac_guest, c.mac);
|
2021-10-05 19:15:01 +00:00
|
|
|
|
2022-07-22 05:31:17 +00:00
|
|
|
if (c.ifi4 && !c.no_dhcp)
|
2021-10-05 19:15:01 +00:00
|
|
|
dhcp_init();
|
|
|
|
|
2022-07-22 05:31:17 +00:00
|
|
|
if (c.ifi6 && !c.no_dhcpv6)
|
2021-04-13 19:59:47 +00:00
|
|
|
dhcpv6_init(&c);
|
|
|
|
|
passt, pasta: Namespace-based sandboxing, defer seccomp policy application
To reach (at least) a conceptually equivalent security level as
implemented by --enable-sandbox in slirp4netns, we need to create a
new mount namespace and pivot_root() into a new (empty) mountpoint, so
that passt and pasta can't access any filesystem resource after
initialisation.
While at it, also detach IPC, PID (only for passt, to prevent
vulnerabilities based on the knowledge of a target PID), and UTS
namespaces.
With this approach, if we apply the seccomp filters right after the
configuration step, the number of allowed syscalls grows further. To
prevent this, defer the application of seccomp policies after the
initialisation phase, before the main loop, that's where we expect bad
things to happen, potentially. This way, we get back to 22 allowed
syscalls for passt and 34 for pasta, on x86_64.
While at it, move #syscalls notes to specific code paths wherever it
conceptually makes sense.
We have to open all the file handles we'll ever need before
sandboxing:
- the packet capture file can only be opened once, drop instance
numbers from the default path and use the (pre-sandbox) PID instead
- /proc/net/tcp{,v6} and /proc/net/udp{,v6}, for automatic detection
of bound ports in pasta mode, are now opened only once, before
sandboxing, and their handles are stored in the execution context
- the UNIX domain socket for passt is also bound only once, before
sandboxing: to reject clients after the first one, instead of
closing the listening socket, keep it open, accept and immediately
discard new connection if we already have a valid one
Clarify the (unchanged) behaviour for --netns-only in the man page.
To actually make passt and pasta processes run in a separate PID
namespace, we need to unshare(CLONE_NEWPID) before forking to
background (if configured to do so). Introduce a small daemon()
implementation, __daemon(), that additionally saves the PID file
before forking. While running in foreground, the process itself can't
move to a new PID namespace (a process can't change the notion of its
own PID): mention that in the man page.
For some reason, fork() in a detached PID namespace causes SIGTERM
and SIGQUIT to be ignored, even if the handler is still reported as
SIG_DFL: add a signal handler that just exits.
We can now drop most of the pasta_child_handler() implementation,
that took care of terminating all processes running in the same
namespace, if pasta started a shell: the shell itself is now the
init process in that namespace, and all children will terminate
once the init process exits.
Issuing 'echo $$' in a detached PID namespace won't return the
actual namespace PID as seen from the init namespace: adapt
demo and test setup scripts to reflect that.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-02-07 20:11:37 +00:00
|
|
|
pcap_init(&c);
|
|
|
|
|
2022-04-05 06:15:29 +00:00
|
|
|
if (!c.foreground) {
|
2022-08-23 06:31:51 +00:00
|
|
|
if ((devnull_fd = open("/dev/null", O_RDWR | O_CLOEXEC)) < 0) {
|
2022-04-05 06:15:29 +00:00
|
|
|
perror("/dev/null open");
|
|
|
|
exit(EXIT_FAILURE);
|
|
|
|
}
|
|
|
|
}
|
passt, pasta: Namespace-based sandboxing, defer seccomp policy application
To reach (at least) a conceptually equivalent security level as
implemented by --enable-sandbox in slirp4netns, we need to create a
new mount namespace and pivot_root() into a new (empty) mountpoint, so
that passt and pasta can't access any filesystem resource after
initialisation.
While at it, also detach IPC, PID (only for passt, to prevent
vulnerabilities based on the knowledge of a target PID), and UTS
namespaces.
With this approach, if we apply the seccomp filters right after the
configuration step, the number of allowed syscalls grows further. To
prevent this, defer the application of seccomp policies after the
initialisation phase, before the main loop, that's where we expect bad
things to happen, potentially. This way, we get back to 22 allowed
syscalls for passt and 34 for pasta, on x86_64.
While at it, move #syscalls notes to specific code paths wherever it
conceptually makes sense.
We have to open all the file handles we'll ever need before
sandboxing:
- the packet capture file can only be opened once, drop instance
numbers from the default path and use the (pre-sandbox) PID instead
- /proc/net/tcp{,v6} and /proc/net/udp{,v6}, for automatic detection
of bound ports in pasta mode, are now opened only once, before
sandboxing, and their handles are stored in the execution context
- the UNIX domain socket for passt is also bound only once, before
sandboxing: to reject clients after the first one, instead of
closing the listening socket, keep it open, accept and immediately
discard new connection if we already have a valid one
Clarify the (unchanged) behaviour for --netns-only in the man page.
To actually make passt and pasta processes run in a separate PID
namespace, we need to unshare(CLONE_NEWPID) before forking to
background (if configured to do so). Introduce a small daemon()
implementation, __daemon(), that additionally saves the PID file
before forking. While running in foreground, the process itself can't
move to a new PID namespace (a process can't change the notion of its
own PID): mention that in the man page.
For some reason, fork() in a detached PID namespace causes SIGTERM
and SIGQUIT to be ignored, even if the handler is still reported as
SIG_DFL: add a signal handler that just exits.
We can now drop most of the pasta_child_handler() implementation,
that took care of terminating all processes running in the same
namespace, if pasta started a shell: the shell itself is now the
init process in that namespace, and all children will terminate
once the init process exits.
Issuing 'echo $$' in a detached PID namespace won't return the
actual namespace PID as seen from the init namespace: adapt
demo and test setup scripts to reflect that.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-02-07 20:11:37 +00:00
|
|
|
|
2022-04-05 06:15:29 +00:00
|
|
|
if (*c.pid_file) {
|
|
|
|
if ((pidfile_fd = open(c.pid_file,
|
2022-07-22 17:30:10 +00:00
|
|
|
O_CREAT | O_TRUNC | O_WRONLY | O_CLOEXEC,
|
2022-04-05 06:15:29 +00:00
|
|
|
S_IRUSR | S_IWUSR)) < 0) {
|
|
|
|
perror("PID file open");
|
|
|
|
exit(EXIT_FAILURE);
|
|
|
|
}
|
|
|
|
}
|
passt, pasta: Namespace-based sandboxing, defer seccomp policy application
To reach (at least) a conceptually equivalent security level as
implemented by --enable-sandbox in slirp4netns, we need to create a
new mount namespace and pivot_root() into a new (empty) mountpoint, so
that passt and pasta can't access any filesystem resource after
initialisation.
While at it, also detach IPC, PID (only for passt, to prevent
vulnerabilities based on the knowledge of a target PID), and UTS
namespaces.
With this approach, if we apply the seccomp filters right after the
configuration step, the number of allowed syscalls grows further. To
prevent this, defer the application of seccomp policies after the
initialisation phase, before the main loop, that's where we expect bad
things to happen, potentially. This way, we get back to 22 allowed
syscalls for passt and 34 for pasta, on x86_64.
While at it, move #syscalls notes to specific code paths wherever it
conceptually makes sense.
We have to open all the file handles we'll ever need before
sandboxing:
- the packet capture file can only be opened once, drop instance
numbers from the default path and use the (pre-sandbox) PID instead
- /proc/net/tcp{,v6} and /proc/net/udp{,v6}, for automatic detection
of bound ports in pasta mode, are now opened only once, before
sandboxing, and their handles are stored in the execution context
- the UNIX domain socket for passt is also bound only once, before
sandboxing: to reject clients after the first one, instead of
closing the listening socket, keep it open, accept and immediately
discard new connection if we already have a valid one
Clarify the (unchanged) behaviour for --netns-only in the man page.
To actually make passt and pasta processes run in a separate PID
namespace, we need to unshare(CLONE_NEWPID) before forking to
background (if configured to do so). Introduce a small daemon()
implementation, __daemon(), that additionally saves the PID file
before forking. While running in foreground, the process itself can't
move to a new PID namespace (a process can't change the notion of its
own PID): mention that in the man page.
For some reason, fork() in a detached PID namespace causes SIGTERM
and SIGQUIT to be ignored, even if the handler is still reported as
SIG_DFL: add a signal handler that just exits.
We can now drop most of the pasta_child_handler() implementation,
that took care of terminating all processes running in the same
namespace, if pasta started a shell: the shell itself is now the
init process in that namespace, and all children will terminate
once the init process exits.
Issuing 'echo $$' in a detached PID namespace won't return the
actual namespace PID as seen from the init namespace: adapt
demo and test setup scripts to reflect that.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-02-07 20:11:37 +00:00
|
|
|
|
2023-02-15 08:24:37 +00:00
|
|
|
if (isolate_prefork(&c))
|
|
|
|
die("Failed to sandbox process, exiting");
|
2021-09-26 22:16:51 +00:00
|
|
|
|
log: setlogmask(0) can actually result in a system call, don't use it
Before commit 32d07f5e59f2 ("passt, pasta: Completely avoid dynamic
memory allocation"), we didn't store the current log mask in a
variable, and we fetched it using setlogmask(0) wherever needed.
But after that commit, we can use our log_mask copy instead. And we
should: with recent glibc versions, setlogmask(0) actually results in
a system call, which causes a substantial overhead with high transfer
rates: we use setlogmask(0) even to decide we don't want to print
debug messages.
Now that we rely on log_mask in early stages, before setlogmask() is
called, we need to initialise that variable to the special LOG_EMERG
mask value right away: define LOG_EARLY to make this clearer, and,
while at it, group conditions in vlogmsg() into something more terse.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-02-01 23:22:16 +00:00
|
|
|
/* Once the log mask is not LOG_EARLY, we will no longer log to stderr
|
|
|
|
* if there was a log file specified.
|
2023-02-15 08:24:29 +00:00
|
|
|
*/
|
|
|
|
if (c.debug)
|
|
|
|
__setlogmask(LOG_UPTO(LOG_DEBUG));
|
|
|
|
else if (c.quiet)
|
|
|
|
__setlogmask(LOG_UPTO(LOG_ERR));
|
|
|
|
else
|
|
|
|
__setlogmask(LOG_UPTO(LOG_INFO));
|
|
|
|
|
passt, pasta: Namespace-based sandboxing, defer seccomp policy application
To reach (at least) a conceptually equivalent security level as
implemented by --enable-sandbox in slirp4netns, we need to create a
new mount namespace and pivot_root() into a new (empty) mountpoint, so
that passt and pasta can't access any filesystem resource after
initialisation.
While at it, also detach IPC, PID (only for passt, to prevent
vulnerabilities based on the knowledge of a target PID), and UTS
namespaces.
With this approach, if we apply the seccomp filters right after the
configuration step, the number of allowed syscalls grows further. To
prevent this, defer the application of seccomp policies after the
initialisation phase, before the main loop, that's where we expect bad
things to happen, potentially. This way, we get back to 22 allowed
syscalls for passt and 34 for pasta, on x86_64.
While at it, move #syscalls notes to specific code paths wherever it
conceptually makes sense.
We have to open all the file handles we'll ever need before
sandboxing:
- the packet capture file can only be opened once, drop instance
numbers from the default path and use the (pre-sandbox) PID instead
- /proc/net/tcp{,v6} and /proc/net/udp{,v6}, for automatic detection
of bound ports in pasta mode, are now opened only once, before
sandboxing, and their handles are stored in the execution context
- the UNIX domain socket for passt is also bound only once, before
sandboxing: to reject clients after the first one, instead of
closing the listening socket, keep it open, accept and immediately
discard new connection if we already have a valid one
Clarify the (unchanged) behaviour for --netns-only in the man page.
To actually make passt and pasta processes run in a separate PID
namespace, we need to unshare(CLONE_NEWPID) before forking to
background (if configured to do so). Introduce a small daemon()
implementation, __daemon(), that additionally saves the PID file
before forking. While running in foreground, the process itself can't
move to a new PID namespace (a process can't change the notion of its
own PID): mention that in the man page.
For some reason, fork() in a detached PID namespace causes SIGTERM
and SIGQUIT to be ignored, even if the handler is still reported as
SIG_DFL: add a signal handler that just exits.
We can now drop most of the pasta_child_handler() implementation,
that took care of terminating all processes running in the same
namespace, if pasta started a shell: the shell itself is now the
init process in that namespace, and all children will terminate
once the init process exits.
Issuing 'echo $$' in a detached PID namespace won't return the
actual namespace PID as seen from the init namespace: adapt
demo and test setup scripts to reflect that.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-02-07 20:11:37 +00:00
|
|
|
if (!c.foreground)
|
|
|
|
__daemon(pidfile_fd, devnull_fd);
|
|
|
|
else
|
|
|
|
write_pidfile(pidfile_fd, getpid());
|
|
|
|
|
pasta: Wait for tap to be set up before spawning command
Adapted from a patch by Paul Holzinger: when pasta spawns a command,
operating without a pre-existing user and network namespace, it needs
to wait for the tap device to be configured and its handler ready,
before the command is actually executed.
Otherwise, something like:
pasta --config-net nslookup passt.top
usually fails as the nslookup command is issued before the network
interface is ready.
We can't adopt a simpler approach based on SIGSTOP and SIGCONT here:
the child runs in a separate PID namespace, so it can't send SIGSTOP
to itself as the kernel sees the child as init process and blocks
the delivery of the signal.
We could send SIGSTOP from the parent, but this wouldn't avoid the
possible condition where the child isn't ready to wait for it when
the parent sends it, also raised by Paul -- and SIGSTOP can't be
blocked, so it can never be pending.
Use SIGUSR1 instead: mask it before clone(), so that the child starts
with it blocked, and can safely wait for it. Once the parent is
ready, it sends SIGUSR1 to the child. If SIGUSR1 is sent before the
child is waiting for it, the kernel will queue it for us, because
it's blocked.
Reported-by: Paul Holzinger <pholzing@redhat.com>
Fixes: 1392bc5ca002 ("Allow pasta to take a command to execute")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-02-12 11:22:59 +00:00
|
|
|
if (pasta_child_pid)
|
|
|
|
kill(pasta_child_pid, SIGUSR1);
|
|
|
|
|
2022-10-14 04:25:31 +00:00
|
|
|
isolate_postfork(&c);
|
2021-10-14 10:17:47 +00:00
|
|
|
|
2021-09-26 22:16:51 +00:00
|
|
|
timer_init(&c, &now);
|
2022-02-18 15:12:11 +00:00
|
|
|
|
2020-07-13 20:55:46 +00:00
|
|
|
loop:
|
2022-03-18 11:18:19 +00:00
|
|
|
/* NOLINTNEXTLINE(bugprone-branch-clone): intervals can be the same */
|
2022-09-28 04:33:31 +00:00
|
|
|
/* cppcheck-suppress [duplicateValueTernary, unmatchedSuppression] */
|
passt: Assorted fixes from "fresh eyes" review
A bunch of fixes not worth single commits at this stage, notably:
- make buffer, length parameter ordering consistent in ARP, DHCP,
NDP handlers
- strict checking of buffer, message and option length in DHCP
handler (a malicious client could have easily crashed it)
- set up forwarding for IPv4 and IPv6, and masquerading with nft for
IPv4, from demo script
- get rid of separate slow and fast timers, we don't save any
overhead that way
- stricter checking of buffer lengths as passed to tap handlers
- proper dequeuing from qemu socket back-end: I accidentally trashed
messages that were bundled up together in a single tap read
operation -- the length header tells us what's the size of the next
frame, but there's no apparent limit to the number of messages we
get with one single receive
- rework some bits of the TCP state machine, now passive and active
connection closes appear to be robust -- introduce a new
FIN_WAIT_1_SOCK_FIN state indicating a FIN_WAIT_1 with a FIN flag
from socket
- streamline TCP option parsing routine
- track TCP state changes to stderr (this is temporary, proper
debugging and syslogging support pending)
- observe that multiplying a number by four might very well change
its value, and this happens to be the case for the data offset
from the TCP header as we check if it's the same as the total
length to find out if it's a duplicated ACK segment
- recent estimates suggest that the duration of a millisecond is
closer to a million nanoseconds than a thousand of them, this
trend is now reflected into the timespec_diff_ms() convenience
routine
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-02-21 10:33:38 +00:00
|
|
|
nfds = epoll_wait(c.epollfd, events, EPOLL_EVENTS, TIMER_INTERVAL);
|
2020-07-20 14:27:43 +00:00
|
|
|
if (nfds == -1 && errno != EINTR) {
|
2020-07-13 20:55:46 +00:00
|
|
|
perror("epoll_wait");
|
|
|
|
exit(EXIT_FAILURE);
|
|
|
|
}
|
|
|
|
|
udp: Connection tracking for ephemeral, local ports, and related fixes
As we support UDP forwarding for packets that are sent to local
ports, we actually need some kind of connection tracking for UDP.
While at it, this commit introduces a number of vaguely related fixes
for issues observed while trying this out. In detail:
- implement an explicit, albeit minimalistic, connection tracking
for UDP, to allow usage of ephemeral ports by the guest and by
the host at the same time, by binding them dynamically as needed,
and to allow mapping address changes for packets with a loopback
address as destination
- set the guest MAC address whenever we receive a packet from tap
instead of waiting for an ARP request, and set it to broadcast on
start, otherwise DHCPv6 might not work if all DHCPv6 requests time
out before the guest starts talking IPv4
- split context IPv6 address into address we assign, global or site
address seen on tap, and link-local address seen on tap, and make
sure we use the addresses we've seen as destination (link-local
choice depends on source address). Similarly, for IPv4, split into
address we assign and address we observe, and use the address we
observe as destination
- introduce a clock_gettime() syscall right after epoll_wait() wakes
up, so that we can remove all the other ones and pass the current
timestamp to tap and socket handlers -- this is additionally needed
by UDP to time out bindings to ephemeral ports and mappings between
loopback address and a local address
- rename sock_l4_add() to sock_l4(), no semantic changes intended
- include <arpa/inet.h> in passt.c before kernel headers so that we
can use <netinet/in.h> macros to check IPv6 address types, and
remove a duplicate <linux/ip.h> inclusion
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-04-29 14:59:20 +00:00
|
|
|
clock_gettime(CLOCK_MONOTONIC, &now);
|
|
|
|
|
2020-07-13 20:55:46 +00:00
|
|
|
for (i = 0; i < nfds; i++) {
|
passt: Add PASTA mode, major rework
PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host
connectivity to an otherwise disconnected, unprivileged network
and user namespace, similarly to slirp4netns. Given that the
implementation is largely overlapping with PASST, no separate binary
is built: 'pasta' (and 'passt4netns' for clarity) both link to
'passt', and the mode of operation is selected depending on how the
binary is invoked. Usage example:
$ unshare -rUn
# echo $$
1871759
$ ./pasta 1871759 # From another terminal
# udhcpc -i pasta0 2>/dev/null
# ping -c1 pasta.pizza
PING pasta.pizza (64.190.62.111) 56(84) bytes of data.
64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms
--- pasta.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms
# ping -c1 spaghetti.pizza
PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes
64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms
--- spaghetti.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms
This entails a major rework, especially with regard to the storage of
tracked connections and to the semantics of epoll(7) references.
Indexing TCP and UDP bindings merely by socket proved to be
inflexible and unsuitable to handle different connection flows: pasta
also provides Layer-2 to Layer-2 socket mapping between init and a
separate namespace for local connections, using a pair of splice()
system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local
bindings. For instance, building on the previous example:
# ip link set dev lo up
# iperf3 -s
$ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4
[SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender
[SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver
iperf Done.
epoll(7) references now include a generic part in order to
demultiplex data to the relevant protocol handler, using 24
bits for the socket number, and an opaque portion reserved for
usage by the single protocol handlers, in order to track sockets
back to corresponding connections and bindings.
A number of fixes pertaining to TCP state machine and congestion
window handling are also included here.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-07-17 06:34:53 +00:00
|
|
|
union epoll_ref ref = *((union epoll_ref *)&events[i].data.u64);
|
2023-08-11 05:12:23 +00:00
|
|
|
uint32_t eventmask = events[i].events;
|
passt: Add PASTA mode, major rework
PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host
connectivity to an otherwise disconnected, unprivileged network
and user namespace, similarly to slirp4netns. Given that the
implementation is largely overlapping with PASST, no separate binary
is built: 'pasta' (and 'passt4netns' for clarity) both link to
'passt', and the mode of operation is selected depending on how the
binary is invoked. Usage example:
$ unshare -rUn
# echo $$
1871759
$ ./pasta 1871759 # From another terminal
# udhcpc -i pasta0 2>/dev/null
# ping -c1 pasta.pizza
PING pasta.pizza (64.190.62.111) 56(84) bytes of data.
64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms
--- pasta.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms
# ping -c1 spaghetti.pizza
PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes
64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms
--- spaghetti.pizza ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms
This entails a major rework, especially with regard to the storage of
tracked connections and to the semantics of epoll(7) references.
Indexing TCP and UDP bindings merely by socket proved to be
inflexible and unsuitable to handle different connection flows: pasta
also provides Layer-2 to Layer-2 socket mapping between init and a
separate namespace for local connections, using a pair of splice()
system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local
bindings. For instance, building on the previous example:
# ip link set dev lo up
# iperf3 -s
$ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4
[SUM] 0.00-10.00 sec 52.3 GBytes 44.9 Gbits/sec 283 sender
[SUM] 0.00-10.43 sec 52.3 GBytes 43.1 Gbits/sec receiver
iperf Done.
epoll(7) references now include a generic part in order to
demultiplex data to the relevant protocol handler, using 24
bits for the socket number, and an opaque portion reserved for
usage by the single protocol handlers, in order to track sockets
back to corresponding connections and bindings.
A number of fixes pertaining to TCP state machine and congestion
window handling are also included here.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-07-17 06:34:53 +00:00
|
|
|
|
2023-08-11 05:12:23 +00:00
|
|
|
trace("%s: epoll event on %s %i (events: 0x%08x)",
|
|
|
|
c.mode == MODE_PASST ? "passt" : "pasta",
|
|
|
|
EPOLL_TYPE_STR(ref.type), ref.fd, eventmask);
|
|
|
|
|
|
|
|
switch (ref.type) {
|
2023-08-11 05:12:29 +00:00
|
|
|
case EPOLL_TYPE_TAP_PASTA:
|
|
|
|
tap_handler_pasta(&c, eventmask, &now);
|
|
|
|
break;
|
|
|
|
case EPOLL_TYPE_TAP_PASST:
|
|
|
|
tap_handler_passt(&c, eventmask, &now);
|
2023-08-11 05:12:28 +00:00
|
|
|
break;
|
|
|
|
case EPOLL_TYPE_TAP_LISTEN:
|
|
|
|
tap_listen_handler(&c, eventmask);
|
2023-08-11 05:12:23 +00:00
|
|
|
break;
|
|
|
|
case EPOLL_TYPE_NSQUIT:
|
2023-08-11 05:12:22 +00:00
|
|
|
pasta_netns_quit_handler(&c, quit_fd);
|
2023-08-11 05:12:23 +00:00
|
|
|
break;
|
|
|
|
case EPOLL_TYPE_TCP:
|
2024-01-16 00:50:38 +00:00
|
|
|
tcp_sock_handler(&c, ref, eventmask);
|
|
|
|
break;
|
|
|
|
case EPOLL_TYPE_TCP_SPLICE:
|
|
|
|
tcp_splice_sock_handler(&c, ref, eventmask);
|
2023-08-11 05:12:27 +00:00
|
|
|
break;
|
|
|
|
case EPOLL_TYPE_TCP_LISTEN:
|
|
|
|
tcp_listen_handler(&c, ref, &now);
|
2023-08-11 05:12:23 +00:00
|
|
|
break;
|
2023-08-11 05:12:26 +00:00
|
|
|
case EPOLL_TYPE_TCP_TIMER:
|
|
|
|
tcp_timer_handler(&c, ref);
|
|
|
|
break;
|
2023-08-11 05:12:23 +00:00
|
|
|
case EPOLL_TYPE_UDP:
|
2023-08-11 05:12:25 +00:00
|
|
|
udp_sock_handler(&c, ref, eventmask, &now);
|
2023-08-11 05:12:23 +00:00
|
|
|
break;
|
|
|
|
case EPOLL_TYPE_ICMP:
|
2024-01-16 05:16:15 +00:00
|
|
|
icmp_sock_handler(&c, AF_INET, ref);
|
2023-08-11 05:12:24 +00:00
|
|
|
break;
|
2023-08-11 05:12:23 +00:00
|
|
|
case EPOLL_TYPE_ICMPV6:
|
2024-01-16 05:16:15 +00:00
|
|
|
icmp_sock_handler(&c, AF_INET6, ref);
|
2023-08-11 05:12:23 +00:00
|
|
|
break;
|
|
|
|
default:
|
|
|
|
/* Can't happen */
|
|
|
|
ASSERT(0);
|
|
|
|
}
|
2020-07-13 20:55:46 +00:00
|
|
|
}
|
|
|
|
|
2021-10-05 17:20:26 +00:00
|
|
|
post_handler(&c, &now);
|
passt: New design and implementation with native Layer 4 sockets
This is a reimplementation, partially building on the earlier draft,
that uses L4 sockets (SOCK_DGRAM, SOCK_STREAM) instead of SOCK_RAW,
providing L4-L2 translation functionality without requiring any
security capability.
Conceptually, this follows the design presented at:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Networking.md
The most significant novelty here comes from TCP and UDP translation
layers. In particular, the TCP state and translation logic follows
the intent of being minimalistic, without reimplementing a full TCP
stack in either direction, and synchronising as much as possible the
TCP dynamic and flows between guest and host kernel.
Another important introduction concerns addressing, port translation
and forwarding. The Layer 4 implementations now attempt to bind on
all unbound ports, in order to forward connections in a transparent
way.
While at it:
- the qemu 'tap' back-end can't be used as-is by qrap anymore,
because of explicit checks now introduced in qemu to ensure that
the corresponding file descriptor is actually a tap device. For
this reason, qrap now operates on a 'socket' back-end type,
accounting for and building the additional header reporting
frame length
- provide a demo script that sets up namespaces, addresses and
routes, and starts the daemon. A virtual machine started in the
network namespace, wrapped by qrap, will now directly interface
with passt and communicate using Layer 4 sockets provided by the
host kernel.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-02-16 06:25:09 +00:00
|
|
|
|
2020-07-13 20:55:46 +00:00
|
|
|
goto loop;
|
|
|
|
}
|