1
0
mirror of https://passt.top/passt synced 2024-06-29 22:42:46 +00:00
Commit Graph

31 Commits

Author SHA1 Message Date
Stefano Brivio
3d0de2c1d7 conf, tcp, udp: Exit if we fail to bind sockets for all given ports
passt supports ranges of forwarded ports as well as 'all' for TCP and
UDP, so it might be convenient to proceed if we fail to bind only
some of the desired ports.

But if we fail to bind even a single port for a given specification,
we're clearly, unexpectedly, conflicting with another network
service. In that case, report failure and exit.

Reported-by: Yalan Zhang <yalzhang@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-02-16 17:33:49 +01:00
David Gibson
54502cca7f udp: Use tap_send_frames()
To send frames on the tap interface, the UDP uses a fairly complicated two
level batching.  First multiple frames are gathered into a single "message"
for the qemu stream socket, then multiple messages are send with
sendmmsg().  We now have tap_send_frames() which already deals with sending
a number of frames, including batching and handling partial sends.  Use
that to considerably simplify things.

This does make a couple of behavioural changes:
  * We used to split messages to keep them under 32kiB (except when a
    single frame was longer than that).  The comments claim this is
    needed to stop qemu from closing the connection, but we don't have any
    equivalent logic for TCP.  I wasn't able to reproduce the problem with
    this series, although it was apparently easy to reproduce earlier.

    My suspicion is that there was never an inherent need to keep messages
    small, however with larger messages (and default kernel buffer sizes)
    the chances of needing more than one resend for partial send()s is
    greatly increased.  We used not to correctly handle that case of
    multiple resends, but now we do.

  * Previously when we got a partial send on UDP, we would resend the
    remainder of the entire "message", including multiple frames.  The
    common code now only resends the remainder of a single frame, simply
    dropping any frames which weren't even partially sent.  This is what
    TCP always did and is probably a better idea for UDP too.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-23 18:55:04 +01:00
David Gibson
8d503e825f udp: Decide whether to "splice" per datagram rather than per socket
Currently we have special sockets for receiving datagrams from locahost
which can use the optimized "splice" path rather than going across the tap
interface.

We want to loosen this so that sockets can receive sockets that will be
forwarded by both the spliced and non-spliced paths.  To do this, we alter
the meaning of the @splice bit in the reference to mean that packets
receieved on this socket *can* be spliced, not that they *will* be spliced.
They'll only actually be spliced if they come from 127.0.0.1 or ::1.

We can't (for now) remove the splice bit entirely, unlike with TCP.  Our
gateway mapping means that if the ns initiates communication to the gw
address, we'll translate that to target 127.0.0.1 on the host side.  Reply
packets will therefore have source address 127.0.0.1 when received on the
host, but these need to go via the tap path where that will be translated
back to the gateway address.  We need the @splice bit to distinguish that
case from packets going from localhost to a port mapped explicitly with
-u which should be spliced.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-01-13 01:07:09 +01:00
David Gibson
d9394eb9b7 udp: Split splice field in udp_epoll_ref into (mostly) independent bits
The @splice field in union udp_epoll_ref can have a number of values for
different types of "spliced" packet flows.  Split it into several single
bit fields with more or less independent meanings.  The new @splice field
is just a boolean indicating whether the socket is associated with a
spliced flow, making it identical to the @splice fiend in tcp_epoll_ref.

The new bit @orig, indicates whether this is a socket which can originate
new udp packet flows (created with -u or -U) or a socket created on the
fly to handle reply socket.  @ns indicates whether the socket lives in the
init namespace or the pasta namespace.

Making these bits more orthogonal to each other will simplify some future
cleanups.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-12-06 07:41:31 +01:00
David Gibson
8517239243 udp: Remove the @bound field from union udp_epoll_ref
We set this field, but nothing ever checked it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-12-06 07:41:28 +01:00
David Gibson
7c7b68dbe0 Use typing to reduce chances of IPv4 endianness errors
We recently corrected some errors handling the endianness of IPv4
addresses.  These are very easy errors to make since although we mostly
store them in network endianness, we sometimes need to manipulate them in
host endianness.

To reduce the chances of making such mistakes again, change to always using
a (struct in_addr) instead of a bare in_addr_t or uint32_t to store network
endian addresses.  This makes it harder to accidentally do arithmetic or
comparisons on such addresses as if they were host endian.

We introduce a number of IN4_IS_ADDR_*() helpers to make it easier to
directly work with struct in_addr values.  This has the additional benefit
of making the IPv4 and IPv6 paths more visually similar.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-11-04 12:04:24 +01:00
Stefano Brivio
c1eff9a3c6 conf, tcp, udp: Allow specification of interface to bind to
Since kernel version 5.7, commit c427bfec18f2 ("net: core: enable
SO_BINDTODEVICE for non-root users"), we can bind sockets to
interfaces, if they haven't been bound yet (as in bind()).

Introduce an optional interface specification for forwarded ports,
prefixed by %, that can be passed together with an address.

Reported use case: running local services that use ports we want
to have externally forwarded:
  https://github.com/containers/podman/issues/14425

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2022-10-15 02:10:36 +02:00
David Gibson
d5b80ccc72 Fix widespread off-by-one error dealing with port numbers
Port numbers (for both TCP and UDP) are 16-bit, and so fit exactly into a
'short'.  USHRT_MAX is therefore the maximum port number and this is widely
used in the code.  Unfortunately, a lot of those places don't actually
want the maximum port number (USHRT_MAX == 65535), they want the total
number of ports (65536).  This leads to a number of potentially nasty
consequences:

 * We have buffer overruns on the port_fwd::delta array if we try to use
   port 65535
 * We have similar potential overruns for the tcp_sock_* arrays
 * Interestingly udp_act had the correct size, but we can calculate it in
   a more direct manner
 * We have a logical overrun of the ports bitmap as well, although it will
   just use an unused bit in the last byte so isnt harmful
 * Many loops don't consider port 65535 (which does mitigate some but not
   all of the buffer overruns above)
 * In udp_invert_portmap() we incorrectly compute the reverse port
   translation for return packets

Correct all these by using a new NUM_PORTS defined explicitly for this
purpose.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2022-09-24 14:48:35 +02:00
David Gibson
f5a31ee94c Don't use indirect remap functions for conf_ports()
Now that we've delayed initialization of the UDP specific "reverse" map
until udp_init(), the only difference between the various 'remap' functions
used in conf_ports() is which array they target.  So, simplify by open
coding the logic into conf_ports() with a pointer to the correct mapping
array.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2022-09-24 14:48:35 +02:00
David Gibson
1467a35b5a udp: Delay initialization of UDP reversed port mapping table
Because it's connectionless, when mapping UDP ports we need, in addition
to the table of deltas for destination ports needed by TCP, we need an
inverted table to translate the source ports on return packets.

Currently we fill out the inverted table at the same time we construct the
main table in udp_remap_to_tap() and udp_remap_to_init().  However, we
don't use either table until after we've initialized UDP, so we can delay
the construction of the reverse table to udp_init().  This makes the
configuration more symmetric between TCP and UDP which will enable further
cleanups.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2022-09-24 14:48:35 +02:00
David Gibson
163dc5f188 Consolidate port forwarding configuration into a common structure
The configuration for how to forward ports in and out of the guest/ns is
divided between several different variables.  For each connect direction
and protocol we have a mode in the udp/tcp context structure, a bitmap
of which ports to forward also in the context structure and an array of
deltas to apply if the outward facing and inward facing port numbers are
different.  This last is a separate global variable, rather than being in
the context structure, for no particular reason.  UDP also requires an
additional array which has the reverse mapping used for return packets.

Consolidate these into a re-used substructure in the context structure.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2022-09-24 14:48:35 +02:00
David Gibson
1128fa03fe Improve types and names for port forwarding configuration
enum conf_port_type is local to conf.c and is used to track the port
forwarding mode during configuration.  We don't keep it around in the
context structure, however the 'init_detect_ports' and 'ns_detect_ports'
fields in the context are based solely on this.  Rather than changing
encoding, just include the forwarding mode into the context structure.
Move the type definition to a new port_fwd.h, which is kind of trivial at
the moment but will have more stuff later.

While we're there, "conf_port_type" doesn't really convey that this enum is
describing how port forwarding is configured.  Rename it to port_fwd_mode.
The variables (now fields) of this type also have mildly confusing names
since it's not immediately obvious whether 'ns' and 'init' refer to the
source or destination of the packets.  Use "in" (host to guest / init to
ns) and "out" (guest to host / ns to init) instead.

This has the added bonus that we no longer have locals 'udp_init' and
'tcp_init' which shadow global functions.

In addition, add a typedef 'port_fwd_map' for a bitmap of each port number,
which is used in several places.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2022-09-24 14:48:35 +02:00
Stefano Brivio
4a1b675278 conf, tcp, udp: Arrays for ports need 2^16 values, not 2^16-8
Reported by David but also by Coverity (CWE-119):

	In conf_ports: Out-of-bounds access to a buffer

...not in practice, because the allocation size is rounded up
anyway, but not nice either.

Reported-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-09-22 16:54:09 +02:00
Stefano Brivio
3c6ae62510 conf, tcp, udp: Allow address specification for forwarded ports
This feature is available in slirp4netns but was missing in passt and
pasta.

Given that we don't do dynamic memory allocation, we need to bind
sockets while parsing port configuration. This means we need to
process all other options first, as they might affect addressing and
IP version support. It also implies a minor rework of how TCP and UDP
implementations bind sockets.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-05-01 07:19:05 +02:00
Stefano Brivio
48582bf47f treewide: Mark constant references as const
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-03-29 15:35:38 +02:00
Stefano Brivio
bb70811183 treewide: Packet abstraction with mandatory boundary checks
Implement a packet abstraction providing boundary and size checks
based on packet descriptors: packets stored in a buffer can be queued
into a pool (without storage of its own), and data can be retrieved
referring to an index in the pool, specifying offset and length.

Checks ensure data is not read outside the boundaries of buffer and
descriptors, and that packets added to a pool are within the buffer
range with valid offset and indices.

This implies a wider rework: usage of the "queueing" part of the
abstraction mostly affects tap_handler_{passt,pasta}() functions and
their callees, while the "fetching" part affects all the guest or tap
facing implementations: TCP, UDP, ICMP, ARP, NDP, DHCP and DHCPv6
handlers.

Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2022-03-29 15:35:38 +02:00
Stefano Brivio
dd942eaa48 passt: Fix build with gcc 7, use std=c99, enable some more Clang checkers
Unions and structs, you all have names now.

Take the chance to enable bugprone-reserved-identifier,
cert-dcl37-c, and cert-dcl51-cpp checkers in clang-tidy.

Provide a ffsl() weak declaration using gcc built-in.

Start reordering includes, but that's not enough for the
llvm-include-order checker yet.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-21 04:26:08 +02:00
Stefano Brivio
12cfa6444c passt: Add clang-tidy Makefile target and test, take care of warnings
Most are just about style and form, but a few were actually
serious mistakes (NDP-related).

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-20 08:34:22 +02:00
Stefano Brivio
087b5f4dbb LICENSES: Add license text files, add missing notices, fix SPDX tags
SPDX tags don't replace license files. Some notices were missing and
some tags were not according to the SPDX specification, too.

Now reuse --lint from the REUSE tool (https://reuse.software/) passes.

Reported-by: Martin Hauke <mardnh@gmx.de>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-10-20 08:29:30 +02:00
Stefano Brivio
9657b6ed05 conf, tcp: Periodic detection of bound ports for pasta port forwarding
Detecting bound ports at start-up time isn't terribly useful: do this
periodically instead, if configured.

This is only implemented for TCP at the moment, UDP is somewhat more
complicated: leave a TODO there.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-09-27 11:23:44 +02:00
Stefano Brivio
dd581730e5 tap: Completely de-serialise input message batches
Until now, messages would be passed to protocol handlers in a single
batch only if they happened to be dequeued in a row. Packets
interleaved between different connections would result in multiple
calls to the same protocol handler for a single connection.

Instead, keep track of incoming packet descriptors, arrange them in
sequences, and call protocol handlers only as we completely sorted
input messages in batches.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-09-27 01:28:02 +02:00
Stefano Brivio
1e49d194d0 passt, pasta: Introduce command-line options and port re-mapping
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-09-01 17:00:27 +02:00
Stefano Brivio
86b273150a tcp, udp: Allow binding ports in init namespace to both tap and loopback
Traffic with loopback source address will be forwarded to the direct
loopback connection in the namespace, and the tap interface is used
for the rest.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-07-26 14:10:29 +02:00
Stefano Brivio
49631a38a6 tcp, udp: Split IPv4 and IPv6 bound port sets
Allow to bind IPv4 and IPv6 ports to tap, namespace or init separately.

Port numbers of TCP ports that are bound in a namespace are also bound
for UDP for convenience (e.g. iperf3), and IPv4 ports are always bound
if the corresponding IPv6 port is bound (socket might not have the
IPV6_V6ONLY option set). This will also be configurable later.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-07-21 17:44:39 +02:00
Stefano Brivio
64a0ba3b27 udp: Introduce recvmmsg()/sendmmsg(), zero-copy path from socket
Packets are received directly onto pre-cooked, static buffers
for IPv4 (with partial checksum pre-calculation) and IPv6 frames,
with pre-filled Ethernet addresses and, partially, IP headers,
and sent out from the same buffers with sendmmsg(), for both
passt and pasta (non-local traffic only) modes.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-07-21 12:01:04 +02:00
Stefano Brivio
33482d5bf2 passt: Add PASTA mode, major rework
PASTA (Pack A Subtle Tap Abstraction) provides quasi-native host
connectivity to an otherwise disconnected, unprivileged network
and user namespace, similarly to slirp4netns. Given that the
implementation is largely overlapping with PASST, no separate binary
is built: 'pasta' (and 'passt4netns' for clarity) both link to
'passt', and the mode of operation is selected depending on how the
binary is invoked. Usage example:

	$ unshare -rUn
	# echo $$
	1871759

	$ ./pasta 1871759	# From another terminal

	# udhcpc -i pasta0 2>/dev/null
	# ping -c1 pasta.pizza
	PING pasta.pizza (64.190.62.111) 56(84) bytes of data.
	64 bytes from 64.190.62.111 (64.190.62.111): icmp_seq=1 ttl=255 time=34.6 ms

	--- pasta.pizza ping statistics ---
	1 packets transmitted, 1 received, 0% packet loss, time 0ms
	rtt min/avg/max/mdev = 34.575/34.575/34.575/0.000 ms
	# ping -c1 spaghetti.pizza
	PING spaghetti.pizza(2606:4700:3034::6815:147a (2606:4700:3034::6815:147a)) 56 data bytes
	64 bytes from 2606:4700:3034::6815:147a (2606:4700:3034::6815:147a): icmp_seq=1 ttl=255 time=29.0 ms

	--- spaghetti.pizza ping statistics ---
	1 packets transmitted, 1 received, 0% packet loss, time 0ms
	rtt min/avg/max/mdev = 28.967/28.967/28.967/0.000 ms

This entails a major rework, especially with regard to the storage of
tracked connections and to the semantics of epoll(7) references.

Indexing TCP and UDP bindings merely by socket proved to be
inflexible and unsuitable to handle different connection flows: pasta
also provides Layer-2 to Layer-2 socket mapping between init and a
separate namespace for local connections, using a pair of splice()
system calls for TCP, and a recvmmsg()/sendmmsg() pair for UDP local
bindings. For instance, building on the previous example:

	# ip link set dev lo up
	# iperf3 -s

	$ iperf3 -c ::1 -Z -w 32M -l 1024k -P2 | tail -n4
	[SUM]   0.00-10.00  sec  52.3 GBytes  44.9 Gbits/sec  283             sender
	[SUM]   0.00-10.43  sec  52.3 GBytes  43.1 Gbits/sec                  receiver

	iperf Done.

epoll(7) references now include a generic part in order to
demultiplex data to the relevant protocol handler, using 24
bits for the socket number, and an opaque portion reserved for
usage by the single protocol handlers, in order to track sockets
back to corresponding connections and bindings.

A number of fixes pertaining to TCP state machine and congestion
window handling are also included here.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-07-17 11:04:22 +02:00
Stefano Brivio
e07f539ae0 udp, passt: Introduce socket packet buffer, avoid getsockname() for UDP
This is in preparation for scatter-gather IO on the UDP receive path:
save a getsockname() syscall by setting a flag if we get the numbering
of all bound sockets in a strict sequence (expected, in practice) and
repurpose the tap buffer to be also a socket receive buffer, passing
it down to protocol handlers.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-04-30 14:52:18 +02:00
Stefano Brivio
605af213c5 udp: Connection tracking for ephemeral, local ports, and related fixes
As we support UDP forwarding for packets that are sent to local
ports, we actually need some kind of connection tracking for UDP.
While at it, this commit introduces a number of vaguely related fixes
for issues observed while trying this out. In detail:

- implement an explicit, albeit minimalistic, connection tracking
  for UDP, to allow usage of ephemeral ports by the guest and by
  the host at the same time, by binding them dynamically as needed,
  and to allow mapping address changes for packets with a loopback
  address as destination

- set the guest MAC address whenever we receive a packet from tap
  instead of waiting for an ARP request, and set it to broadcast on
  start, otherwise DHCPv6 might not work if all DHCPv6 requests time
  out before the guest starts talking IPv4

- split context IPv6 address into address we assign, global or site
  address seen on tap, and link-local address seen on tap, and make
  sure we use the addresses we've seen as destination (link-local
  choice depends on source address). Similarly, for IPv4, split into
  address we assign and address we observe, and use the address we
  observe as destination

- introduce a clock_gettime() syscall right after epoll_wait() wakes
  up, so that we can remove all the other ones and pass the current
  timestamp to tap and socket handlers -- this is additionally needed
  by UDP to time out bindings to ephemeral ports and mappings between
  loopback address and a local address

- rename sock_l4_add() to sock_l4(), no semantic changes intended

- include <arpa/inet.h> in passt.c before kernel headers so that we
  can use <netinet/in.h> macros to check IPv6 address types, and
  remove a duplicate <linux/ip.h> inclusion

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-04-29 17:15:26 +02:00
Stefano Brivio
38b50dba47 passt: Spare some syscalls, add some optimisations from profiling
Avoid a bunch of syscalls on forwarding paths by:

- storing minimum and maximum file descriptor numbers for each
  protocol, fall back to SO_PROTOCOL query only on overlaps

- allocating a larger receive buffer -- this can result in more
  coalesced packets than sendmmsg() can take (UIO_MAXIOV, i.e. 1024),
  so make sure we don't exceed that within a single call to protocol
  tap handlers

- nesting the handling loop in tap_handler() in the receive loop,
  so that we have better chances of filling our receive buffer in
  fewer calls

- skipping the recvfrom() in the UDP handler on EPOLLERR -- there's
  nothing to be done in that case

and while at it:

- restore the 20ms timer interval for periodic (TCP) events, I
  accidentally changed that to 100ms in an earlier commit

- attempt using SO_ZEROCOPY for UDP -- if it's not available,
  sendmmsg() will succeed anyway

- fix the handling of the status code from sendmmsg(), if it fails,
  we'll try to discard the first message, hence return 1 from the
  UDP handler

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-04-23 22:22:37 +02:00
Stefano Brivio
1f7cf04d34 passt: Introduce packet batching mechanism
Receive packets in batches from AF_UNIX, check if they can be sent
with a single syscall, and batch them up with sendmmsg() in case.

A bit rudimentary, currently only implemented for UDP, but it seems
to work.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-04-22 13:39:36 +02:00
Stefano Brivio
105b916361 passt: New design and implementation with native Layer 4 sockets
This is a reimplementation, partially building on the earlier draft,
that uses L4 sockets (SOCK_DGRAM, SOCK_STREAM) instead of SOCK_RAW,
providing L4-L2 translation functionality without requiring any
security capability.

Conceptually, this follows the design presented at:
	https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Networking.md

The most significant novelty here comes from TCP and UDP translation
layers. In particular, the TCP state and translation logic follows
the intent of being minimalistic, without reimplementing a full TCP
stack in either direction, and synchronising as much as possible the
TCP dynamic and flows between guest and host kernel.

Another important introduction concerns addressing, port translation
and forwarding. The Layer 4 implementations now attempt to bind on
all unbound ports, in order to forward connections in a transparent
way.

While at it:
- the qemu 'tap' back-end can't be used as-is by qrap anymore,
  because of explicit checks now introduced in qemu to ensure that
  the corresponding file descriptor is actually a tap device. For
  this reason, qrap now operates on a 'socket' back-end type,
  accounting for and building the additional header reporting
  frame length

- provide a demo script that sets up namespaces, addresses and
  routes, and starts the daemon. A virtual machine started in the
  network namespace, wrapped by qrap, will now directly interface
  with passt and communicate using Layer 4 sockets provided by the
  host kernel.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2021-02-16 09:28:55 +01:00