1
0
mirror of https://passt.top/passt synced 2024-12-22 13:45:32 +00:00
Commit Graph

107 Commits

Author SHA1 Message Date
Laurent Vivier
28997fcb29 vhost-user: add vhost-user
add virtio and vhost-user functions to connect with QEMU.

  $ ./passt --vhost-user

and

  # qemu-system-x86_64 ... -m 4G \
        -object memory-backend-memfd,id=memfd0,share=on,size=4G \
        -numa node,memdev=memfd0 \
        -chardev socket,id=chr0,path=/tmp/passt_1.socket \
        -netdev vhost-user,id=netdev0,chardev=chr0 \
        -device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \
        ...

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: as suggested by lvivier, include <netinet/if_ether.h>
 before including <linux/if_ether.h> as C libraries such as musl
 __UAPI_DEF_ETHHDR in <netinet/if_ether.h> if they already have
 a definition of struct ethhdr]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-27 16:47:32 +01:00
Stefano Brivio
14b84a7f07 treewide: Introduce 'local mode' for disconnected setups
There are setups where no host interface is available or configured
at all, intentionally or not, temporarily or not, but users expect
(Podman) containers to run in any case as they did with slirp4netns,
and we're now getting reports that we broke such setups at a rather
alarming rate.

To this end, if we don't find any usable host interface, instead of
exiting:

- for IPv4, use 169.254.2.1 as guest/container address and 169.254.2.2
  as default gateway

- for IPv6, don't assign any address (forcibly disable DHCPv6), and
  use the *first* link-local address we observe to represent the
  guest/container. Advertise fe80::1 as default gateway

- use 'tap0' as default interface name for pasta

Change ifi4 and ifi6 in struct ctx to int and accept a special -1
value meaning that no host interface was selected, but the IP family
is enabled. The fact that the kernel uses unsigned int values for
those is not an issue as 1. one can't create so many interfaces
anyway and 2. we otherwise handle those values transparently.

Fix a botched conditional in conf_print() to actually skip printing
DHCPv6 information if DHCPv6 is disabled (and skip printing NDP
information if NDP is disabled).

Link: https://github.com/containers/podman/issues/24614
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-11-27 05:16:38 +01:00
David Gibson
b4dace8f46 fwd: Direct inbound spliced forwards to the guest's external address
In pasta mode, where addressing permits we "splice" connections, forwarding
directly from host socket to guest/container socket without any L2 or L3
processing.  This gives us a very large performance improvement when it's
possible.

Since the traffic is from a local socket within the guest, it will go over
the guest's 'lo' interface, and accordingly we set the guest side address
to be the loopback address.  However this has a surprising side effect:
sometimes guests will run services that are only supposed to be used within
the guest and are therefore bound to only 127.0.0.1 and/or ::1.  pasta's
forwarding exposes those services to the host, which isn't generally what
we want.

Correct this by instead forwarding inbound "splice" flows to the guest's
external address.

Link: https://github.com/containers/podman/issues/24045
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-18 20:28:03 +02:00
David Gibson
9d66df9a9a conf: Add command line switch to enable IP_FREEBIND socket option
In a couple of recent reports, we've seen that it can be useful for pasta
to forward ports from addresses which are not currently configured on the
host, but might be in future.  That can be done with the sysctl
net.ipv4.ip_nonlocal_bind, but that does require CAP_NET_ADMIN to set in
the first place.  We can allow the same thing on a per-socket basis with
the IP_FREEBIND (or IPV6_FREEBIND) socket option.

Add a --freebind command line argument to enable this socket option on
all listening sockets.

Link: https://bugs.passt.top/show_bug.cgi?id=101
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-10-04 19:04:29 +02:00
David Gibson
57b7bd2a48 fwd, conf: Allow NAT of the guest's assigned address
The guest is usually assigned one of the host's IP addresses.  That means
it can't access the host itself via its usual address.  The
--map-host-loopback option (enabled by default with the gateway address)
allows the guest to contact the host.  However, connections forwarded this
way appear on the host to have originated from the loopback interface,
which isn't always desirable.

Add a new --map-guest-addr option, which acts similarly but forwarded
connections will go to the host's external address, instead of loopback.

If '-a' is used, so the guest's address is not the same as the host's, this
will instead forward to whatever host-visible site is shadowed by the
guest's assigned address.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:40 +02:00
David Gibson
935bd81936 conf, fwd: Split notion of gateway/router from guest-visible host address
The @gw fields in the ip4_ctx and ip6_ctx give the (host's) default
gateway.  We use this for two quite distinct things: advertising the
gateway that the guest should use (via DHCP, NDP and/or --config-net)
and for a limited form of NAT.  So that the guest can access services
on the host, we map the gateway address within the guest to the
loopback address on the host.

Using the gateway address for this isn't necessarily the best choice
for this purpose, certainly not for all circumstances.  So, start off
by splitting the notion of these into two different values: @guest_gw
which is the gateway address the guest should use and @nat_host_loopback,
which is the guest visible address to remap to the host's loopback.

Usually nat_host_loopback will have the same value as guest_gw.  However
when --no-map-gw is specified we leave them unspecified instead.  This
means when we use nat_host_loopback, we don't need to separately check
c->no_map_gw to see if it's relevant.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:31 +02:00
David Gibson
90e83d50a9 Don't take "our" MAC address from the host
When sending frames to the guest over the tap link, we need a source MAC
address.  Currently we take that from the MAC address of the main interface
on the host, but that doesn't actually make much sense:
 * We can't preserve the real MAC address of packets from anywhere
   external so there's no transparency case here
 * In fact, it's confusingly different from how we handle IP addresses:
   whereas we give the guest the same IP as the host, we're making the
   host's MAC the one MAC that the guest *can't* use for itself.
 * We already need a fallback case if the host doesn't have an Ethernet
   like MAC (e.g. if it's connected via a point to point interface, such
   as a wireguard VPN).

Change to just just use an arbitrary fixed MAC address - I've picked
9a:55:9a:55:9a:55.  It's simpler and has the small advantage of making
the fact that passt/pasta is in use typically obvious from guest side
packet dumps.  This can still, of course, be overridden with the -M option.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:28 +02:00
David Gibson
356de97e43 fwd: Split notion of "our tap address" from gateway for IPv4
ip4.gw conflates 3 conceptually different things, which (for now) have the
same value:
  1. The router/gateway address as seen by the guest
  2. An address to NAT to the host with --no-map-gw isn't specified
  3. An address to use as source when nothing else makes sense

Case 3 occurs in two situations:

a) for our DHCP responses - since they come from passt internally there's
   no naturally meaningful address for them to come from
b) for forwarded connections coming from an address that isn't guest
   accessible (localhost or the guest's own address).

(b) occurs even with --no-map-gw, and the expected behaviour of forwarding
local connections requires it.

For IPv6 role (3) is now taken by ip6.our_tap_ll (which usually has the
same value as ip6.gw).  For future flexibility we may want to make this
"address of last resort" different from the gateway address, so split them
logically for IPv4 as well.

Specifically, add a new ip4.our_tap_addr field for the address with this
role, and initialise it to ip4.gw for now.  Unlike IPv6 where we can always
get a link-local address, we might not be able to get a (non 0.0.0.0)
address here (e.g. if the host is disconnected or only has a point to point
link with no gateway address).  In that case we have to disable forwarding
of inbound connections with guest-inaccessible source addresses.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:26 +02:00
David Gibson
8d4baa4446 Clarify which addresses in ip[46]_ctx are meaningful where
Some are guest visible addresses and may not be valid on the host, others
are host visible addresses and may not be valid on the guest.  Rearrange
and comment the ip[46]_ctx definitions to make it clearer which is which.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:19 +02:00
David Gibson
a42fb9c000 treewide: Change misleading 'addr_ll' name
c->ip6.addr_ll is not like c->ip6.addr.  The latter is an address for the
guest, but the former is an address for our use on the tap link.  Rename it
accordingly, to 'our_tap_ll'.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 12:00:16 +02:00
David Gibson
905ecd2b0b treewide: Rename MAC address fields for clarity
c->mac isn't a great name, because it doesn't say whose mac address it is
and it's not necessarily obvious in all the contexts we use it.  Since this
is specifically the address that we (passt/pasta) use on the tap interface,
rename it to "our_tap_mac".  Rename the "mac_guest" field to "guest_mac"
to be grammatically consistent.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-21 11:59:54 +02:00
David Gibson
86bdd968ea Correct inaccurate comments on ip[46]_ctx::addr
These fields are described as being an address for an external, routable
interface.  That's not necessarily the case when using -a.  But, more
importantly, saying where the value comes from is not as useful as what
it's used for.  The real purpose of this field is as the address which we
assign to the guest via DHCP or --config-net.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-08-12 21:29:21 +02:00
Stefano Brivio
fbb0c9523e conf, pasta: Make -g and -a skip route/addresses copy for matching IP version only
Paul reports that setting IPv4 address and gateway manually, using
--address and --gateway, causes pasta to fail inserting IPv6 routes
in a setup where multiple, inter-dependent IPv6 routes are present
on the host.

That's because, currently, any -g option implies --no-copy-routes
altogether, and any -a implies --no-copy-addrs.

Limit this implication to the matching IP version, instead, by having
two copies of no_copy_routes and no_copy_addrs in the context
structure, separately for IPv4 and IPv6.

While at it, change them to 'bool': we had them as 'int' because
getopt_long() used to set them directly, but it hasn't been the case
for a while already.

Reported-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-08-07 09:15:25 +02:00
David Gibson
57a21d2df1 tap: Improve handling of partially received frames on qemu socket
Because the Unix socket to qemu is a stream socket, we have no guarantee
of where the boundaries between recv() calls will lie.  Typically they
will lie on frame boundaries, because that's how qemu will send then, but
we can't rely on it.

Currently we handle this case by detecting when we have received a partial
frame and performing a blocking recv() to get the remainder, and only then
processing the frames. Change it so instead we save the partial frame
persistently and include it as the first thing processed next time we
receive data from the socket.  This handles a number of (unlikely) cases
which previously would not be dealt with correctly:

* If qemu sent a partial frame then waited some time before sending the
  remainder, previously we could block here for an unacceptably long time
* If qemu sent a tiny partial frame (< 4 bytes) we'd leave the loop without
  doing the partial frame handling, which would put us out of sync with
  the stream from qemu
* If a the blocking recv() only received some of the remainder of the
  frame, not all of it, we'd return leaving us out of sync with the
  stream again

Caveat: This could memmove() a moderate amount of data (ETH_MAX_MTU).  This
is probably acceptable because it's an unlikely case in practice.  If
necessary we could mitigate this by using a true ring buffer.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-26 14:07:42 +02:00
David Gibson
882599e180 udp: Rename UDP listening sockets
EPOLL_TYPE_UDP is now only used for "listening" sockets; long lived
sockets which can initiate new flows.  Rename to EPOLL_TYPE_UDP_LISTEN
and associated functions to match.  Along with that, remove the .orig
field from union udp_listen_epoll_ref, since it is now always true.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:34:01 +02:00
David Gibson
8012f5ff55 flow: Common address information for initiating side
Handling of each protocol needs some degree of tracking of the
addresses and ports at the end of each connection or flow.  Sometimes
that's explicit (as in the guest visible addresses for TCP
connections), sometimes implicit (the bound and connected addresses of
sockets).

To allow more consistent handling across protocols we want to
uniformly track the address and port at each end of the connection.
Furthermore, because we allow port remapping, and we sometimes need to
apply NAT, the addresses and ports can be different as seen by the
guest/namespace and as by the host.

Introduce 'struct flowside' to keep track of address and port
information related to one side of a flow. Store two of these in the
common fields of a flow to track that information for both sides.

For now we only populate the initiating side, requiring that
information be completed when a flows enter INI.  Later patches will
populate the target side.

For now this leaves some information redundantly recorded in both generic
and type specific fields.  We'll fix that in later patches.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-19 18:32:32 +02:00
David Gibson
74c1c5efcf util: sock_l4() determine protocol from epoll type rather than the reverse
sock_l4() creates a socket of the given IP protocol number, and adds it to
the epoll state.  Currently it determines the correct tag for the epoll
data based on the protocol.  However, we have some future cases where we
might want different semantics, and therefore epoll types, for sockets of
the same protocol.  So, change sock_l4() to take the epoll type as an
explicit parameter, and determine the protocol from that.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-07-05 15:26:09 +02:00
Stefano Brivio
bca0fefa32 conf, passt: Make --stderr do nothing, and deprecate it
The original behaviour of printing messages to standard error by
default when running from a non-interactive terminal was introduced
because the first KubeVirt integration draft used to start passt in
foreground and get messages via standard error.

For development purposes, the system logger was more convenient at
that point, and passt was running from interactive terminals only if
not started by the KubeVirt integration.

This behaviour was introduced by 84a62b79a2 ("passt: Also log to
stderr, don't fork to background if not interactive").

Later, I added command-line options in 1e49d194d0 ("passt, pasta:
Introduce command-line options and port re-mapping") and accidentally
reversed this condition, which wasn't a problem as --stderr could
force printing to standard error anyway (and it was used by KubeVirt).

Nowadays, the KubeVirt integration uses a log file (requested via
libvirt configuration), and the same applies for Podman if one
actually needs to look at runtime logs. There are no use cases left,
as far as I know, where passt runs in foreground in non-interactive
terminals.

Seize the chance to reintroduce some sanity here. If we fork to
background, standard error is closed, so --stderr is useless in that
case.

If we run in foreground, there's no harm in printing messages to
standard error, and that accidentally became the default behaviour
anyway, so --stderr is not needed in that case.

It would be needed for non-interactive terminals, but there are no
use cases, and if there were, let's log to standard error anyway:
the user can always redirect standard error to /dev/null if needed.

Before we're up and running, we need to print to standard error anyway
if something happens, otherwise we can't report failure to start in
any kind of usage, stand-alone or in integrations.

So, make --stderr do nothing, and deprecate it.

While at it, drop a left-over comment about --foreground being the
default only for interactive terminals, because it's not the case
anymore.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2024-06-21 15:32:28 +02:00
Stefano Brivio
0608ec42f2 conf, passt.h: Rename pid_file in struct ctx to pidfile
We have pidfile_fd now, pid_file_fd would be quite ugly.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
2024-05-23 16:44:14 +02:00
Stefano Brivio
c9b2413465 conf, passt, tap: Open socket and PID files before switching UID/GID
Otherwise, if the user runs us as root, and gives us paths that are
only accessible by root, we'll fail to open them, which might in turn
encourage users to change permissions or ownerships: definitely a bad
idea in terms of security.

Reported-by: Minxi Hou <mhou@redhat.com>
Reported-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Richard W.M. Jones <rjones@redhat.com>
2024-05-23 16:43:26 +02:00
David Gibson
1095a7b0c9 treewide: Remove misleading and redundant endianness notes
In general, it's much less error-prone to have the endianness of values
implied by the type, rather than just noting it in comments.  We can't
always easily avoid it, because C, but we can do so when possible.  struct
in_addr and in6_addr are always encoded network endian, so noting it
explicitly isn't useful.  Remove them.

In some cases we also have endianness notes on uint8_t parameters, which
doesn't make sense: for a single byte endianness is irrelevant.  Remove
those too.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:16 +02:00
David Gibson
5d37dab012 tap: Remove unused structs tap_msg, tap_l4_msg
Use of these structures was removed in bb70811183 ("treewide: Packet
abstraction with mandatory boundary checks").  Remove the stale
declarations.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-05-02 16:13:13 +02:00
David Gibson
4779dfe12f icmp: Use 'flowside' epoll references for ping sockets
Currently ping sockets use a custom epoll reference type which includes
the ICMP id.  However, now that we have entries in the flow table for
ping flows, finding that is sufficient to get everything else we want,
including the id.  Therefore remove the icmp_epoll_ref type and use the
general 'flowside' field for ping sockets.

Having done this we no longer need separate EPOLL_TYPE_ICMP and
EPOLL_TYPE_ICMPV6 reference types, because we can easily determine
which case we have from the flow type. Merge both types into
EPOLL_TYPE_PING.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-03-12 01:49:05 +01:00
David Gibson
3b9098aa49 fwd: Rename port_fwd.[ch] and their contents
Currently port_fwd.[ch] contains helpers related to port forwarding,
particular automatic port forwarding.  We're planning to allow much more
flexible sorts of forwarding, including both port translation and NAT based
on the flow table.  This will subsume the existing port forwarding logic,
so rename port_fwd.[ch] to fwd.[ch] with matching updates to all the names
within.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-29 09:48:27 +01:00
Stefano Brivio
8f3f8e190c pasta: Add fallback timer mechanism to check if namespace is gone
We don't know how frequently this happens, but hitting
fs.inotify.max_user_watches or similar sysctl limits is definitely
not out of question, and Paul mentioned that, for example, Podman's
CI environments hit similar issues in the past.

Introduce a fallback mechanism based on a timer file descriptor: we
grab the directory handle at startup, and we can then use openat(),
triggered periodically, to check if the (network) namespace directory
still exists. If openat() fails at some point, exit.

Link: https://github.com/containers/podman/pull/21563#issuecomment-1943505707
Reported-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-02-16 08:47:14 +01:00
David Gibson
fb7c00169d flow: Move flow_count from context structure to a global
In general, the passt code is a bit haphazard about what's a true global
variable and what's in the quasi-global 'context structure'.  The
flow_count field is one such example: it's in the context structure,
although it's really part of the same data structure as flowtab[], which
is a genuine global.

Move flow_count to be a regular global to match.  For now it needs to be
public, rather than static, but we expect to be able to change that in
future.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:35:29 +01:00
David Gibson
02e092b4fe tcp, tcp_splice: Avoid double layered dispatch for connected TCP sockets
Currently connected TCP sockets have the same epoll type, whether they're
for a "tap" connection or a spliced connection.  This means that
tcp_sock_handler() has to do a secondary check on the type of the
connection to call the right function.  We can avoid this by adding a new
epoll type and dispatching directly to the right thing.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:35:25 +01:00
David Gibson
70121ca1ec epoll: Better handling of number of epoll types
As we already did for flow types, use an "EPOLL_NUM_TYPES" isntead of
EPOLL_TYPE_MAX, which is a little bit safer and clearer.  Add a static
assert on the size of the matching names array.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2024-01-22 23:35:22 +01:00
David Gibson
e21b6d69b1 tcp: "TCP" hash secret doesn't need to be TCP specific
The TCP state structure includes a 128-bit hash_secret which we use for
SipHash calculations to mitigate attacks on the TCP hash table and initial
sequence number.

We have plans to use SipHash in places that aren't TCP related, and there's
no particular reason they'd need their own secret.  So move the hash_secret
to the general context structure.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-04 09:51:32 +01:00
David Gibson
705549f834 flow,tcp: Use epoll_ref type including flow and side
Currently TCP uses the 'flow' epoll_ref field for both connected
sockets and timers, which consists of just the index of the relevant
flow (connection).

This is just fine for timers, for while it obviously works, it's
subtly incomplete for sockets on spliced connections.  In that case we
want to know which side of the connection the event is occurring on as
well as which connection.  At present, we deduce that information by
looking at the actual fd, and comparing it to the fds of the sockets
on each side.

When we use the flow table for more things, we expect more cases where
something will need to know a specific side of a specific flow for an
event, but nothing more.

Therefore add a new 'flowside' epoll_ref field, with exactly that
information.  We use it for TCP connected sockets.  This allows us to
directly know the side for spliced connections.  For "tap"
connections, it's pretty meaningless, since the side is always the
socket side.  It still makes logical sense though, and it may become
important for future flow table work.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-04 09:51:24 +01:00
David Gibson
ecea8d36ff flow,tcp: Generalise TCP epoll_ref to generic flows
TCP uses three different epoll object types: one for connected sockets, one
for timers and one for listening sockets.  Listening sockets really need
information that's specific to TCP, so need their own epoll_ref field.
Timers and connected sockets, however, only need the connection (flow)
they're associated with.  As we expand the use of the flow table, we expect
that to be true for more epoll fds.  So, rename the "TCP" epoll_ref field
to be a "flow" epoll_ref field that can be used both for TCP and for other
future cases.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-04 09:51:20 +01:00
David Gibson
9d44aba7e0 util: MAX_FROM_BITS() should be unsigned
MAX_FROM_BITS() computes the maximum value representable in a number of
bits.  The expression for that is an unsigned value, but we explicitly cast
it to a signed int.  It looks like this is because one of the main users is
for FD_REF_MAX, which is used to bound fd values, typically stored as a
signed int.

The value MAX_FROM_BITS() is calculating is naturally non-negative, though,
so it makes more sense for it to be unsigned, and to move the case to the
definition of FD_REF_MAX.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-04 09:51:06 +01:00
David Gibson
f08ce92a13 flow, tcp: Move TCP connection table to unified flow table
We want to generalise "connection" tracking to things other than true TCP
connections.  Continue implenenting this by renaming the TCP connection
table to the "flow table" and moving it to flow.c.  The definitions are
split between flow.h and flow_table.h - we need this separation to avoid
circular dependencies: the definitions in flow.h will be needed by many
headers using the flow mechanism, but flow_table.h needs all those protocol
specific headers in order to define the full flow table entry.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-12-04 09:51:02 +01:00
David Gibson
732e249376 pif: Record originating pif in listening socket refs
For certain socket types, we record in the epoll ref whether they're
sockets in the namespace, or on the host.  We now have the notion of "pif"
to indicate what "place" a socket is associated with, so generalise the
simple one-bit 'ns' to a pif id.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:53:41 +01:00
David Gibson
dcf5c0eb1e port_fwd: Move port scanning /proc fds into struct port_fwd
Currently we store /proc/net fds used to implement automatic port
forwarding in the proc_net_{tcp,udp} fields of the main context structure.
However, in fact each of those is associated with a particular direction
of forwarding, and we already have struct port_fwd which collects all
other information related to a particular direction of port forwarding.

We can simplify things a bit by moving the /proc fds into struct port_fwd.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-11-07 09:53:29 +01:00
David Gibson
955dd3251c tcp, udp: Don't pre-fill IPv4 destination address in headers
Because packets sent on the tap interface will always be going to the
guest/namespace, we more-or-less know what address they'll be going to.  So
we pre-fill this destination address in our header buffers for IPv4.  We
can't do the same for IPv6 because we could need either the global or
link-local address for the guest.  In future we're going to want more
flexibility for the destination address, so this pre-filling will get in
the way.

Change the flow so we always fill in the IPv4 destination address for each
packet, rather than prefilling it from proto_update_l2_buf().  In fact for
TCP we already redundantly filled the destination for each packet anyway.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-22 12:15:33 +02:00
David Gibson
ae5f6c8e1b epoll: Use different epoll types for passt and pasta tap fds
Currently we have a single epoll event type for the "tap" fd, which could
be either a handle on a /dev/net/tun device (pasta) or a connected Unix
socket (passt).  However for the two modes we call different handler
functions.  Simplify this a little by using different epoll types and
dispatching directly to the correct handler function.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-13 17:30:20 +02:00
David Gibson
eda4f1997e epoll: Split listening Unix domain socket into its own type
tap_handler() actually handles events on three different types of object:
the /dev/tap character device (pasta), a connected Unix domain socket
(passt) or a listening Unix domain socket (passt).

The last, in particular, really has no handling in common with the others,
so split it into its own epoll type and directly dispatch to the relevant
handler from the top level.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-13 17:30:17 +02:00
David Gibson
485b5fb8f9 epoll: Split handling of listening TCP sockets into their own handler
tcp_sock_handler() handles both listening TCP sockets, and connected TCP
sockets, but what it needs to do in those cases has essentially nothing in
common.  Therefore, give listening sockets their own epoll_type value and
dispatch directly to their own handler from the top level.  Furthermore,
the two handlers need essentially entirely different information from the
reference: we re-(ab)used the index field in the tcp_epoll_ref to indicate
the port for the listening socket, but that's not the same meaning.  So,
switch listening sockets to their own reference type which we can lay out
as we please.  That lets us remove the listen and outbound fields from the
normal (connected) tcp_epoll_ref, reducing it to just the connection table
index.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-13 17:30:15 +02:00
David Gibson
e6f81e5578 epoll: Split handling of TCP timerfds into its own handler function
tcp_sock_handler() actually handles several different types of fd events.
This includes timerfds that aren't sockets at all.  The handling of these
has essentially nothing in common with the other cases.  So, give the
TCP timers there own epoll_type value and dispatch directly to their
handler.  This also means we can remove the timer field from tcp_epoll_ref,
the information it encoded is now implicit in the epoll_type value.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-13 17:30:13 +02:00
David Gibson
6a6735ece4 epoll: Always use epoll_ref for the epoll data variable
epoll_ref contains a variety of information useful when handling epoll
events on our sockets, and we place it in the epoll_event data field
returned by epoll.  However, for a few other things we use the 'fd' field
in the standard union of types for that data field.

This actually introduces a bug which is vanishingly unlikely to hit in
practice, but very nasty if it ever did: theoretically if we had a very
large file descriptor number for fd_tap or fd_tap_listen it could overflow
into bits that overlap with the 'proto' field in epoll_ref.  With some
very bad luck this could mean that we mistakenly think an event on a
regular socket is an event on fd_tap or fd_tap_listen.

More practically, using different (but overlapping) fields of the
epoll_data means we can't unify dispatch for the various different objects
in the epoll.  Therefore use the same epoll_ref as the data for the tap
fds and the netns quit fd, adding new fd type values to describe them.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-13 17:29:53 +02:00
David Gibson
3401644453 epoll: Generalize epoll_ref to cover things other than sockets
The epoll_ref type includes fields for the IP protocol of a socket, and the
socket fd.  However, we already have a few things in the epoll which aren't
protocol sockets, and we may have more in future.  Rename these fields to
an abstract "fd type" and file descriptor for more generality.

Similarly, rather than using existing IP protocol numbers for the type,
introduce our own number space.  For now these just correspond to the
supported protocols, but we'll expand on that in future.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-13 17:29:51 +02:00
David Gibson
b15ce5b6ce Use static assertion to verify that union epoll_ref is the right size
union epoll_ref is used to subdivide the 64-bit data field in struct
epoll_event.  Thus it *must* fit within that field or we're likely to get
very subtle and nasty bugs.  C11 introduces the notion of static assertions
which we can use to verify this is the case at compile time.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-04 01:18:00 +02:00
David Gibson
8218d99013 Use C11 anonymous members to make poll refs less verbose to use
union epoll_ref has a deeply nested set of structs and unions to let us
subdivide it into the various different fields we want.  This means that
referencing elements can involve an awkward long string of intermediate
fields.

Using C11 anonymous structs and unions lets us do this less clumsily.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-08-04 01:17:57 +02:00
Stefano Brivio
9f61c5b68b passt.h: Fix description of pasta_ifi in struct ctx
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-05-23 16:13:28 +02:00
Stefano Brivio
cc9d16758b conf, pasta: With --config-net, copy all addresses by default
Use the newly-introduced NL_DUP mode for nl_addr() to copy all the
addresses associated to the template interface in the outer
namespace, unless --no-copy-addrs (also implied by -a) is given.

This option is introduced as deprecated right away: it's not expected
to be of any use, but it's helpful to keep it around for a while to
debug any suspected issue with this change.

This is done mostly for consistency with routes. It might partially
cover the issue at:
  https://bugs.passt.top/show_bug.cgi?id=47
  Support multiple addresses per address family

for some use cases, but not the originally intended one: we'll still
use a single outbound address (unless the routing table specifies
different preferred source addresses depending on the destination),
regardless of the address used in the target namespace.

Link: https://bugs.passt.top/show_bug.cgi?id=47
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-05-23 16:13:28 +02:00
Stefano Brivio
da54641f14 conf, pasta: With --config-net, copy all routes by default
Use the newly-introduced NL_DUP mode for nl_route() to copy all the
routes associated to the template interface in the outer namespace,
unless --no-copy-routes (also implied by -g) is given.

This option is introduced as deprecated right away: it's not expected
to be of any use, but it's helpful to keep it around for a while to
debug any suspected issue with this change.

Otherwise, we can't use default gateways which are not, address-wise,
on the same subnet as the container, as reported by Callum.

Reported-by: Callum Parsey <callum@neoninteger.au>
Link: https://github.com/containers/podman/issues/18539
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-05-23 16:13:28 +02:00
Stefano Brivio
ca2749e1bd passt: Relicense to GPL 2.0, or any later version
In practical terms, passt doesn't benefit from the additional
protection offered by the AGPL over the GPL, because it's not
suitable to be executed over a computer network.

Further, restricting the distribution under the version 3 of the GPL
wouldn't provide any practical advantage either, as long as the passt
codebase is concerned, and might cause unnecessary compatibility
dilemmas.

Change licensing terms to the GNU General Public License Version 2,
or any later version, with written permission from all current and
past contributors, namely: myself, David Gibson, Laine Stump, Andrea
Bolognani, Paul Holzinger, Richard W.M. Jones, Chris Kuhn, Florian
Weimer, Giuseppe Scrivano, Stefan Hajnoczi, and Vasiliy Ulyanov.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
2023-04-06 18:00:33 +02:00
Chris Kuhn
5c58feab7b conf, passt: Rename stderr to force_stderr
While building against musl, gcc informs us that 'stderr' is a
protected keyword. This probably comes from a #define stderr (stderr)
in musl's stdio.h, to avoid a clash with extern FILE *const stderr,
but I didn't really track it down. Just rename it to force_stderr, it
makes more sense.

[sbrivio: Added commit message]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-03-09 03:44:21 +01:00
Stefano Brivio
a9c59dd91b conf, icmp, tcp, udp: Add options to bind to outbound address and interface
I didn't notice earlier: libslirp (and slirp4netns) supports binding
outbound sockets to specific IPv4 and IPv6 addresses, to force the
source addresse selection. If we want to claim feature parity, we
should implement that as well.

Further, Podman supports specifying outbound interfaces as well, but
this is simply done by resolving the primary address for an interface
when the network back-end is started. However, since kernel version
5.7, commit c427bfec18f2 ("net: core: enable SO_BINDTODEVICE for
non-root users"), we can actually bind to a specific interface name,
which doesn't need to be validated in advance.

Implement -o / --outbound ADDR to bind to IPv4 and IPv6 addresses,
and --outbound-if4 and --outbound-if6 to bind IPv4 and IPv6 sockets
to given interfaces.

Given that it probably makes little sense to select addresses and
routes from interfaces different than the ones given for outbound
sockets, also assign those as "template" interfaces, by default,
unless explicitly overridden by '-i'.

For ICMP and UDP, we call sock_l4() to open outbound sockets, as we
already needed to bind to given ports or echo identifiers, and we
can bind() a socket only once: there, pass address (if any) and
interface (if any) for the existing bind() and setsockopt() calls.

For TCP, in general, we wouldn't otherwise bind sockets. Add a
specific helper to do that.

For UDP outbound sockets, we need to know if the final destination
of the socket is a loopback address, before we decide whether it
makes sense to bind the socket at all: move the block mangling the
address destination before the creation of the socket in the IPv4
path. This was already the case for the IPv6 path.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
2023-03-09 03:43:59 +01:00