With glibc commit 25a5eb4010df ("string: strerror, strsignal cannot
use buffer after dlmopen (bug 32026)"), strerror() now needs, at least
on x86, the getrandom() and brk() system calls, in order to fill in
the locale-translated error message. But getrandom() and brk() are not
allowed by our seccomp profiles.
This became visible on Fedora Rawhide with the "podman login and
logout" Podman tests, defined at test/e2e/login_logout_test.go in the
Podman source tree, where pasta would terminate upon printing error
descriptions (at least the ones related to the SO_ERROR queue for
spliced connections).
Avoid dynamic memory allocation by calling strerrordesc_np() instead,
which is a GNU function returning a static, untranslated version of
the error description. If it's not available, keep calling strerror(),
which at that point should be simple enough as to be usable (at least,
that's currently the case for musl).
Reported-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/issues/24804
Analysed-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: Paul Holzinger <pholzing@redhat.com>
We usually want to checksum only the tail part of a frame, excluding at
least some headers. csum_iov() does that for a frame represented as an
IO vector, not actually summing the entire IO vector. We now have struct
iov_tail to explicitly represent this construct, so replace csum_iov()
with csum_iov_tail() taking that representation rather than 3 parameters.
We propagate the same change to csum_udp4() and csum_udp6() which take
similar parameters. This slightly simplifies the code, and will allow some
further simplifications as struct iov_tail is more widely used.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
add virtio and vhost-user functions to connect with QEMU.
$ ./passt --vhost-user
and
# qemu-system-x86_64 ... -m 4G \
-object memory-backend-memfd,id=memfd0,share=on,size=4G \
-numa node,memdev=memfd0 \
-chardev socket,id=chr0,path=/tmp/passt_1.socket \
-netdev vhost-user,id=netdev0,chardev=chr0 \
-device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \
...
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: as suggested by lvivier, include <netinet/if_ether.h>
before including <linux/if_ether.h> as C libraries such as musl
__UAPI_DEF_ETHHDR in <netinet/if_ether.h> if they already have
a definition of struct ethhdr]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We have an ASSERT() verifying that we're able to look up the flow in
udp_reply_sock_handler(). However, we dereference uflow before that in
an initializer, rather defeating the point. Rearrange to avoid that.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
As for tcp_update_check_tcp4()/tcp_update_check_tcp6(),
change csum_udp4() and csum_udp6() to use an iovec array.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcp_sock_init() and udp_sock_init() take an address to bind to as an
address family and void * pair. Use an inany instead. Formerly AF_UNSPEC
was used to indicate that we want to listen on both 0.0.0.0 and ::, now use
a NULL inany to indicate that.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The sock_l4() function is very convenient for creating sockets bound to
a given address, but its interface has some problems.
Most importantly, the address and port alone aren't enough in some cases.
For link-local addresses (at least) we also need the pif in order to
properly construct a socket adddress. This case doesn't yet arise, but
it might cause us trouble in future.
Additionally, sock_l4() can take AF_UNSPEC with the special meaning that it
should attempt to create a "dual stack" socket which will respond to both
IPv4 and IPv6 traffic. This only makes sense if there is no specific
address given. We verify this at runtime, but it would be nicer if we
could enforce it structurally.
For sockets associated specifically with a single flow we already replaced
sock_l4() with flowside_sock_l4() which avoids those problems. Now,
replace all the remaining users with a new pif_sock_l4() which also takes
an explicit pif.
The new function takes the address as an inany *, with NULL indicating the
dual stack case. This does add some complexity in some of the callers,
however future planned cleanups should make this go away again.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
To save some kernel memory we try to use "dual stack" sockets (that listen
to both IPv4 and IPv6 traffic) when possible. However udp_sock_init()
attempts to do this in some cases that can't work. Specifically we can
only do this when listening on any address. That's never true for the
ns (splicing) case, because we always listen on loopback. For the !ns
case and AF_UNSPEC case, addr should always be NULL, but add an assert to
verify.
This is harmless: if addr is non-NULL, sock_l4() will just fail and we'll
fall back to the other path. But, it's messy and makes some upcoming
changes harder, so avoid attempting this in cases we know can't work.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
We can need not to set the UDP checksum. Add a parameter to
udp_update_hdr4() and udp_update_hdr6() to disable it.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
cppcheck-2.15.0 has apparently broadened when it throws a warning about
redundant initialization to include some cases where we have an initializer
for some fields, but then set other fields in the function body.
This is arguably a false positive: although we are technically overwriting
the zero-initialization the compiler supplies for fields not explicitly
initialized, this sort of construct makes sense when there are some fields
we know at the top of the function where the initializer is, but others
that require more complex calculation.
That said, in the two places this shows up, it's pretty easy to work
around. The results are arguably slightly clearer than what we had, since
they move the parts of the initialization closer together.
So do that rather than having ugly suppressions or dealing with the
tedious process of reporting a cppcheck false positive.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
udp_sock_errs() reads out everything in the socket error queue. However
we've seen some cases[0] where an EPOLLERR event is active, but there isn't
anything in the queue.
One possibility is that the error is reported instead by the SO_ERROR
sockopt. Check for that case and report it as best we can. If we still
get an EPOLLERR without visible error, we have no way to clear the error
state, so treat it as an unrecoverable error.
[0] https://github.com/containers/podman/issues/23686#issuecomment-2324945010
Link: https://bugs.passt.top/show_bug.cgi?id=95
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We can get network errors, usually transient, reported via the socket error
queue. However, at least theoretically, we could get errors trying to
read the queue itself. Since we have no idea how to clear an error
condition in that case, treat it as unrecoverable.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently udp_sock_recv() both attempts to clear socket errors and read
a batch of datagrams for forwarding. That made sense initially, since
both listening and reply sockets need to do this. However, we have certain
error cases which will add additional complexity to the error processing.
Furthermore, if we ever wanted to more thoroughly handle errors received
here - e.g. by synthesising ICMP messages on the tap device - it will
likely require different handling for the listening and reply socket cases.
So, split handling of error events into its own udp_sock_errs() function.
While we're there, allow it to report "unrecoverable errors". We don't
have any of these so far, but some cases we're working on might require it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Platforms like Linux allow IPv6 sockets to listen for IPv4 connections as
well as native IPv6 connections. By doing this we halve the number of
listening sockets we need (assuming passt/pasta is listening on the same
ports for IPv4 and IPv6). When forwarding many ports (e.g. -u all) this
can significantly reduce the amount of kernel memory that passt consumes.
We've used such dual stack sockets for TCP since 8e914238b "tcp: Use dual
stack sockets for port forwarding when possible". Add similar support for
UDP "listening" sockets. Since UDP sockets don't use as much kernel memory
as TCP sockets this isn't as big a saving, but it's still significant.
When forwarding all TCP and UDP ports for both IPv4 & IPv6 (-t all -u all),
this reduces kernel memory usage from ~522 MiB to ~380MiB (kernel version
6.10.6 on Fedora 40, x86_64).
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The 's' variable is always redundant with either 'r4' or 'r6', so remove
it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We've already gotten rid of most of the IPv4/IPv6 specific data structures
in udp.c by merging them with each other. One significant one remains:
udp[46]_mh_recv. This was a bit awkward to remove because of a subtle
interaction. We initialise the msg_namelen fields to represent the total
size we have for a socket address, but when we receive into the arrays
those are modified to the actual length of the sockaddr we received.
That meant that naively merging the arrays meant that if we received IPv4
datagrams, then IPv6 datagrams, the addresses for the latter would be
truncated. In this patch address that by resetting the received
msg_namelen as soon as we've found a flow for the datagram. Finding the
flow is the only thing that might use the actual sockaddr length, although
we in fact don't need it for the time being.
This also removes the last use of the 'v6' field from udp_listen_epoll_ref,
so remove that as well.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
These system calls are needed after the conversion of time_t to 64-bit
types on 32-bit architectures.
Tested by running some transfer tests with passt and pasta on Debian
Bookworm (glibc 2.36) and Trixie (glibc 2.39), running on armv6l.
Suggested-by: Faidon Liambotis <paravoid@debian.org>
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078981
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
I haven't tested i386 for a long time (after playing with some
openSUSE i586 image a couple of years ago). It turns out that a number
of system calls we actually need were denied by the seccomp filter,
and not even basic functionality works.
Add some system calls that glibc started using with the 64-bit time
("t64") transition, see also:
https://wiki.debian.org/ReleaseGoals/64bit-time
that is: clock_gettime64, timerfd_gettime64, fcntl64, and
recvmmsg_time64.
Add further system calls that are needed regardless of time_t width,
that is, mmap2 (valgrind profile only), _llseek and sigreturn (common
outside x86_64), and socketcall (same as s390x).
I validated this against an almost full run of the test suite, with
just a few selected tests skipped. Fixes needed to run most tests on
i386/i686, and other assorted fixes for tests, are included in
upcoming patches.
Reported-by: Uroš Knupleš <uros@knuples.net>
Analysed-by: Faidon Liambotis <paravoid@debian.org>
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1078981
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
The term "forwarding address" to indicate the local-to-passt address was
well-intentioned, but ends up being kinda confusing. As discussed on a
recent call, let's try "our" instead.
(While we're there correct an error in flow_initiate_af()s comments where
we referred to parameters by the wrong name).
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
No code change.
They need to be exported to be available by the vhost-user version of
passt.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
To be used with the vhost-user version of udp.c, we need to export the
udp_flow functions. To avoid to export udp_meta_t too that is specific
to the socket version of udp.c, don't pass udp_meta_t to it,
but the only needed field, s_in.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
EPOLL_TYPE_UDP is now only used for "listening" sockets; long lived
sockets which can initiate new flows. Rename to EPOLL_TYPE_UDP_LISTEN
and associated functions to match. Along with that, remove the .orig
field from union udp_listen_epoll_ref, since it is now always true.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In addition to the struct fwd_ports used by both UDP and TCP to track
port forwarding, UDP also included an 'rdelta' field, which contained the
reverse mapping of the main port map. This was used so that we could
properly direct reply packets to a forwarded packet where we change the
destination port. This has now been taken over by the flow table: reply
packets will match the flow of the originating packet, and that gives the
correct ports on the originating side.
So, eliminate the rdelta field, and with it struct udp_fwd_ports, which
now has no additional information over struct fwd_ports.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Now that UDP datagrams are all directed via the flow table, we no longer
use the udp_tap_map[ or udp_act[] arrays. Remove them and connected
code.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This replaces the last piece of existing UDP port tracking with the
common flow table. Specifically use the flow table to direct datagrams
from host sockets to the guest tap interface. Since this now requires
a flow for every datagram, we add some logging if we encounter any
datagrams for which we can't find or create a flow.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we create flows for datagrams from socket interfaces, and use
them to direct "spliced" (socket to socket) datagrams. We don't yet
match datagrams from the tap interface to existing flows, nor create new
flows for them. Add that functionality, matching datagrams from tap to
existing flows when they exist, or creating new ones.
As with spliced flows, when creating a new flow from tap to socket, we
create a new connected socket to receive reply datagrams attached to that
flow specifically. We extend udp_flow_sock_handler() to handle reply
packets bound for tap rather than another socket.
For non-obvious reasons (perhaps increased stack usage?), this caused
a failure for me when running under valgrind, because valgrind invoked
rt_sigreturn which is not in our seccomp filter. Since we already
allow rt_sigaction and others in the valgrind target, it seems
reasonable to add rt_sigreturn as well.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Now that spliced datagrams are managed via the flow table, remove
UDP_ACT_SPLICE_NS and UDP_ACT_SPLICE_INIT which are no longer used. With
those removed, the 'ts' field in udp_splice_port is also no longer used.
struct udp_splice_port now contains just a socket fd, so replace it with
a plain int in udp_splice_ns[] and udp_splice_init[]. The latter are still
used for tracking of automatic port forwarding.
Finally, the 'splice' field of union udp_epoll_ref is no longer used so
remove it as well.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When forwarding a datagram to a socket, we need to find a socket with a
suitable local address to send it. Currently we keep track of such sockets
in an array indexed by local port, but this can't properly handle cases
where we have multiple local addresses in active use.
For "spliced" (socket to socket) cases, improve this by instead opening
a socket specifically for the target side of the flow. We connect() as
well as bind()ing that socket, so that it will only receive the flow's
reply packets, not anything else. We direct datagrams sent via that socket
using the addresses from the flow table, effectively replacing bespoke
addressing logic with the unified logic in fwd.c
When we create the flow, we also take a duplicate of the originating
socket, and use that to deliver reply datagrams back to the origin, again
using addresses from the flow table entry.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This implements the first steps of tracking UDP packets with the flow table
rather than its own (buggy) set of port maps. Specifically we create flow
table entries for datagrams received from a socket (PIF_HOST or
PIF_SPLICE).
When splitting datagrams from sockets into batches, we group by the flow
as well as splicesrc. This may result in smaller batches, but makes things
easier down the line. We can re-optimise this later if necessary. For now
we don't do anything else with the flow, not even match reply packets to
the same flow.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we ignore all events other than EPOLLIN on UDP sockets. This
means that if we ever receive an EPOLLERR event, we'll enter an infinite
loop on epoll, because we'll never do anything to clear the error.
Luckily that doesn't seem to have happened in practice, but it's certainly
fragile. Furthermore changes in how we handle UDP sockets with the flow
table mean we will start receiving error events.
Add handling of EPOLLERR events. For now we just read the error from the
error queue (thereby clearing the error state) and print a debug message.
We can add more substantial handling of specific events in future if we
want to.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We abort the UDP socket handler if the no_udp flag is set. But if UDP
was disabled we should never have had a UDP socket to trigger the handler
in the first place. If we somehow did, ignoring it here isn't really going
to help because aborting without doing anything is likely to lead to an
epoll loop. The same is the case for the TCP socket and timer handlers and
the no_tcp flag.
Change these checks on the flag to ASSERT()s. Similarly add ASSERT()s to
several other entry points to the protocol specific code which should never
be called if the protocol is disabled.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Through an oversight this was previously declared as a public function
although it's only used in udp.c and there is no prototype in any header.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When we receive datagrams on a socket, we need to split them into batches
depending on how they need to be forwarded (either via a specific splice
socket, or via tap). The logic to do this, is somewhat awkwardly split
between udp_buf_sock_handler() itself, udp_splice_send() and
udp_tap_send().
Move all the batching logic into udp_buf_sock_handler(), leaving
udp_splice_send() to just send the prepared batch. udp_tap_send() reduces
to just a call to tap_send_frames() so open-code that call in
udp_buf_sock_handler().
This will allow separating the batching logic from the rest of the datagram
forwarding logic, which we'll need for upcoming flow table support.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
udp_buf_sock_handler(), udp_splice_send() and udp_tap_send loosely, do four
things between them:
1. Receive some datagrams from a socket
2. Split those datagrams into batches depending on how they need to be
sent (via tap or via a specific splice socket)
3. Prepare buffers for each datagram to send it onwards
4. Actually send it onwards
Split (1) and (3) into specific helper functions. This isn't
immediately useful (udp_splice_prepare(), in particular, is trivial),
but it will make further reworks clearer.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Since we split our packet frame buffers into different pieces, we have
a single buffer per IP version for the ethernet header, rather than one
per frame. This makes sense since our ethernet header is alwaus the same.
However we initialise those buffers udp[46]_eth_hdr inside a per frame
loop. Pull that outside the loop so we just initialise them once.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The only differences between these arrays are that udp4_l2_iov is
pre-initialised to point to the IPv4 ethernet header, and IPv4 per-frame
header and udp6_l2_iov points to the IPv6 versions.
We already have to set up a bunch of headers per-frame, including updating
udp[46]_l2_iov[i][UDP_IOV_PAYLOAD].iov_len. It makes more sense to adjust
the IOV entries to point at the correct headers for the frame than to have
two complete sets of iovecs.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We have separate mmsghdr arrays for splicing IPv4 and IPv6 packets, where
the only difference is that they point to different sockaddr buffers for
the destination address.
Unify these by having the common array point at a sockaddr_inany as the
address. This does mean slightly more work when we're about to splice,
because we need to write the whole socket address, rather than just the
port. However it removes 32 mmsghdr structures and we're going to need
more flexibility constructing that target address for the flow table.
Because future changes might mean that the address isn't always loopback,
change the name of the common address from *_localname to udp_splicename.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Make the salient points about these various arrays clearer with renames:
* udp_l2_iov_sock and udp[46]_l2_mh_sock don't really have anything to do
with L2. They are, however, specific to receiving not sending. Rename
to udp_iov_recv and udp[46]_mh_recv.
* udp[46]_l2_iov_tap is redundant - "tap" implies L2 and vice versa.
Rename to udp[46]_l2_iov
* udp[46]_localname are (for now) pre-populated with the local address but
the more salient point is that these are the destination address for the
splice arrays. Rename to udp[46]_splice_to
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
udp_buf_sock_handler() takes the epoll reference from the receiving socket,
and passes the UDP relevant part on to several other functions. Future
changes are going to need several different epoll types for UDP, and to
pass that information through to some of those functions. To avoid extra
noise in the patches making the real changes, change those functions now
to take the full epoll reference, rather than just the UDP part.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
sock_l4() creates a socket of the given IP protocol number, and adds it to
the epoll state. Currently it determines the correct tag for the epoll
data based on the protocol. However, we have some future cases where we
might want different semantics, and therefore epoll types, for sockets of
the same protocol. So, change sock_l4() to take the epoll type as an
explicit parameter, and determine the protocol from that.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
cppcheck 2.14 warns that the scope of the rport variable could be
reduced: do that, as reverted commit c80fa6a6bb ("udp: Make rport
calculation more local") did, but keep the temporary variable of
in_port_t type, otherwise the sum gets promoted to int.
While at it, add a comment explaining why we calculate rport like
this instead of directly using the sum as array index.
Reported-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
This reverts commit c80fa6a6bb, as it
reintroduces the issue fixed by commit 1e6f92b995 ("udp: Fix 16-bit
overflow in udp_invert_portmap()").
Reported-by: Laurent Jacquot <jk@lutty.net>
Link: https://bugs.passt.top/show_bug.cgi?id=80
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Mostly, udp_sock_handler() is independent of how the datagrams it processes
will be forwarded (tap or splice). However, it also updates the msg_name
fields for spliced sends, which doesn't really make sense here. Move it
into udp_splice_send() which is all about spliced sends. This does
potentially mean we'll update the field to the same value several times,
but we're going to need this in future anyway: with the extensions the
flow table allows, it might not be the same value each time after all.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
udp_sock_handler() takes a number of datagrams from sockets that depending
on their addresses could be forwarded either to the L2 interface ("tap")
or to another socket ("spliced"). In the latter case we can also only
send packets together if they have the same source port, and therefore
are sent via the same socket.
To reduce the total number of system calls we gather contiguous batches of
datagrams with the same destination interface and socket where applicable.
The determination of what the target is is made by udp_mmh_splice_port().
It returns the source port for splice packets and -1 for "tap" packets.
We find batches by looking ahead in our queue until we find a datagram
whose "splicefrom" port doesn't match the first in our current batch.
udp_mmh_splice_port() is moderately expensive, and unfortunately we
can call it twice on the same datagram: once as the (last + 1) entry
in one batch (to check it's not in that batch), then again as the
first entry in the next batch.
Avoid this by keeping track of the "splice port" in the metadata structure,
and filling it in one entry ahead of the one we're currently considering.
This is a bit subtle, but not that hard. It will also generalise better
when we have more complex possibilities based on the flow table.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
udp_mmh_splice_port() is used to determine if a UDP datagram can be
"spliced" (forwarded via a socket instead of tap). We only invoke it if
the origin socket has the 'splice' flag set.
Fold the checking of the flag into the helper itself, which makes the
caller simpler. It does mean we have a loop looking for a batch of
spliceable or non-spliceable packets even in the case where the flag is
clear. This shouldn't be that expensive though, since each call to
udp_mmh_splice_port() will return without accessing memory in that case.
In any case we're going to need a similar loop in more cases with upcoming
flow table work.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
As we are going to introduce the MODE_VU that will act like
the mode MODE_PASST, compare to MODE_PASTA rather than to add
a comparison to MODE_VU when we check for MODE_PASST.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We are going to introduce a variant of the function to use
vhost-user buffers rather than passt internal buffers.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This commit refactors the udp_update_hdr4() and udp_update_hdr6() functions
to improve code portability by replacing the udp_meta_t parameter with
more specific parameters for the IPv4 and IPv6 headers (iphdr/ipv6hdr)
and the source socket address (sockaddr_in/sockaddr_in6).
It also moves the tap_hdr_update() function call inside the udp_tap_send()
function not to have to pass the TAP header to udp_update_hdr4() and
udp_update_hdr6()
This refactor reduces complexity by making the functions more modular and
ensuring that each function operates on more narrowly scoped data structures.
This will facilitate future backend introduction like vhost-user.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>