We usually want to checksum only the tail part of a frame, excluding at
least some headers. csum_iov() does that for a frame represented as an
IO vector, not actually summing the entire IO vector. We now have struct
iov_tail to explicitly represent this construct, so replace csum_iov()
with csum_iov_tail() taking that representation rather than 3 parameters.
We propagate the same change to csum_udp4() and csum_udp6() which take
similar parameters. This slightly simplifies the code, and will allow some
further simplifications as struct iov_tail is more widely used.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
add virtio and vhost-user functions to connect with QEMU.
$ ./passt --vhost-user
and
# qemu-system-x86_64 ... -m 4G \
-object memory-backend-memfd,id=memfd0,share=on,size=4G \
-numa node,memdev=memfd0 \
-chardev socket,id=chr0,path=/tmp/passt_1.socket \
-netdev vhost-user,id=netdev0,chardev=chr0 \
-device virtio-net,mac=9a:2b:2c:2d:2e:2f,netdev=netdev0 \
...
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: as suggested by lvivier, include <netinet/if_ether.h>
before including <linux/if_ether.h> as C libraries such as musl
__UAPI_DEF_ETHHDR in <netinet/if_ether.h> if they already have
a definition of struct ethhdr]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
There are setups where no host interface is available or configured
at all, intentionally or not, temporarily or not, but users expect
(Podman) containers to run in any case as they did with slirp4netns,
and we're now getting reports that we broke such setups at a rather
alarming rate.
To this end, if we don't find any usable host interface, instead of
exiting:
- for IPv4, use 169.254.2.1 as guest/container address and 169.254.2.2
as default gateway
- for IPv6, don't assign any address (forcibly disable DHCPv6), and
use the *first* link-local address we observe to represent the
guest/container. Advertise fe80::1 as default gateway
- use 'tap0' as default interface name for pasta
Change ifi4 and ifi6 in struct ctx to int and accept a special -1
value meaning that no host interface was selected, but the IP family
is enabled. The fact that the kernel uses unsigned int values for
those is not an issue as 1. one can't create so many interfaces
anyway and 2. we otherwise handle those values transparently.
Fix a botched conditional in conf_print() to actually skip printing
DHCPv6 information if DHCPv6 is disabled (and skip printing NDP
information if NDP is disabled).
Link: https://github.com/containers/podman/issues/24614
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
I have no idea why, but these are reported by clang-tidy (19.2.1) on
Alpine (x86) only:
/home/sbrivio/passt/tap.c:1139:38: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
1139 | int fd = socket(AF_UNIX, SOCK_STREAM, 0);
| ^
| | SOCK_CLOEXEC
/home/sbrivio/passt/tap.c:1158:51: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
1158 | ex = socket(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK, 0);
| ^
| | SOCK_CLOEXEC
/home/sbrivio/passt/tcp.c:1413:44: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
1413 | s = socket(af, SOCK_STREAM | SOCK_NONBLOCK, IPPROTO_TCP);
| ^
| | SOCK_CLOEXEC
/home/sbrivio/passt/util.c:188:38: error: 'socket' should use SOCK_CLOEXEC where possible [android-cloexec-socket,-warnings-as-errors]
188 | if ((s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP)) < 0) {
| ^
| | SOCK_CLOEXEC
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Given that we're comparing against 'n', which is signed, we cast
TAP_BUF_BYTES to ssize_t so that the maximum buffer usage, calculated
as the difference between TAP_BUF_BYTES and ETH_MAX_MTU, will also be
signed.
This doesn't necessarily happen on 32-bit architectures, though. On
armhf and i686, clang-tidy 18.1.8 and 19.1.2 report:
/home/pi/passt/tap.c:1087:16: error: comparison of integers of different signs: 'ssize_t' (aka 'int') and 'unsigned int' [clang-diagnostic-sign-compare,-warnings-as-errors]
1087 | for (n = 0; n <= (ssize_t)TAP_BUF_BYTES - ETH_MAX_MTU; n += len) {
| ~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cast the whole difference to ssize_t, as we know it's going to be
positive anyway, instead of relying on that side effect.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
On ppc64le, TUNSETIFF happens to be 2147767498, which is bigger than
INT_MAX (2^31 - 1), and musl declares the second argument of ioctl()
as 'int', not 'unsigned long' like glibc does, probably because of how
POSIX specifies the equivalent argument, int dcmd, in posix_devctl(),
so gcc reports a warning:
tap.c: In function 'tap_ns_tun':
tap.c:1291:24: warning: overflow in conversion from 'long unsigned int' to 'int' changes value from '2147767498' to '-2147199798' [-Woverflow]
1291 | rc = ioctl(fd, TUNSETIFF, &ifr);
| ^~~~~~~~~
We don't care about that overflow, so explicitly cast TUNSETIFF to
int.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
clang-tidy, starting from LLVM version 16, up to at least LLVM version
19, now checks that we detect and handle errors for snprintf() as
requested by CERT C rule ERR33-C. These warnings were logged with LLVM
version 19.1.2 (at least Debian and Fedora match):
/home/sbrivio/passt/arch.c:43:3: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
43 | snprintf(new_path, PATH_MAX + sizeof(".avx2"), "%s.avx2", exe);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/arch.c:43:3: note: cast the expression to void to silence this warning
/home/sbrivio/passt/conf.c:577:4: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
577 | snprintf(netns, PATH_MAX, "/proc/%ld/ns/net", pidval);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/conf.c:577:4: note: cast the expression to void to silence this warning
/home/sbrivio/passt/conf.c:579:5: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
579 | snprintf(userns, PATH_MAX, "/proc/%ld/ns/user",
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
580 | pidval);
| ~~~~~~~
/home/sbrivio/passt/conf.c:579:5: note: cast the expression to void to silence this warning
/home/sbrivio/passt/pasta.c:105:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
105 | snprintf(ns, PATH_MAX, "/proc/%i/ns/net", pasta_child_pid);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/pasta.c:105:2: note: cast the expression to void to silence this warning
/home/sbrivio/passt/pasta.c:242:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
242 | snprintf(uidmap, BUFSIZ, "0 %u 1", uid);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/pasta.c:242:2: note: cast the expression to void to silence this warning
/home/sbrivio/passt/pasta.c:243:2: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
243 | snprintf(gidmap, BUFSIZ, "0 %u 1", gid);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/pasta.c:243:2: note: cast the expression to void to silence this warning
/home/sbrivio/passt/tap.c:1155:4: error: the value returned by this function should not be disregarded; neglecting it may lead to errors [cert-err33-c,-warnings-as-errors]
1155 | snprintf(path, UNIX_PATH_MAX - 1, UNIX_SOCK_PATH, i);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/sbrivio/passt/tap.c:1155:4: note: cast the expression to void to silence this warning
Don't silence the warnings as they might actually have some merit. Add
an snprintf_check() function, instead, checking that we're not
truncating messages while printing to buffers, and terminate if the
check fails.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
As for tcp_update_check_tcp4()/tcp_update_check_tcp6(),
change csum_udp4() and csum_udp6() to use an iovec array.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tap_pasta_input() keeps reading frames from the tap device until the
buffer is full. However, this has an ugly edge case, when we get close
to buffer full, we will provide just the remaining space as a read()
buffer. If this is shorter than the next frame to read, the tap device
will truncate the frame and discard the remainder.
Adjust the code to make sure we always have room for a maximum size frame.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tap_pasta_input() has a rather confusing structure, using two gotos.
Remove these by restructuring the function to have the main loop condition
based on filling our buffer space, with errors or running out of data
treated as the exception, rather than the other way around. This allows
us to handle the EINTR which triggered the 'restart' goto with a continue.
The outer 'redo' was triggered if we completely filled our buffer, to flush
it and do another pass. This one is unnecessary since we don't (yet) use
EPOLLET on the tap device: if there's still more data we'll get another
event and re-enter the loop.
Along the way handle a couple of extra edge cases:
- Check for EWOULDBLOCK as well as EAGAIN for the benefit of any future
ports where those might not have the same value
- Detect EOF on the tap device and exit in that case
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When tap_passt_input() gets an error from recv() it (correctly) does not
print any error message for EINTR, EAGAIN or EWOULDBLOCK. However in all
three cases it returns from the function. That makes sense for EAGAIN and
EWOULDBLOCK, since we then want to wait for the next EPOLLIN event before
trying again. For EINTR, however, it makes more sense to retry immediately
- as it stands we're likely to get a renewer EPOLLIN event immediately in
that case, since we're using level triggered signalling.
So, handle EINTR separately by immediately retrying until we succeed or
get a different type of error.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently, tap_handler_pas{st,ta}() check for EPOLLRDHUP, EPOLLHUP and
EPOLLERR events, then assume anything left is EPOLLIN. We have some future
cases that may want to also handle EPOLLOUT, so in preparation explicitly
handle EPOLLIN, moving the logic to a subfunction.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
c->mac isn't a great name, because it doesn't say whose mac address it is
and it's not necessarily obvious in all the contexts we use it. Since this
is specifically the address that we (passt/pasta) use on the tap interface,
rename it to "our_tap_mac". Rename the "mac_guest" field to "guest_mac"
to be grammatically consistent.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
- Add structs for NA, RA, NS, MTU, prefix info, option header,
link-layer address, RDNSS, DNSSL and link-layer for RA message.
- Turn NA message from purely imperative, going byte by byte,
to declarative by filling it's struct.
- Turn part of RA message into declarative.
- Move packet_add() to be before the call of ndp() in tap6_handler()
if the protocol of the packet is ICMPv6.
- Add a pool of packets as an additional parameter to ndp().
- Check the size of NS packet with packet_get() before sending an NA
packet.
- Add documentation for the structs.
- Add an enum for NDP option types.
Link: https://bugs.passt.top/show_bug.cgi?id=21
Signed-off-by: AbdAlRahman Gad <abdobngad@gmail.com>
[sbrivio: Minor coding style fixes]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Because the Unix socket to qemu is a stream socket, we have no guarantee
of where the boundaries between recv() calls will lie. Typically they
will lie on frame boundaries, because that's how qemu will send then, but
we can't rely on it.
Currently we handle this case by detecting when we have received a partial
frame and performing a blocking recv() to get the remainder, and only then
processing the frames. Change it so instead we save the partial frame
persistently and include it as the first thing processed next time we
receive data from the socket. This handles a number of (unlikely) cases
which previously would not be dealt with correctly:
* If qemu sent a partial frame then waited some time before sending the
remainder, previously we could block here for an unacceptably long time
* If qemu sent a tiny partial frame (< 4 bytes) we'd leave the loop without
doing the partial frame handling, which would put us out of sync with
the stream from qemu
* If a the blocking recv() only received some of the remainder of the
frame, not all of it, we'd return leaving us out of sync with the
stream again
Caveat: This could memmove() a moderate amount of data (ETH_MAX_MTU). This
is probably acceptable because it's an unlikely case in practice. If
necessary we could mitigate this by using a true ring buffer.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The Qemu socket protocol consists of a 32-bit frame length in network (BE)
order, followed by the Ethernet frame itself. As far as I can tell,
frames can be any length, with no particular alignment requirement. This
means that although pkt_buf itself is aligned, if we have a frame of odd
length, frames after it will have their frame length at an unaligned
address.
Currently we load the frame length by just casting a char pointer to
(uint32_t *) and loading. Some platforms will generate a fatal trap on
such an unaligned load. Even if they don't casting an incorrectly aligned
pointer to (uint32_t *) is undefined behaviour, strictly speaking.
Introduce a new helper to safely load a possibly unaligned value here. We
assume that the compiler is smart enough to optimize this into nothing on
platforms that provide performant unaligned loads. If that turns out not
to be the case, we can look at improvements then.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we set EPOLLET (edge trigger) on the epoll flags for the
connected Qemu Unix socket. It's not clear that there's a reason for
doing this: for TCP sockets we need to use EPOLLET, because we leave data
in the socket buffers for our flow control handling. That consideration
doesn't apply to the way we handle the qemu socket however.
Furthermore, using EPOLLET causes additional complications:
1) We don't set EPOLLET when opening /dev/net/tun for pasta mode, however
we *do* set it when using pasta mode with --fd. This inconsistency
doesn't seem to have broken anything, but it's odd.
2) EPOLLET requires that tap_handler_passt() loop until all data available
is read (otherwise we may have data in the buffer but never get an event
causing us to read it). We do that with a rather ugly goto.
Worse, our condition for that goto appears to be incorrect. We'll only
loop if rem is non-zero, which will only happen if we perform a blocking
recv() for a partially received frame. We'll only perform that second
recv() if the original recv() resulted in a partially read frame. As
far as I can tell the original recv() could end on a frame boundary
(never triggering the second recv()) even if there is additional data in
the socket buffer. In that circumstance we wouldn't goto redo and could
leave unprocessed frames in the qemu socket buffer indefinitely.
This doesn't seem to have caused any problems in practice, but since
there's no obvious reason to use EPOLLET here anyway, we might as well
get rid of it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
If we receive a too-short or too-long frame from the QEMU socket, currently
we try to skip it and carry on. That sounds sensible on first blush, but
probably isn't wise in practice. If this happens, either (a) qemu has done
something seriously unexpected, or (b) we've received corrupt data over a
Unix socket. Or more likely (c), we have a bug elswhere which has put us
out of sync with the stream, so we're trying to read something that's not a
frame length as a frame length.
Neither (b) nor (c) is really salvageable with the same stream. Case (a)
might be ok, but we can no longer be confident qemu won't do something else
we can't cope with.
So, instead of just skipping the frame and trying to carry on, log an error
and close the socket. As a bonus, establishing firm bounds on l2len early
will allow simplifications to how we deal with the case where a partial
frame is recv()ed.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Change error message: it's not necessarily QEMU, and mention
that we are resetting the connection]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
If we get an error on recv() from the QEMU socket, we currently don't
print any kind of error. Although this can happen in a non-fatal situation
such as a guest restarting, it's unusual enough that we realy should report
something for debugability.
Add an error message in this case. Also always report when the qemu
connection closes for any reason, not just when it will cause us to exit
(--one-off).
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Change error message: it's not necessarily QEMU, and mention
that we are resetting the connection]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In tap_sock_unix_open(), if we have a given path for the socket from
configuration, we don't need to loop over possible paths, so we exit
the loop on the first iteration, unconditionally.
But if we failed to bind() the socket to that explicit path, we should
exit, instead of continuing. Otherwise we'll pretend we're up and
running, but nobody can contact us, and this might be mildly confusing
for users.
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2299474
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
If we start pasta with some ports forwarded, but no --config-net, say:
$ ./pasta -u 10001
and then use a local, non-loopback address to send traffic to that
port, say:
$ socat -u FILE:test UDP4:192.0.2.1:10001
pasta writes to the tap file descriptor, but if the interface is down,
we get EIO and terminate.
By itself, what I'm doing in this case is not very useful (I simply
forgot to pass --config-net), but if we happen to have a DHCP client
in the network namespace, the interface might still be down while
somebody tries to send traffic to it, and exiting in that case is not
really helpful.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
icmp_sock_handler() obtains the guest address from it's most recently
observed IP. However, this can now be obtained from the common flowside
information.
icmp_tap_handler() builds its socket address for sendto() directly
from the destination address supplied by the incoming tap packet.
This can instead be generated from the flow.
Using the flowsides as the common source of truth here prepares us for
allowing more flexible NAT and forwarding by properly initialising
that flowside information.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Now that we have logging functions embedding perror() functionality,
we can make _some_ calls more terse by using them. In many places,
the strerror() calls are still more convenient because, for example,
they are used in flow debugging functions, or because the return code
variable of interest is not 'errno'.
While at it, convert a few error messages from a scant perror style
to proper failure descriptions.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
buf_size is set to sizeof(pkt_buf) by default. And it seems more correct
to provide the actual size of the buffer.
Later a buf_size of 0 will allow vhost-user mode to detect
guest memory buffers.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
As we are going to introduce the MODE_VU that will act like
the mode MODE_PASST, compare to MODE_PASTA rather than to add
a comparison to MODE_VU when we check for MODE_PASST.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Consolidate pool_tap4() and pool_tap6() into tap_flush_pools(),
and tap4_handler() and tap6_handler() into tap_handler().
Create a generic tap_add_packet() to consolidate packet
addition logic and reduce code duplication.
The purpose is to ease the export of these functions to use
them with the vhost-user backend.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We globally disabled this, with a justification lumped together with
several checks about braces. They don't really go together, the others
are essentially a stylistic choice which doesn't match our style. Omitting
brackets on macro parameters can lead to real and hard to track down bugs
if an expression is ever passed to the macro instead of a plain identifier.
We've only gotten away with the macros which trigger the warning, because
of other conventions its been unlikely to invoke them with anything other
than a simple identifier. Fix the macros, and enable the warning for the
future.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Otherwise, if the user runs us as root, and gives us paths that are
only accessible by root, we'll fail to open them, which might in turn
encourage users to change permissions or ownerships: definitely a bad
idea in terms of security.
Reported-by: Minxi Hou <mhou@redhat.com>
Reported-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Richard W.M. Jones <rjones@redhat.com>
We'll need to open and bind the socket a while before listening to it,
so split that into two different functions. No functional changes
intended.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
This is a remnant from the time we kept access to the original
filesystem and we could reinitialise the listening AF_UNIX socket.
Since commit 0515adceaa ("passt, pasta: Namespace-based sandboxing,
defer seccomp policy application"), however, we can't re-bind the
listening socket once we're up and running.
Drop the -1 initalisation and the corresponding check.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
It has nothing to do with tap_sock_unix_init(). It used to be there as
that function could be called multiple times per passt instance, but
it's not the case anymore.
This also takes care of the fact that, with --fd, we wouldn't set the
initial MAC address, so we would need to wait for the guest to send us
an ARP packet before we could exchange data.
Fixes: 6b4e68383c ("passt, tap: Add --fd option")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Richard W.M. Jones <rjones@redhat.com>
Laurent's recent changes mean we use IO vectors much more heavily in the
TCP code. In many of those cases, and few others around the code base,
individual iovs of these vectors are constructed to exactly cover existing
variables or fields. We can make initializing such iovs shorter and
clearer with a macro for the purpose.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
At various points we need to track the lengths of a packet including or
excluding various different sets of headers. We don't always use the same
variable names for doing so. Worse in some places we use the same name
for different things: e.g. tcp_fill_headers[46]() use ip_len for the
length including the IP headers, but then tcp_send_flag() which calls it
uses it to mean the IP payload length only.
To improve clarity, standardise on these names:
dlen: L4 protocol payload length ("data length")
l4len: plen + length of L4 protocol header
l3len: l4len + length of IPv4/IPv6 header
l2len: l3len + length of L2 (ethernet) header
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
csum_ip4_header() takes the packet length as a network endian value. In
general it's very error-prone to pass non-native-endian values as a raw
integer. It's particularly bad here because this differs from other
checksum functions (e.g. proto_ipv4_header_psum()) which take host native
lengths.
It turns out all the callers have easy access to the native endian value,
so switch it to use host order like everything else.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In general, it's much less error-prone to have the endianness of values
implied by the type, rather than just noting it in comments. We can't
always easily avoid it, because C, but we can do so when possible. struct
in_addr and in6_addr are always encoded network endian, so noting it
explicitly isn't useful. Remove them.
In some cases we also have endianness notes on uint8_t parameters, which
doesn't make sense: for a single byte endianness is irrelevant. Remove
those too.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Most times we send frames to the guest it goes via tap_send_frames().
However "slow path" protocols - ARP, ICMP, ICMPv6, DHCP and DHCPv6 - go
via tap_send().
As well as being a semantic duplication, tap_send() contains at least one
serious problem: it doesn't properly handle short sends, which can be fatal
on the qemu socket connection, since frame boundaries will get out of sync.
Rewrite tap_send() to call tap_send_frames(). While we're there, rename it
tap_send_single() for clarity.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We can both remove some variables which differ from others only in type,
and slightly improve type safety.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tap_send_frames() takes a vector of buffers and requires exactly one frame
per buffer. We have future plans where we want to have multiple buffers
per frame in some circumstances, so extend tap_send_frames() to take the
number of buffers per frame as a parameter.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Improve comment to rembufs calculation]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In tap_send_frames(), if we failed to send all the frames, we must
only log the frames that have been sent, not all the frames we wanted
to send.
Fixes: dda7945ca9 ("pcap: Handle short writes in pcap_frame()")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Use ethhdr rather than tap_hdr.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-ID: <20240303135114.1023026-9-lvivier@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We can find the same function to compute the IPv4 header
checksum in tcp.c, udp.c and tap.c
Use the function defined for tap.c, csum_ip4_header(), but
with the code used in tcp.c and udp.c as it doesn't need a fully
initialiazed IPv4 header, only protocol, tot_len, saddr and daddr.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-ID: <20240303135114.1023026-7-lvivier@redhat.com>
[dwg: Fix weird cppcheck regression; it appears to be a problem
in pre-existing code, but somehow this patch is exposing it]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Introduce ip.[ch] file to encapsulate IP protocol handling functions and
structures. Modify various files to include the new header ip.h when
it's needed.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-ID: <20240303135114.1023026-5-lvivier@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The "tap" interface, whether it's actually a tuntap device or a qemu
socket, presents a virtual external link between different network hosts.
Hence, loopback addresses make no sense there. However, nothing prevents
the guest from putting bogus packets with loopback addresses onto the
interface and it's not entirely clear what effect that will have on passt.
Explicitly test for such packets and drop them.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When we determine we have sent a partial frame in tap_send_frames_passt(),
we call tap_send_remainder() to send the remainder of it. The logic in
that function is very similar to that in the more general write_remainder()
except that it uses send() instead of write()/writev(). But we are dealing
specifically with the qemu socket here, which is a connected stream socket.
In that case write()s do the same thing as send() with the options we were
using, so we can just reuse write_remainder().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently pcap_frame() assumes that if write() doesn't return an error, it
has written everything we want. That's not necessarily true, because it
could return a short write. That's not likely to happen on a regular file,
but there's not a lot of reason not to be robust here; it's conceivable we
might want to direct the pcap fd at a named pipe or similar.
So, make pcap_frame() handle short frames by using the write_remainder()
helper.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Formatting fix, and avoid gcc warning in pcap_frame()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Several of the IOV functions in iov.c, and also tap_send_frames_passt()
needs to determine which buffer element a byte offset into an IO vector
lies in. Split this out into a helper function iov_skip_bytes().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Sufficiently recent cppcheck (I'm using 2.13.0) seems to have added another
warning for pointer variables which could be pointer to const but aren't.
Use this to make a bunch of variables const pointers where they previously
weren't for no particular reason.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Types size_t and ssize_t are not necessarily long, it depends on the
architecture.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
tap_send_frames_pasta() sends frames to the namespace by sending them to
our the /dev/tap device. If that write() returns an error, we already
handle it. However we don't handle the case where the write() returns
short, meaning we haven't successfully transmitted the whole frame.
I don't know if this can ever happen with the kernel tap device, but we
should at least report the case so we don't get a cryptic failure. For
the purposes of the return value for tap_send_frames_pasta() we treat this
case as though it was an error (on the grounds that a partial frame is no
use to the namespace).
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>