If libvirtd is triggered by an unprivileged user, the virt-aa-helper
mechanism doesn't work, because per-VM profiles can't be instantiated,
and as a result libvirtd runs unconfined.
This means passt can't start, because the passt subprofile from
libvirt's profile is not loaded either.
Example:
$ virsh start alpine
error: Failed to start domain 'alpine'
error: internal error: Child process (passt --one-off --socket /run/user/1000/libvirt/qemu/run/passt/1-alpine-net0.socket --pid /run/user/1000/libvirt/qemu/run/passt/1-alpine-net0-passt.pid --tcp-ports 40922:2) unexpected fatal signal 11
Add an annoying workaround for the moment being. Much better than
encouraging users to start guests as root, or to disable AppArmor
altogether.
Reported-by: Prafulla Giri <prafulla.giri@protonmail.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
...perhaps I should adopt the healthy habit of actually reading
headers instead of using my mental copy.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
When building against musl headers:
- sizeof() needs stddef.h, as it should be;
- we can't initialise a struct msghdr by simply listing fields in
order, as they contain explicit padding fields. Use field names
instead.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
When returning from main it does the same as calling exit() which is not
good as glibc might try to call futex() which will be blocked by
seccomp. See the prevoius commit "treewide: use _exit() over exit()" for
a more detailed explanation.
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In the podman CI I noticed many seccomp denials in our logs even though
tests passed:
comm="pasta.avx2" exe="/usr/bin/pasta.avx2" sig=31 arch=c000003e
syscall=202 compat=0 ip=0x7fb3d31f69db code=0x80000000
Which is futex being called and blocked by the pasta profile. After a
few tries I managed to reproduce locally with this loop in ~20 min:
while :;
do podman run -d --network bridge quay.io/libpod/testimage:20241011 \
sleep 100 && \
sleep 10 && \
podman rm -fa -t0
done
And using a pasta version with prctl(PR_SET_DUMPABLE, 1); set I got the
following stack trace:
Stack trace of thread 1:
#0 0x00007fc95e6de91b __lll_lock_wait_private (libc.so.6 + 0x9491b)
#1 0x00007fc95e68d6de __run_exit_handlers (libc.so.6 + 0x436de)
#2 0x00007fc95e68d70e exit (libc.so.6 + 0x4370e)
#3 0x000055f31b78c50b n/a (n/a + 0x0)
#4 0x00007fc95e68d70e exit (libc.so.6 + 0x4370e)
#5 0x000055f31b78d5a2 n/a (n/a + 0x0)
Pasta got killed in exit(), it seems glibc is trying to use a lock when
running exit handlers even though no exit handlers are defined.
Given no exit handlers are needed we can call _exit() instead. This
skips exit handlers and does not flush stdio streams compared to exit()
which should be fine for the use here.
Based on the input from Stefano I did not change the test/doc programs
or qrap as they do not use seccomp filters.
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
For migration we need to get the specific local address and port for
connected sockets with getsockname(). We currently open code marshalling
the results into the flow entry.
However, we already have inany_from_sockaddr() which handles the fiddly
parts of this, so use it. Also report failures, which may make debugging
problems easier.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Drop re-declarations of 'sa' and 'sl']
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The passt-repair helper is now merged, but alas it contains several small
bugs:
* close() is not in the seccomp profile, meaning it will immediately
SIGSYS when you make a request of it
* The generated header, seccomp_repair.h isn't listed in .gitignore or
removed by "make clean"
Fixes: 8c24301462c3 ("Introduce passt-repair")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
For migration only: we need to store 'oport', our socket-side port,
as we establish a connection from the guest, so that we can bind the
same oport as source port in the migration target.
Similar for 'oaddr': this is needed in case the migration target has
additional network interfaces, and we need to make sure our socket is
bound to the equivalent interface as it was on the source.
Use getsockname() to fetch them.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Having every vhost-user message printed as part of debug output makes
debugging anything else a bit complicated.
Change per-packet debug() messages in vu_kick_cb() and
vu_send_single() to trace()
[dgibson: switch different messages to trace()]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
These are symmetric to write_remainder() and write_all_buf() and
almost a copy and paste of them, with the most notable differences
being reversed reads/writes and a couple of better-safe-than-sorry
asserts to keep Coverity happy.
I'll use them in the next patch. At least for the moment, they're
going to be used for vhost-user mode only, so I'm not unconditionally
enabling readv() in the seccomp profile: the caller has to ensure it's
there.
[dgibson: make read_remainder() take const pointer to iovec]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
I explicitly added fcntl64() to the list of allowed system calls for
PPC64 a while ago, and now it turns out it's not available in recent
Debian builds. The warning from seccomp.sh is harmless because we
unconditionally try to enable fcntl() anyway, but take care of it
anyway.
Link: https://buildd.debian.org/status/fetch.php?pkg=passt&arch=ppc64&ver=0.0%7Egit20250121.4f2c8e7-1&stamp=1737477147&raw=0
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reported by somebody on IRC: if the server has considerable latency,
it might happen that the client retries sending SYN segments for the
same flow while we're still in a TAP_SYN_RCVD, non-ESTABLISHED state.
In that case, we should go with the blanket assumption that we need
to reset the connection on any unexpected segment: RFC 9293 explicitly
mentions this case in Figure 8: Recovery from Old Duplicate SYN,
section 3.5. It doesn't make sense for us to set a specific sequence
number, socket-side, but we should definitely wait and see.
Ignoring the duplicate SYN segment should also be compatible with
section 3.10.7.3. SYN-SENT STATE, which mentions updating sequences
socket-side (which we can't do anyway), but certainly not reset the
connection.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
On Fedora 41, without "allow pasta_t unconfined_t:dir read"
/usr/bin/pasta can't open /proc/[pid]/ns, which is required by
pasta_netns_quit_init().
This patch also remove one duplicate rule "allow pasta_t nsfs_t:file
read;", "allow pasta_t nsfs_t:file { open read };" at line 123 is
enough.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Those are symmetric to TAPSIDE(x)/TAPFLOW(x) and I'll use them in
the next patch to extract 'oport' in order to re-bind sockets to
the original socket-side local port.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
vu_remove_watch() is used in vhost_user.c to remove an fd from the global
epoll set. There's nothing really vhost user specific about it though,
so rename, move to util.c and use it in a bunch of places outside
vhost_user.c where it makes things marginally more readable.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In tcp_epoll_ctl() we pass an event pointer with EPOLL_CTL_DEL, even though
it will be ignored. It's possible this was a workaround for pre-2.6.9
kernels which required a non-NULL pointer here, but we rely on the kernel
accepting NULL events for EPOLL_CTL_DEL in lots of other places. Use
NULL instead for simplicity and consistency.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Passt cannot manage and doesn't need to manage the broadcast of a fake RARP,
but QEMU will report an error message if Passt doesn't implement it.
Implement an empty SEND_RARP command to silence QEMU error message.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
There might be reasons to have routes on the loopback interface, for
example Any-IP/AnyIP routes as implemented by Linux kernel commit
ab79ad14a2d5 ("ipv6: Implement Any-IP support for IPv6.").
If we use the loopback interface as a template, though, we'll pick
'lo' (typically) as interface name for our tap interface, but we'll
already have an interface called 'lo' in the target namespace, and as
we TUNSETIFF on it, we'll fail with EINVAL, because it's not a tap
interface.
Skip the loopback interface while looking for a template interface or,
more accurately, skip the interface with index 1.
Strictly speaking, we should fetch interface flags via RTM_GETLINK
instead, and check for IFF_LOOPBACK, but interleaving that request
while we're iterating over routes is unnecessarily complicated.
Link: https://www.reddit.com/r/podman/comments/1i6pj7u/starting_pod_without_external_network/
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
If the iovec array cannot be managed, drop it rather than
passing the second entry to tap_add_packet().
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
So far we omitted setting PSH flags for inbound traffic altogether: as
we ignore the nature of the data we're sending, we can't conclude that
some data is more or less urgent. This works fine with Linux guests,
as the Linux kernel doesn't do much with it, on input: it will
generally deliver data to the application layer without delay.
However, with Windows, things change: if we don't set the PSH flag on
interactive inbound traffic, we can expect long delays before the data
is delivered to the application.
This is very visible with RDP, where packets we send on behalf of the
RDP client are delivered with delays exceeding one second:
$ tshark -r rdp.pcap -td -Y 'frame.number in { 33170 .. 33173 }' --disable-protocol tls
33170 0.030296 93.235.154.248 → 88.198.0.164 54 TCP 49012 → 3389 [ACK] Seq=13820 Ack=285229 Win=387968 Len=0
33171 0.985412 88.198.0.164 → 93.235.154.248 105 TCP 3389 → 49012 [PSH, ACK] Seq=285229 Ack=13820 Win=63198 Len=51
33172 0.030373 93.235.154.248 → 88.198.0.164 54 TCP 49012 → 3389 [ACK] Seq=13820 Ack=285280 Win=387968 Len=0
33173 1.383776 88.198.0.164 → 93.235.154.248 424 TCP 3389 → 49012 [PSH, ACK] Seq=285280 Ack=13820 Win=63198 Len=370
in this example (packet capture taken by passt), frame #33172 is a
mouse event sent by the RDP client, and frame #33173 is the first
event (display reacting to click) sent back by the server. This
appears as a 1.4 s delay before we get frame #33173.
If we set PSH, instead:
$ tshark -r rdp_psh.pcap -td -Y 'frame.number in { 314 .. 317 }' --disable-protocol tls
314 0.002503 93.235.154.248 → 88.198.0.164 170 TCP 51066 → 3389 [PSH, ACK] Seq=7779 Ack=74047 Win=31872 Len=116
315 0.000557 88.198.0.164 → 93.235.154.248 54 TCP 3389 → 51066 [ACK] Seq=79162 Ack=7895 Win=62872 Len=0
316 0.012752 93.235.154.248 → 88.198.0.164 170 TCP 51066 → 3389 [PSH, ACK] Seq=7895 Ack=79162 Win=31872 Len=116
317 0.011927 88.198.0.164 → 93.235.154.248 107 TCP 3389 → 51066 [PSH, ACK] Seq=79162 Ack=8011 Win=62756 Len=53
here, in frame #316, our mouse event is delivered without a delay and
receives a response in approximately 12 ms.
Set PSH on the last segment for any batch we dequeue from the socket,
that is, set it whenever we know that we might not be sending data to
the same port for a while.
Reported-by: NN708
Link: https://bugs.passt.top/show_bug.cgi?id=107
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Somewhat curiously, RFC 9293, section 3.10.7.3, states:
If the state is SYN-SENT, then
[...]
Second, check the RST bit:
- If the RST bit is set,
[...]
o If the ACK was acceptable, then signal to the user "error:
connection reset", drop the segment, enter CLOSED state,
delete TCB, and return. Otherwise (no ACK), drop the
segment and return.
which matches verbatim RFC 793, pages 66-67, and is implemented as-is
by tcp_rcv_synsent_state_process() in the Linux kernel, that is:
/* No ACK in the segment */
if (th->rst) {
/* rfc793:
* "If the RST bit is set
*
* Otherwise (no ACK) drop the segment and return."
*/
goto discard_and_undo;
}
meaning that if a client is in SYN-SENT state, and we send a RST
segment once we realise that we can't establish the outbound
connection, the client will ignore our segment and will need to
pointlessly wait until the connection times out instead of aborting
it right away.
The ACK flag on a RST, in this case, doesn't really seem to have any
function, but we must set it nevertheless. The ACK sequence number is
already correct because we always set it before calling
tcp_prepare_flags(), whenever relevant.
This leaves us with no cases where we should *not* set the ACK flag
on non-SYN segments, so always set the ACK flag for RST segments.
Note that non-SYN, non-RST segments were already covered by commit
4988e2b40631 ("tcp: Unconditionally force ACK for all !SYN, !RST
packets").
Reported-by: Dirk Janssen <Dirk.Janssen@schiphol.nl>
Reported-by: Roeland van de Pol <Roeland.van.de.Pol@schiphol.nl>
Reported-by: Robert Floor <Robert.Floor@schiphol.nl>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Following up on 725acd111ba3 ("tcp_splice: Set (again) TCP_NODELAY on
both sides"), David argues that, in general, we don't know what kind
of TCP traffic we're dealing with, on any side or path.
TCP segments might have been delivered to our socket with a PSH flag,
but we don't have a way to know about it.
Similarly, the guest might send us segments with PSH or URG set, but
we don't know if we should generally TCP_CORK sockets and uncork on
those flags, because that would assume they're running a Linux kernel
(and a particular version of it) matching the kernel that delivers
outbound packets for us.
Given that we can't make any assumption and everything might very well
be interactive traffic, disable Nagle's algorithm on all non-spliced
sockets as well.
After all, John Nagle himself is nowadays recommending that delayed
ACKs should never be enabled together with his algorithm, but we
don't have a practical way to ensure that our environment is free from
delayed ACKs (TCP_QUICKACK is not really usable for this purpose):
https://news.ycombinator.com/item?id=34180239
Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
...so it's pointless to set SO_RCVBUF and SO_SNDBUF on listening
sockets.
Call tcp_sock_set_bufsize() after accept4(), for inbound sockets.
As we didn't have large buffer sizes set for inbound sockets for
a long time (they are set explicitly only if the maximum size is
big enough, more than than the ~200 KiB default), I ran some more
throughput tests for this one, and I see slightly better numbers
(say, 17 gbps instead of 15 gbps guest to host without vhost-user).
Fixes: 904b86ade7db ("tcp: Rework window handling, timers, add SO_RCVLOWAT and pools for sockets/pipes")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Replace ASSERT() on the number of iovec in the element and on
the first entry length by a debug() message.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
[sbrivio: Fix typo in failure message]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Report to front-end that we support device state commands:
VHOST_USER_CHECK_DEVICE_STATE
VHOST_USER_SET_LOG_BASE
These feature is needed to transfer backend state using frontend
channel.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Set the file descriptor to use to transfer the
backend device state during migration.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
[sbrivio: Fixed nits and coding style here and there]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
After transferring the back-end’s internal state during migration,
check whether the back-end was able to successfully fully process
the state.
The value returned indicates success or error;
0 is success, any non-zero value is an error.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This features allows QEMU to be migrated. We need also to report
VHOST_F_LOG_ALL.
This protocol feature reports we can log the page update and
implement VHOST_USER_SET_LOG_BASE and VHOST_USER_SET_LOG_FD.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Sets logging shared memory space.
When the back-end has VHOST_USER_PROTOCOL_F_LOG_SHMFD protocol feature,
the log memory fd is provided in the ancillary data of
VHOST_USER_SET_LOG_BASE message, the size and offset of shared memory
area provided in the message.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
[sbrivio: Fix coding style in a bunch of places]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
vu_dev will be needed to log page update.
Add the parameter to:
vring_used_write()
vu_queue_fill_by_index()
vu_queue_fill()
vring_used_idx_set()
vu_queue_flush()
The new parameter is unused for now.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
VHOST_USER_SET_LOG_FD is an optional message with an eventfd
in ancillary data, it may be used to inform the front-end that the
log has been modified.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
[sbrivio: Fix comment to vu_set_log_fd_exec()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
vhost-user protocol specification has been updated with
feature flags and commands we will need to implement migration.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
[sbrivio: Fix comment to union vhost_user_payload]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
There are pretty much two cases of the (misnomer) STALLED: in one
case, we could send more data to the guest if it becomes available,
and in another case, we can't, because we filled the window.
If, in this second case, we keep EPOLLIN enabled, but never read from
the socket, we get short but CPU-annoying storms of EPOLLIN events,
upon which we reschedule the ACK timeout handler, never read from the
socket, go back to epoll_wait(), and so on:
timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0
epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1
timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0
epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1
timerfd_settime(76, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=2, tv_nsec=0}}, NULL) = 0
epoll_wait(3, [{events=EPOLLIN, data={u32=10497, u64=38654716161}}], 8, 1000) = 1
also known as:
29.1517: Flow 2 (TCP connection): timer expires in 2.000s
29.1517: Flow 2 (TCP connection): timer expires in 2.000s
29.1517: Flow 2 (TCP connection): timer expires in 2.000s
which, for some reason, becomes very visible with muvm and aria2c
downloading from a server nearby in parallel chunks.
That's because EPOLLIN isn't cleared if we don't read from the socket,
and even with EPOLLET, epoll_wait() will repeatedly wake us up until
we actually read something.
In this case, we don't want to subscribe to EPOLLIN at all: all we're
waiting for is an ACK segment from the guest. Differentiate this case
with a new connection flag, ACK_FROM_TAP_BLOCKS, which doesn't just
indicate that we're waiting for an ACK from the guest
(ACK_FROM_TAP_DUE), but also that we're blocked waiting for it.
If this flag is set before we set STALLED, EPOLLIN will be masked
while we set EPOLLET because of STALLED. Whenever we clear STALLED,
we also clear this flag.
This is definitely not elegant, but it's a minimal fix.
We can probably simplify this at a later point by having a category
of connection flags directly corresponding to epoll flags, and
dropping STALLED altogether, or, perhaps, always using EPOLLET (but
we need a mechanism to re-check sockets for pending data if we can't
temporarily write to the guest).
I suspect that this might also be implied in
https://github.com/containers/podman/issues/23686, hence the Link:
tag. It doesn't necessarily mean I'm fixing it (I can't reproduce
that).
Link: https://github.com/containers/podman/issues/23686
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Before SO_PEEK_OFF support was introduced by commit e63d281871ef
("tcp: leverage support of SO_PEEK_OFF socket option when available"),
we would peek data from sockets using a "discard" buffer as first
iovec element, so that, unless we had no pending data at all, we would
always get a positive return code from recvmsg() (except for closing
connections or errors).
If we couldn't send more data to the guest, in the window, we would
set the STALLED flag (causing the epoll descriptor to switch to
edge-triggered mode), and return early from tcp_data_from_sock().
With SO_PEEK_OFF, we don't have a discard buffer, and if there's data
on the socket, but nothing beyond our current peeking offset, we'll
get EAGAIN instead of our current "discard" length. In that case, we
return even earlier, and we don't set EPOLLET on the socket as a
result.
As reported by Asahi Lina, this causes event loops where the kernel is
signalling socket readiness, because there's data we didn't dequeue
yet (waiting for the guest to acknowledge it), but we won't actually
peek anything new, and return early without setting EPOLLET.
This is the original report, mentioning the originally proposed fix:
--
When there is unacknowledged data in the inbound socket buffer, passt
leaves the socket in the epoll instance to accept new data from the
server. Since there is already data in the socket buffer, an epoll
without EPOLLET will repeatedly fire while no data is processed,
busy-looping the CPU:
epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4
recvmsg(25, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
epoll_pwait(3, [...], 8, 1000, NULL, 8) = 4
recvmsg(25, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(169, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(111, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(180, {msg_namelen=0}, MSG_PEEK) = -1 EAGAIN (Resource temporarily unavailable)
Add in the missing EPOLLET flag for this case. This brings CPU
usage down from around ~80% when downloading over TCP, to ~5% (use
case: passt as network transport for muvm, downloading Steam games).
--
we can't set EPOLLET unconditionally though, at least right now,
because we don't monitor the guest tap for EPOLLOUT in case we fail
to write on that side because we filled up that buffer (and not the
window of a TCP connection).
Instead, rely on the observation that, once a connection is
established, we only get EAGAIN on recvmsg() if we are attempting to
peek data from a socket with a non-zero peeking offset: we only peek
when there's pending data on a socket, and in that case, if we peek
without offset, we'll always see some data.
And if we peek data with a non-zero offset and get EAGAIN, that means
that we're either waiting for more data to arrive on the socket (which
would cause further wake-ups, even with EPOLLET), or we're waiting for
the guest to acknowledge some of it, which would anyway cause a
wake-up.
In that case, it's safe to set STALLED and, in turn, EPOLLET on the
socket, which fixes the EPOLLIN event loop.
While we're establishing a connection from the socket side, though,
we'll call, once, tcp_{buf,vu}_data_from_sock() to see if we got
any data while we were waiting for SYN, ACK from the guest. See the
comment at the end of tcp_conn_from_sock_finish().
And if there's no data queued on the socket as we check, we'll also
get EAGAIN, even if our peeking offset is zero. For this reason, we
need to additionally check that 'already_sent' is not zero, meaning,
explicitly, that our peeking offset is not zero.
Reported-by: Asahi Lina <lina@asahilina.net>
Fixes: e63d281871ef ("tcp: leverage support of SO_PEEK_OFF socket option when available")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
I inadvertently added that in an unrelated change, but it doesn't make
sense: STALLED means we have pending socket data that we can't write
to the guest, not the other way around.
Fixes: bb708111833e ("treewide: Packet abstraction with mandatory boundary checks")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Under some conditions, linux can provide several buffers
in the same element (multiple entries in the iovec array).
I didn't identify what changed between the kernel guest that
provides one buffer and the one that provides several
(doesn't seem to be a kernel change or a configuration change).
Fix the following assert:
ASSERTION FAILED in virtqueue_map_desc (virtio.c:402): num_sg < max_num_sg
What I can see is the buffer can be splitted in two iovecs:
- vnet header
- packet data
This change manages this special case but the real fix will be to allow
tap_add_packet() to manage iovec array.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Increasingly often, I'm getting occasional failures of the same type
as https://github.com/containers/podman/issues/24147. I guess it
mostly depends on the system load.
It will be a while until I'll actually run tests on a kernel
including my fix for it, kernel commit a502ea6fa94b ("udp: Deal with
race between UDP socket address change and rehash"), so add a horrible
workaround using taskset(1), for the moment.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
csum_unfolded() must call csum_avx2() with a 32byte aligned base address.
To be able to do that if the buffer is not correctly aligned,
it splits the buffers in 2 parts, the second part is 32byte aligned and
can be used with csum_avx2(), the first part is the remaining part, that
is not 32byte aligned and we use sum_16b() to compute the checksum.
A problem appears if the length of the first part is odd because
the checksum is using 16bit words to do the checksum.
If the length is odd, when the second part is computed, all words are
shifted by 1 byte, meaning weight of upper and lower byte is swapped.
For instance a 13 bytes buffer:
bytes:
aa AA bb BB cc CC dd DD ee EE ff FF gg
16bit words:
AAaa BBbb CCcc DDdd EEee FFff 00gg
If we don't split the sequence, the checksum is:
AAaa + BBbb + CCcc + DDdd + EEee + FFff + 00gg
If we split the sequence with an even length for the first part:
(AAaa + BBbb) + (CCcc + DDdd + EEee + FFff + 00gg)
But if the first part has an odd length:
(AAaa + BBbb + 00cc) + (ddCC + eeDD + ffEE + ggFF)
To avoid the problem, do not call csum_avx2() if the first part cannot
have an even length, and compute the checksum of all the buffer using
sum_16b().
This is slower but it can only happen if the buffer base address is odd,
and this can only happen if the binary is built using '-Os', and that
means we have chosen to prioritize size over speed.
Reported-by: Mike Jones <mike@mjones.io>
Link: https://bugs.passt.top/show_bug.cgi?id=108
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Added comment explaining why we check for pad & 1]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In commit 7ecf69329787 ("pasta, tcp: Don't set TCP_CORK on spliced
sockets") I just assumed that we wouldn't benefit from disabling
Nagle's algorithm once we drop TCP_CORK (and its 200ms fixed delay).
It turns out that with some patterns, such as a PostgreSQL server
in a container receiving parameterised, short queries, for which pasta
sees several short inbound messages (Parse, Bind, Describe, Execute
and Sync commands getting each one their own packet, 5 to 49 bytes TCP
payload each), we'll read them usually in two batches, and send them
in matching batches, for example:
9165.2467: pasta: epoll event on connected spliced TCP socket 117 (events: 0x00000001)
9165.2468: Flow 0 (TCP connection (spliced)): 76 from read-side call
9165.2468: Flow 0 (TCP connection (spliced)): 76 from write-side call (passed 524288)
9165.2469: pasta: epoll event on connected spliced TCP socket 117 (events: 0x00000001)
9165.2470: Flow 0 (TCP connection (spliced)): 15 from read-side call
9165.2470: Flow 0 (TCP connection (spliced)): 15 from write-side call (passed 524288)
9165.2944: pasta: epoll event on connected spliced TCP socket 118 (events: 0x00000001)
and the kernel delivers the first one, waits for acknowledgement from
the receiver, then delivers the second one. This adds very substantial
and unnecessary delay. It's usually a fixed ~40ms between the two
batches, which is clearly unacceptable for loopback connections.
In this example, the delay is shown by the timestamp of the response
from socket 118. The peer (server) doesn't actually take that long
(less than a millisecond), but it takes that long for the kernel to
deliver our request.
To avoid batching and delays, disable Nagle's algorithm by setting
TCP_NODELAY on both internal and external sockets: this way, we get
one inbound packet for each original message, we transfer them right
away, and the kernel delivers them to the process in the container as
they are, without delay.
We can do this safely as we don't care much about network utilisation
when there's in fact pretty much no network (loopback connections).
This is unfortunately not visible in the TCP request-response tests
from the test suite because, with smaller messages (we use one byte),
Nagle's algorithm doesn't even kick in. It's probably not trivial to
implement a universal test covering this case.
Fixes: 7ecf69329787 ("pasta, tcp: Don't set TCP_CORK on spliced sockets")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
On Alpine Linux 3.21, passt aborts right away as soon as QEMU connects
to it.
Most likely, this has always been the case with musl, because since
musl commit dc01e2cbfb29 ("add fallback emulation for accept4 on old
kernels"), accept4() without flags is implemented using accept().
However, I guess that nobody realised earlier because it's typically
pasta(1) being used on musl-based distributions, and the only place
where we call accept4() without flags is tap_listen_handler().
Add accept() to the list of allowed system calls regardless of the
presence of accept4().
Reported-by: NN708 <nn708@outlook.com>
Link: https://bugs.passt.top/show_bug.cgi?id=106
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
We don't modify the structure in some virtio functions.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
It was reported that SSDP notifications sent from a container (with
e.g. minidlna) stopped appearing on the network starting from commit
1db4f773e87f ("udp: Improve detail of UDP endpoint sanity checking").
As a minimal reproducer using minidlnad(8):
$ mkdir /tmp/minidlna
$ cat conf
media_dir=/tmp/minidlna
db_dir=/tmp/minidlna
$ ./pasta -d --config-net -- sh -c '/usr/sbin/minidlnad -p 31337 -S -f conf -P /dev/null & (sleep 1; killall minidlnad)'
[...]
1.0327: Flow 0 (NEW): FREE -> NEW
1.0327: Flow 0 (INI): NEW -> INI
1.0327: Flow 0 (INI): TAP [88.198.0.164]:54185 -> [239.255.255.250]:1900 => ?
1.0327: Flow 0 (INI): Invalid endpoint on UDP packet
1.0327: Flow 0 (FREE): INI -> FREE
1.0328: Flow 0 (FREE): TAP [88.198.0.164]:54185 -> [239.255.255.250]:1900 => ?
1.0328: Dropping datagram with no flow TAP 88.198.0.164:54185 -> 239.255.255.250:1900
This is an actual regression as there's no particular reason to block
outbound multicast UDP packets.
And even if we don't handle multicast groups in any particular way
(https://bugs.passt.top/show_bug.cgi?id=2, "Add IGMP/MLD proxy"),
there's no reason to block inbound multicast or broadcast packets
either, should they ever be somehow delivered to passt or pasta.
Let multicast and broadcast packets through, refusing only to
establish flows with unspecified endpoint, as those would actually
cause havoc in the flow table.
IP-wise, SSDP notifications look like this (after this patch), inside
and outside:
$ pasta -p /tmp/minidlna.pcap --config-net -- sh -c '/usr/sbin/minidlnad -p 31337 -S -f minidlna.conf -P /dev/null & (sleep 1; killall minidlnad)'
[...]
$ tshark -a packets:1 -r /tmp/minidlna.pcap ssdp
2 0.074808 88.198.0.164 ? 239.255.255.250 SSDP 200 NOTIFY * HTTP/1.1
# tshark -i ens3 -a packets:1 multicast 2>/dev/null
1 0.000000000 88.198.0.164 ? 239.255.255.250 SSDP 200 NOTIFY * HTTP/1.1
Link: https://github.com/containers/podman/issues/24871
Fixes: 1db4f773e87f ("udp: Improve detail of UDP endpoint sanity checking")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
I don't think it's necessarily productive to check all the possible
error conditions in the Makefile, but this one is annoying: issue
'make' without a C compiler, then install one, and build again.
Then run passt and it will mysteriously terminate on epoll_wait(),
because seccomp.h is good enough to build against, but the resulting
seccomp filter doesn't allow any system call. Not really fun to debug.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
With glibc commit 25a5eb4010df ("string: strerror, strsignal cannot
use buffer after dlmopen (bug 32026)"), strerror() now needs, at least
on x86, the getrandom() and brk() system calls, in order to fill in
the locale-translated error message. But getrandom() and brk() are not
allowed by our seccomp profiles.
This became visible on Fedora Rawhide with the "podman login and
logout" Podman tests, defined at test/e2e/login_logout_test.go in the
Podman source tree, where pasta would terminate upon printing error
descriptions (at least the ones related to the SO_ERROR queue for
spliced connections).
Avoid dynamic memory allocation by calling strerrordesc_np() instead,
which is a GNU function returning a static, untranslated version of
the error description. If it's not available, keep calling strerror(),
which at that point should be simple enough as to be usable (at least,
that's currently the case for musl).
Reported-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/issues/24804
Analysed-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: Paul Holzinger <pholzing@redhat.com>
During testing it is sometimes useful to force traffic which would
normally be forwared by socket splicing through the tap interface.
In this commit, we add a command switch enabling such funtionality
for inbound local traffic.
For outbound local traffic this is much trickier, if even possible,
so leave that for a later commit.
Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>