When the guest tries to establish a connection, it could give up on it by
sending a FIN,ACK instead of a plain ACK to our SYN,ACK. It could then
make a new attempt to establish a connection with the same addresses and
ports with a new SYN.
Although it's unlikely, it could send the 2nd SYN very shortly after the
FIN,ACK resulting in both being received in the same batch of packets from
the tap interface.
Currently, we don't handle that correctly, when we receive a FIN,ACK on a
not fully established connection we discard the remaining packets in the
batch, and so will never process the 2nd SYN. Correct this by returning
1 from tcp_tap_handler() in this case, so we'll just consume the FIN,ACK
and continue to process the rest of the batch.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
There are a number of conditions where we will issue a TCP RST in response
to something unexpected we received from the tap interface. These occur in
both tcp_data_from_tap() and tcp_tap_handler(). In tcp_tap_handler() use
a 'goto out of line' technique to consolidate all these paths into one
place. For the tcp_data_from_tap() cases use a negative return code and
direct that to the same path in tcp_tap_handler(), its caller.
In this case we want to discard all remaining packets in the batch we have
received: even if they're otherwise good, they'll be invalidated when the
guest receives the RST we're sending. This is subtly different from the
case where we *receive* an RST, where we could in theory get a new SYN
immediately afterwards. Clarify that with a common on the now common
reset path.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Although it's unlikely in practice, the guest could theoretically
reset one TCP connection then immediately start a new one with the
same addressses and ports, such that we get an RST then a SYN in the
same batch of received packets in tcp_tap_handler().
We don't correctly handle that unlikely case, because when we receive
the RST, we discard any remaining packets in the batch so we'd never
see the SYN. This could happen in either tcp_tap_handler() or
tcp_data_from_tap(). Correct that by returning 1, so that the caller
will continue calling tcp_tap_handler() on subsequent packets allowing
us to process any subsequent SYN.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently tcp_data_from_tap() is assumed to consume all packets remaining
in the packet pool it is given. However there are some edge cases where
that's not correct. In preparation for fixing those, change it to return
a count of packets consumed and use that in its caller.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
>From a practical point of view, when a TCP connection ends, whether by
FIN or by RST, we set the CLOSED event, then some time later we remove the
connection from the hash table and clean it up. However, from a protocol
point of view, once it's closed, it's gone, and any new packets with
matching addresses and ports are either forming a new connection, or are
invalid packets to discard.
Enforce these semantics in the TCP hash logic by never hash matching closed
connections.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Both tcp_data_from_tap() and tcp_tap_handler() call packet_get() to get
the entire L4 packet length, then immediately call it again to check that
the packet is long enough to include a TCP header. The features of
packet_get() let us easily combine these together, we just need to adjust
the length slightly, because we want the value to include the TCP header
length.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In both tap4_handler() and tap6_handler(), once we've sorted incoming l3
packets into "sequences", we then step through all the packets in each DUP
sequence calling udp_tap_handler(). Or so it appears.
In fact, udp_tap_handler() doesn't take an index and always starts with
packet 0 of the sequence, even if called repeatedly. It appears to be
written with the idea that the struct pool is a queue, from which it
consumes packets as it processes them, but that's not how the pool data
structure works.
Correct this by adding an index parameter to udp_tap_handler() and altering
the loops in tap.c to step through the pool properly.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In both tap4_handler() and tap6_handler(), once we've sorted incoming l3
packets into "sequences", we then step through all the packets in each TCP
sequence calling tcp_tap_handler(). Or so it appears.
In fact, tcp_tap_handler() doesn't take an index and always looks at packet
0 of the sequence, except when it calls tcp_data_from_tap() to process
data packets. It appears to be written with the idea that the struct pool
is a queue, from which it consumes packets as it processes them, but that's
not how the pool data structure works - they are more like an array of
packets.
We only get away with this, because setup packets for TCP tend to come in
separate batches (because we need to reply in between) and so we only get
a bunch of packets for the same connection together when they're data
packets (tcp_data_from_tap() has its own loop through packets).
Correct this by adding an index parameter to tcp_tap_handler() and altering
the loops in tap.c to step through the pool properly.
Link: https://bugs.passt.top/show_bug.cgi?id=68
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Ugly as hell, but we keep breaking things otherwise, and I keep
forgetting to run this manually (as long as it's based on my local
Podman setup, that's the only alternative).
We need to clone the Podman repository as distribution packages don't
contain test scripts, typically. While at it, build the latest
version which is what really matters.
As we're planning anyway to revamp the test framework, I'd be
inclined to just add this without too many thoughts, and have it as
a nice-to-have requirement reminder for the new framework.
Link: https://github.com/containers/podman/pull/19699
Suggested-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
BOOTP clients do not use tagged messages for requests.
As such, any message without the DHCP option 53, should be
considered a BOOTP request.
Link: https://bugs.passt.top/show_bug.cgi?id=72
Signed-off-by: Stas Sergeev <stsp2@yandex.ru>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
l3_len was calculated from the ethernet frame size, and it
was assumed to be equal to the length stored in an IP packet.
But if the ethernet frame is padded, then l3_len calculated
that way can only be used as a bound check to validate the
length stored in an IP header. It should not be used for
calculating the l4_len.
This patch makes sure the small padded ethernet frames are
properly processed, by trusting the length stored in an IP
header.
Link: https://bugs.passt.top/show_bug.cgi?id=73
Signed-off-by: Stas Sergeev <stsp2@yandex.ru>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The hard link trick didn't actually fix the issue with SELinux file
contexts properly: as opposed to symbolic links, SELinux now
correctly associates types to the labels that are set -- except that
those labels are now shared, so we can end up (depending on how
rpm(8) extracts the archives) with /usr/bin/passt having a
pasta_exec_t context.
This got rather confusing as running restorecon(8) seemed to fix up
labels -- but that's simply toggling between passt_exec_t and
pasta_exec_t for both links, because each invocation will just "fix"
the file with the mismatching context.
Replace the hard links with two separate builds of the binary, as
suggested by David. The build is reproducible, so we pass "-pasta" in
the VERSION for pasta's build. This is wasteful but better than the
alternative.
Just copying the binary over would otherwise cause issues with
debuginfo packages due to duplicate Build-IDs -- and rpmbuild(8) also
warns about them.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
If pasta and pasta.avx2 are hard links to passt and passt.avx2,
AppArmor will attach their own profiles on execution, and we can
restrict passt's profile to what it actually needs. Note that pasta
needs to access all the resources that passt needs, so the pasta
abstraction still includes passt's one.
I plan to push the adaptation required for the Debian package in
commit 5bb812e79143 ("debian/rules: Override pasta symbolic links
with hard links"), on Salsa. If other distributions need to support
AppArmor profiles they can follow a similar approach.
The profile itself will be installed, there, via dh_apparmor, in a
separate commit, b52557fedcb1 ("debian/rules: Install new pasta
profile using dh_apparmor").
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Since commit b0e450aa8500 ("pasta: Detach mount namespace, (re)mount
procfs before spawning command"), we need to explicitly permit mount
of /proc, and access to entries under /proc/PID/net (after remount,
that's what AppArmor sees as path).
Fixes: b0e450aa8500 ("pasta: Detach mount namespace, (re)mount procfs before spawning command")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Starting with commit 770d1a4502dd ("isolation: Initially Keep
CAP_SETFCAP if running as UID 0 in non-init"), the lack of this rule
became more apparent as pasta needs to access uid_map in procfs even
as non-root.
However, both passt and pasta needs this, in case they are started as
root, so add this directly to passt's abstraction (which is sourced
by pasta's profile too).
Fixes: 770d1a4502dd ("isolation: Initially Keep CAP_SETFCAP if running as UID 0 in non-init")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
As a result of AppArmor commit d4b0fef10a4a ("parser: fix rule flag
generation change_mount type rules"), we can't expect anymore to
get permission to mount() / read-write, with MS_REC | MS_UNBINDABLE
("runbindable", in AppArmor terms), if we don't explicitly pass those
flags as options. It used to work by mistake.
Now, the reasonable expectation would be that we could just change the
existing rule into:
mount options=(rw, runbindable) "" -> /,
...but this now fails to load too, I think as a result of AppArmor
commit 9d3f8c6cc05d ("parser: fix parsing of source as mount point
for propagation type flags"). It works with 'rw' alone, but
'runbindable' is indeed a propagation type flag.
Skip the source specification, it doesn't add anything meaningful to
the rule anyway.
Reported-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/pull/19751
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
While abstractions/nameservice appeared too broad and overkill for
our simple need (read-only resolv.conf access), it properly deals
with symlinked resolv.conf files generated by systemd-resolved,
NetworkManager or suchlike.
If we just grant read-only access to /etc/resolv.conf, we'll fail to
read nameserver information in rather common configurations, because
AppArmor won't follow the symlink.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Host routes can include a preferred source address (RTA_PREFSRC), which
must be one of the host's addresses. However when using pasta with -a the
namespace might be given a different address, not on the host. This seems
to occur pretty routinely depending on the network configuration systems
in place on the host.
With --config-net we will try to copy host routes to the namespace. If
one of those includes an RTA_PREFSRC, but the namespace doesn't have the
host address, this will fail with -EINVAL, causing pasta to fail.
Fix this by stripping off RTA_PREFSRC attributes from routes as we copy
them to the namespace. This is by no means infallible, bit it should at
least handle common cases for the time being.
Link: https://bugs.passt.top/show_bug.cgi?id=71
Link: https://github.com/containers/podman/pull/19699#issuecomment-1688769287
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Otherwise, we actually configure the address, but it's not usable
because no local route is added by the kernel.
Link: https://github.com/containers/podman/pull/19699
Fixes: cfe7509e5c16 ("netlink: Use struct in_addr for IPv4 addresses, not bare uint32_t")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcp_defer_handler() performs a potentially expensive linear scan of the
connection table. So, to mitigate the cost of that we skip if if we're not
under at least moderate pressure: either 30% of available connections or
30% (estimated) of available fds used.
But, the calculation for this has been broken since it was introduced: we
calculate "max_conns" based on c->tcp.conn_count, not TCP_MAX_CONNS,
meaning we only exit early if conn_count is less than 30% of itself, i.e.
never.
If that calculation is "corrected" to be based on TCP_MAX_CONNS, it
completely tanks the TCP CRR times for passt - from ~60ms to >1000ms on my
laptop. My guess is that this is because in the case of many short lived
connections, we're letting the table become much fuller before compacting
it. That means that other places which perform a table scan now have to
do much, much more.
For the time being, simply remove the tests, since they're not doing
anything useful. We can reintroduce them more carefully if we see a need
for them.
This also removes the only user of c->tcp.splice_conn_count, so that can
be removed as well.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This was overlooked when the file was created.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The in_epoll boolean is one of only two fields (currently) in the common
structure shared between tap and spliced connections. It seems like it
belongs there, because both tap and spliced connections use it, and it has
roughly the same meaning.
Roughly, however, isn't exactly: which fds this flag says are in the epoll
varies between the two connection types, and are in type specific fields.
So, it's only possible to meaningfully use this value locally in type
specific code anyway.
This common field is going to get in the way of more widespread
generalisation of connection / flow tracking, so move it to separate fields
in the tap and splice specific structures.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Because packets sent on the tap interface will always be going to the
guest/namespace, we more-or-less know what address they'll be going to. So
we pre-fill this destination address in our header buffers for IPv4. We
can't do the same for IPv6 because we could need either the global or
link-local address for the guest. In future we're going to want more
flexibility for the destination address, so this pre-filling will get in
the way.
Change the flow so we always fill in the IPv4 destination address for each
packet, rather than prefilling it from proto_update_l2_buf(). In fact for
TCP we already redundantly filled the destination for each packet anyway.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We partially prepopulate IP and TCP header structures including, amongst
other things the destination address, which for IPv4 is always the known
address of the guest/namespace. We partially precompute both the IPv4
header checksum and the TCP checksum based on this.
In future we're going to want more flexibility with controlling the
destination for IPv4 (as we already do for IPv6), so this precomputed value
gets in the way. Therefore remove the IPv4 destination from the
precomputed checksum and fold it into the checksum update when we actually
send a packet.
Doing this means we no longer need to recompute those partial sums when
the destination address changes ({tcp,udp}_update_l2_buf()) and instead
the computation can be moved to compile time. This means while we perform
slightly more computations on each packet, we slightly reduce the amount of
memory we need to access.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In tcp_seq_init() the meaning of "src" and "dst" isn't really clear since
it's used for connections in both directions. However, these values are
just feeding a hash, so as long as we're consistent and include all the
information we want, it doesn't really matter.
Oddly, for the "src" side we supply the (tap side) forwarding address but
the (tap side) endpoint port. This again doesn't really matter, but it's
confusing. So swap this with dstport, so "src" is always forwarding
and "dst" is always endpoint.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In a number of places the comments and variable names we use to describe
addresses and ports are ambiguous. It's not sufficient to describe a port
as "tap-facing" or "socket-facing", because on both the tap side and the
socket side there are two ports for the two ends of the connection.
Similarly, "local" and "remote" aren't particularly helpful, because it's
not necessarily clear whether we're talking from the point of view of the
guest/namespace, the host, or passt itself.
This patch makes a number of changes to be more precise about this. It
introduces two new terms in aid of this:
A "forwarding" address (or port) refers to an address which is local
from the point of view of passt itself. That is a source address for
traffic sent by passt, whether it's to the guest via the tap interface
or to a host on the internet via a socket.
The "endpoint" address (or port) is the reverse: a remote address
from passt's point of view, the destination address for traffic sent
by passt.
Between them the "side" (either tap/guest-facing or sock/host-facing)
and forwarding vs. endpoint unambiguously describes which address or
port we're talking about.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The tap code passes the IPv4 or IPv6 destination address of packets it
receives to the protocol specific code. Currently that protocol code
doesn't use the source address, but we want it to in future. So, in
preparation, pass the IPv4/IPv6 source address of tap packets to those
functions as well.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In tap6_handler() saddr is initialized to the IPv6 source address from the
incoming packet. However part way through, but before organizing the
packet into a "sequence" we set it unconditionally to the guest's assigned
address. We don't do anything equivalent for IPv4.
This doesn't make a lot of sense: if the guest is using a different source
address it makes sense to consider these different sequences of packets and
we shouldn't try to combine them together.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
...now it gets ugly. If we use pasta without an existing target
namespace, and run commands directly or spawn a shell, and keep
the pasta_t domain when we do, they won't be able to do much: a
shell might even start, but it's not going to be usable, or to
even display a prompt.
Ideally, pasta should behave like a shell when it spawns a command:
start as unconfined_t and automatically transition to whatever
domain is associated in the specific policy for that command. But
we can't run as unconfined_t, of course.
It would seem natural to switch to unconfined_t "just before", so
that the default transitions happen. But transitions can only happen
when we execvp(), and that's one single transition -- not two.
That is, this approach would work for:
pasta -- sh -c 'ip address show'
but not for:
pasta -- ip address show
If we configure a transition to unconfined_t when we run ip(8), we'll
really try to start that as unconfined_t -- but unconfined_t isn't
allowed as entrypoint for ip(8) itself, and execvp() will fail.
However, there aren't many different types of binaries pasta might
commonly run -- for example, we're unlikely to see pasta used to run
a mount(8) command.
Explicitly set up domain transition for common stuff -- switching to
unconfined_t for bin_t and shells works just fine, ip(8), ping(8),
arping(8) and similar need a different treatment.
While at it, allow commands we spawn to inherit resource limits and
signal masks, because that's what happens by default, and don't
require AT_SECURE sanitisation of the environment (because that
won't happen by default). Slightly unrelated: we also need to
explicitly allow pasta_t to use TTYs, not just PTYs, otherwise
we can't keep stdin and stdout open for shells.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
This is needed to monitor filesystem-bound namespaces and quit when
they're gone -- this feature never really worked with SELinux.
Fixes: 745a9ba4284c ("pasta: By default, quit if filesystem-bound net namespace goes away")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Richard W.M. Jones <rjones@redhat.com>
That's what we actually need to check networking-related sysctls,
to scan for bound ports, and to manipulate bits of network
configuration inside pasta's target namespaces.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Richard W.M. Jones <rjones@redhat.com>
Somehow most of this used to work on older kernels, but now we need
to explicitly permit setuid, setgid, and setcap capabilities, as well
as read-only access to passwd (as we support running under a given
login name) and sssd library facilities.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Kernel commit ed5d44d42c95 ("selinux: Implement userns_create hook")
seems to just introduce a new functionality, but given that SELinux
implements a form of mandatory access control, introducing the new
permission breaks any application (shipping with SELinux policies)
that needs to create user namespaces, such as passt and pasta for
sandboxing purposes.
Add the new 'allow' rules. They appear to be backward compatible,
kernel-wise, and the policy now requires the new 'user_namespace'
class to build, but that's something distributions already ship.
Reported-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
There's no reason to use wildcards, and we don't want any
similarly-named binary (not that I'm aware of any) to risk being
associated to passt_exec_t and pasta_exec_t by accident.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
The Makefile installs symbolic links by default, which actually
worked at some point (not by design) with SELinux, but at least on
recent kernel versions it doesn't anymore: override pasta (and
pasta.avx2) with hard links.
Otherwise, even if the links are labeled as pasta_exec_t, SELinux
will "resolve" them to passt_exec_t, and we'll have pasta running as
passt_t instead of pasta_t.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Richard W.M. Jones <rjones@redhat.com>
Spotted by Coverity, relatively harmless.
Fixes: e01759e2fab0 ("tap: Explicitly drop IPv4 fragments, and give a warning")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
When we copy addresses from the host to the container in nl_addr_dup(), we
copy all the address's attributes, including IFA_CACHEINFO, which controls
the address's lifetime. If the host address is managed by, for example,
DHCP, it will typically have a finite lifetime.
When we copy that lifetime to the pasta container, that lifetime will
remain, meaning the kernel will eventually remove the address, typically
some hours later. The container, however, won't have the DHCP client or
whatever was managing and maintaining the address in the host, so it will
just lose connectivity.
Long term, we may want to monitor host address changes and reflect them to
the guest. But for now, we just want to take a snapshot of the host's
address and set those in the container permanently. We can accomplish that
by stripping off the IFA_CACHEINFO attribute as we copy addresses.
Link: https://github.com/containers/podman/issues/19405
Link: https://bugs.passt.top/show_bug.cgi?id=70
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In nl_addr_get() and nl_addr_dup() we step the attributes attached to each
RTM_NEWADDR message with a loop initialised with IFA_RTA() and
RTM_PAYLOAD() macros. RTM_PAYLOAD(), however is for RTM_NEWROUTE messages
(struct rtmsg), not RTM_NEWADDR messages (struct ifaddrmsg). Consequently
it miscalculates the size and means we can skip some attributes. Switch
to IFA_PAYLOAD() which we should be using here.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In the loop within nl_addr_dup() we check and skip for any messages that
aren't of type RTM_NEWADDR. This is a leftover that was missed in the
recent big netlink cleanup. In fact we already check for the message type
in the nl_foreach_oftype() macro, so the explicit test is redudant.
Remove it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We give a fatal error if the port ranges from any port forwarding
specifiers overlap. This occurs even if those port ranges are specifically
bound to different addresses, so there's not really any overlap.
Right now, we can't 100% handle this case correctly, because our data
structures don't have a way to represent per-address forwarding. However,
there are a number of cases that will actually work just fine: e.g. mapping
the same port to the same port on two different addresses (say :: and
127.0.0.1).
We have long term plans to fix this properly, but that is still some time
away. For the time being, demote this error to a warning so that the cases
that already work will be allowed.
Link: https://bugs.passt.top/show_bug.cgi?id=56
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we have a single epoll event type for the "tap" fd, which could
be either a handle on a /dev/net/tun device (pasta) or a connected Unix
socket (passt). However for the two modes we call different handler
functions. Simplify this a little by using different epoll types and
dispatching directly to the correct handler function.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tap_handler() actually handles events on three different types of object:
the /dev/tap character device (pasta), a connected Unix domain socket
(passt) or a listening Unix domain socket (passt).
The last, in particular, really has no handling in common with the others,
so split it into its own epoll type and directly dispatch to the relevant
handler from the top level.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcp_sock_handler() handles both listening TCP sockets, and connected TCP
sockets, but what it needs to do in those cases has essentially nothing in
common. Therefore, give listening sockets their own epoll_type value and
dispatch directly to their own handler from the top level. Furthermore,
the two handlers need essentially entirely different information from the
reference: we re-(ab)used the index field in the tcp_epoll_ref to indicate
the port for the listening socket, but that's not the same meaning. So,
switch listening sockets to their own reference type which we can lay out
as we please. That lets us remove the listen and outbound fields from the
normal (connected) tcp_epoll_ref, reducing it to just the connection table
index.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcp_sock_handler() actually handles several different types of fd events.
This includes timerfds that aren't sockets at all. The handling of these
has essentially nothing in common with the other cases. So, give the
TCP timers there own epoll_type value and dispatch directly to their
handler. This also means we can remove the timer field from tcp_epoll_ref,
the information it encoded is now implicit in the epoll_type value.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Move the test for c->no_udp into the function itself, rather than in the
dispatching switch statement to better localize the UDP specific logic, and
make for greated consistency with other handler functions.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We have different epoll type values for ICMP and ICMPv6 sockets, but they
both call the same handler function, icmp_sock_handler(). However that
function does essentially nothing in common for the two cases. So, split
it into icmp_sock_handler() and icmpv6_sock_handler() and dispatch them
separately from the top level.
While we're there remove some parameters that the function was never using
anyway. Also move the test for c->no_icmp into the functions, so that all
the logic specific to ICMP is within the handler, rather than in the top
level dispatch code.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When we process events from epoll_wait(), we check for a number of special
cases before calling sock_handler() which then dispatches based on the
protocol type of the socket in the event.
Fold these cases together into a single switch on the fd type recorded in
the epoll data field.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
epoll_ref contains a variety of information useful when handling epoll
events on our sockets, and we place it in the epoll_event data field
returned by epoll. However, for a few other things we use the 'fd' field
in the standard union of types for that data field.
This actually introduces a bug which is vanishingly unlikely to hit in
practice, but very nasty if it ever did: theoretically if we had a very
large file descriptor number for fd_tap or fd_tap_listen it could overflow
into bits that overlap with the 'proto' field in epoll_ref. With some
very bad luck this could mean that we mistakenly think an event on a
regular socket is an event on fd_tap or fd_tap_listen.
More practically, using different (but overlapping) fields of the
epoll_data means we can't unify dispatch for the various different objects
in the epoll. Therefore use the same epoll_ref as the data for the tap
fds and the netns quit fd, adding new fd type values to describe them.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The epoll_ref type includes fields for the IP protocol of a socket, and the
socket fd. However, we already have a few things in the epoll which aren't
protocol sockets, and we may have more in future. Rename these fields to
an abstract "fd type" and file descriptor for more generality.
Similarly, rather than using existing IP protocol numbers for the type,
introduce our own number space. For now these just correspond to the
supported protocols, but we'll expand on that in future.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We call tap_sock_reset() if tap_handler_passt() fails, or if we get an
error event on the socket. Fold that logic into tap_handler() passt itself
which simplifies the caller. It also makes it clearer that we had a
redundant EPOLL_CTL_DEL and close() in one of the reset paths, so fix that
too.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>