timespec_diff_ms() returns an int representing a duration in milliseconds.
This will overflow in about 25 days when an int is 32 bits. The way we
use this function, we're probably not going to get a result that long, but
it's not outrageously implausible. Use a long for safety.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
A negative bit index in a bitmap doesn't make sense. Avoid this by
construction by using unsigned indices. While we're there adjust
bitmap_isset() to return a bool instead of an int.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We won't call it from main() any longer: move it.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
As I'm adding pidfile_open() in the next patch. The function comment
didn't match, by the way.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Richard W.M. Jones <rjones@redhat.com>
When reporting errors, we sometimes want to show a relevant socket address.
Doing so by extracting the various relevant fields can be pretty awkward,
so introduce a sockaddr_ntop() helper to make it simpler. For now we just
have one user in tcp.c, but I have further upcoming patches which can make
use of it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In write_remainder() 'skip' is the offset to start the operation from
in the iovec array.
In iov_skip_bytes(), 'skip' is also the offset in the iovec array but
'offset' is the first unskipped byte in the iovec entry.
As write_remainder() uses 'skip' for both, 'skip' is reset to the
first unskipped byte in the iovec entry rather to staying the first
unskipped byte in the iovec array.
Fix the problem by introducing a new variable not to overwrite 'skip'
on each loop.
Fixes: 8bdb0883b4 ("util: Add write_remainder() helper")
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently ping sockets use a custom epoll reference type which includes
the ICMP id. However, now that we have entries in the flow table for
ping flows, finding that is sufficient to get everything else we want,
including the id. Therefore remove the icmp_epoll_ref type and use the
general 'flowside' field for ping sockets.
Having done this we no longer need separate EPOLL_TYPE_ICMP and
EPOLL_TYPE_ICMPV6 reference types, because we can easily determine
which case we have from the flow type. Merge both types into
EPOLL_TYPE_PING.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Introduce ip.[ch] file to encapsulate IP protocol handling functions and
structures. Modify various files to include the new header ip.h when
it's needed.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-ID: <20240303135114.1023026-5-lvivier@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We have several places where we want to write(2) a buffer or buffers and we
handle short write()s by retrying until everything is successfully written.
Add a helper for this in util.c.
This version has some differences from the typical write_all() function.
First, take an IO vector rather than a single buffer, because that will be
useful for some of our cases. Second, allow it to take an parameter to
skip the first n bytes of the given buffers. This will be useful for some
of the cases we want, and also falls out quite naturally from the
implementation.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Minor formatting fixes in write_remainder()]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Sometimes we use sa_family_t for variables and parameters containing a
socket address family, other times we use a plain int. Since sa_family_t
is what's actually used in struct sockaddr and friends, standardise on
that.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Sufficiently recent cppcheck (I'm using 2.13.0) seems to have added another
warning for pointer variables which could be pointer to const but aren't.
Use this to make a bunch of variables const pointers where they previously
weren't for no particular reason.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
sock_l4() takes NULL for ifname if you don't want to bind the socket to a
particular interface. However, for a number of the callers, it's more
natural to use an empty string for that case. Change sock_l4() to accept
either NULL or an empty string equivalently, and simplify some callers
using that change.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
IPv4 addresses can be stored in an in_addr_t or a struct in_addr. The
former is just a type alias to a 32-bit integer, so doesn't really give us
any type checking. Therefore we generally prefer the structure, since we
mostly want to treat IP address as opaque objects. Fix a few places where
we still use in_addr_t, but can just as easily use struct in_addr.
Note there are still some uses of in_addr_t in conf.c, but those are
justified: since they're doing prefix calculations, they actually need to
look at the internals of the address as an integer.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The original commit message says:
---
Currently we initialise the address field of the sockaddrs we construct
to the any/unspecified address, but not in a very clear way: we use
explicit 0 values, which is only interpretable if you know the order of
fields in the sockaddr structures. Use explicit field names, and explicit
initialiser macros for the address.
Because we initialise to this default value, we don't need to explicitly
set the any/unspecified address later on if the caller didn't pass an
overriding bind address.
---
and the original patch modified the initialisation of addr4 and
addr6:
- instead of { 0 }, { 0 } for sin_addr and sin_zero,
.sin_addr = IN4ADDR_ANY_INIT
- instead of 0, IN6ADDR_ANY_INIT, 0:
.sin6_addr = IN6ADDR_ANY_INIT
but I dropped those hunks: they break gcc versions 7 to 9 as reported
in eed6933e6c ("udp: Explicitly initialise sin6_scope_id and
sin_zero in sockaddr_in{,6}").
I applied the rest of the changes.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Dropped first two hunks]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
When pasta periodically scans bound ports and binds them on the other
side in order to forward traffic, we bind UDP ports for corresponding
TCP port numbers, too, to support protocols and applications such as
iperf3 which use UDP port numbers matching the ones used by the TCP
data connection.
If we scan UDP ports in order to bind UDP ports, we skip detection of
the UDP ports we already bound ourselves, to avoid looping back our
own ports. Same with scanning and binding TCP ports.
But if we scan for TCP ports in order to bind UDP ports, we need to
skip bound TCP ports too, otherwise, as David pointed out:
- we find a bound TCP port on side A, and bind the corresponding TCP
and UDP ports on side B
- at the next periodic scan, we find that UDP port bound on side B,
and we bind the corresponding UDP port on side A
- at this point, we unbind that UDP port on side B: we would
otherwise loop back our own port.
To fix this, we need to avoid binding UDP ports that we already
bound, on the other side, as a consequence of finding a corresponding
bound TCP port.
Reproducing this issue is straightforward:
./pasta -- iperf3 -s
# Wait one second, then from another terminal:
iperf3 -c ::1 -u
Reported-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Analysed-by: David Gibson <david@gibson.dropbear.id.au>
Fixes: 457ff122e3 ("udp,pasta: Periodically scan for ports to automatically forward")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Most of our helpers which need to enter the pasta network namespace are
quite specialised. Add one which is rather general - it just open()s a
given file in the namespace context and returns the fd back to the main
namespace. This will have some future uses.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The implementation of scanning /proc files to do automatic port forwarding
is a bit awkwardly split between procfs_scan_listen() in util.c,
get_bound_ports() and related functions in conf.c and the initial setup
code in conf().
Consolidate all of this into port_fwd.h, which already has some related
definitions, and a new port_fwd.c.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Newer versions of cppcheck (as of 2.12.0, at least) added a warning for
pointers which could be declared to point at const data, but aren't.
Based on that, make many pointers throughout the codebase const.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
A classic gotcha of the standard C library is that its unwise to call any
variable 'index' because it will shadow the standard string library
function index(3). This can cause warnings from cppcheck amongst others,
and it also means that if the variable is removed you tend to get confusing
type errors (or sometimes nothing at all) instead of a nice simple "name is
not defined" error.
Strictly speaking this only occurs if <string.h> is included, but that
is so common that as a rule it's best to just avoid it always. We
have a number of places which hit this trap, so rename variables and
parameters to avoid it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
tcp_sock_handler() handles both listening TCP sockets, and connected TCP
sockets, but what it needs to do in those cases has essentially nothing in
common. Therefore, give listening sockets their own epoll_type value and
dispatch directly to their own handler from the top level. Furthermore,
the two handlers need essentially entirely different information from the
reference: we re-(ab)used the index field in the tcp_epoll_ref to indicate
the port for the listening socket, but that's not the same meaning. So,
switch listening sockets to their own reference type which we can lay out
as we please. That lets us remove the listen and outbound fields from the
normal (connected) tcp_epoll_ref, reducing it to just the connection table
index.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The epoll_ref type includes fields for the IP protocol of a socket, and the
socket fd. However, we already have a few things in the epoll which aren't
protocol sockets, and we may have more in future. Rename these fields to
an abstract "fd type" and file descriptor for more generality.
Similarly, rather than using existing IP protocol numbers for the type,
introduce our own number space. For now these just correspond to the
supported protocols, but we'll expand on that in future.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
ns_enter() returns an integer... but it's always zero. If we actually fail
the function doesn't return. Therefore it makes more sense for this to be
a function returning void, and we can remove the cases where we pointlessly
checked its return value.
In addition ns_enter() is usually called from an ephemeral thread created
by NS_CALL(). That means that the exit(EXIT_FAILURE) there usually won't
be reported (since NS_CALL() doesn't wait() for the thread). So, use die()
instead to print out some information in the unlikely event that our
setns() here does fail.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
union epoll_ref has a deeply nested set of structs and unions to let us
subdivide it into the various different fields we want. This means that
referencing elements can involve an awkward long string of intermediate
fields.
Using C11 anonymous structs and unions lets us do this less clumsily.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We'll need this in isolate_initial(). While at it, don't rely on
BUFSIZ: the earlier issue we had with musl reminded me it's not a
magic "everything will fit" value. Size the read buffer to what we
actually need from uid_map, and check for the final newline too,
because uid_map is organised in lines.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
In practical terms, passt doesn't benefit from the additional
protection offered by the AGPL over the GPL, because it's not
suitable to be executed over a computer network.
Further, restricting the distribution under the version 3 of the GPL
wouldn't provide any practical advantage either, as long as the passt
codebase is concerned, and might cause unnecessary compatibility
dilemmas.
Change licensing terms to the GNU General Public License Version 2,
or any later version, with written permission from all current and
past contributors, namely: myself, David Gibson, Laine Stump, Andrea
Bolognani, Paul Holzinger, Richard W.M. Jones, Chris Kuhn, Florian
Weimer, Giuseppe Scrivano, Stefan Hajnoczi, and Vasiliy Ulyanov.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
...starting from sock_l4(), pass negative error (errno) codes instead
of -1. They will only be used in two commits from now, no functional
changes intended here.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Roughly inspired from a patch by Chris Kuhn: fix up includes so that
we can build against musl: glibc is more lenient as headers generally
include a larger amount of other headers.
Compared to the original patch, I only included what was needed
directly in C files, instead of adding blanket includes in local
header files. It's a bit more involved, but more consistent with the
current (not ideal) situation.
Reported-by: Chris Kuhn <kuhnchris+github@kuhnchris.eu>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
ia64 needs to use __clone2() as clone() is not available, but glibc
doesn't export the prototype. Take it from clone(2) to avoid an
implicit declaration:
util.c: In function ‘do_clone’:
util.c:512:16: warning: implicit declaration of function ‘__clone2’ [-Wimplicit-function-declaration]
512 | return __clone2(fn, stack_area + stack_size / 2, stack_size / 2,
| ^~~~~~~~
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
There are some places in passt/pasta which #include <assert.h> and make
various assertions. If we hit these something has already gone wrong, but
they're there so that we a useful message instead of cryptic misbehaviour
if assumptions we thought were correct turn out not to be.
Except.. the glibc implementation of assert() uses syscalls that aren't in
our seccomp filter, so we'll get a SIGSYS before it actually prints the
message. Work around this by adding our own ASSERT() implementation using
our existing err() function to log the message, and an abort(). The
abort() probably also won't work exactly right with seccomp, but once we've
printed the message, dying with a SIGSYS works just as well as dying with
a SIGABRT.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
According to its doc comments, sock_l4() returns -1 on error. It does,
except in one case where it returns -EIO. Fix this inconsistency to match
the docs and always return -1.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently, when instructed to open an IPv6 socket, sock_l4() explicitly
sets the IPV6_V6ONLY socket option so that the socket will only respond to
IPv6 connections. Linux (and probably other platforms) allow "dual stack"
sockets: IPv6 sockets which can also accept IPv4 connections.
Extend sock_l4() to be able to make such sockets, by passing AF_UNSPEC as
the address family and no bind address (binding to a specific address would
defeat the purpose). We add a Makefile define 'DUAL_STACK_SOCKETS' to
indicate availability of this feature on the target platform.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Spotted in Debian's buildd logs: on ia64, clone(2) is not available:
the glibc wrapper is named __clone2() and it takes, additionally,
the size of the stack area passed by the caller.
Add a do_clone() wrapper handling the different cases, and also
taking care of pointing the child's stack in the middle of the
allocated area: on PA-RISC (hppa), handled by clone(), the stack
grows up, and on ia64 the stack grows down, but the register backing
store grows up -- and I think it might be actually used here.
Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
If stderr is closed, after we fork to background, glibc's
implementation of perror() will try to re-open it by calling dup(),
upon which the seccomp filter causes the process to terminate,
because dup() is not included in the list of allowed syscalls.
Replace perror() calls that might happen after isolation_postfork().
We could probably replace all of them, but early ones need a bit more
attention as we have to check whether log.c functions work in early
stages.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
In a few places we use the FWRITE() macro to open a file, replace it's
contents with a given string and close it again. There's no real
reason this needs to be a macro rather than just a function though.
Turn it into a function 'write_file()' and make some ancillary
cleanups while we're there:
- Add a return code so the caller can handle giving a useful error message
- Handle the case of short write()s (unlikely, but possible)
- Add O_TRUNC, to make sure we replace the existing contents entirely
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Since kernel version 5.7, commit c427bfec18f2 ("net: core: enable
SO_BINDTODEVICE for non-root users"), we can bind sockets to
interfaces, if they haven't been bound yet (as in bind()).
Introduce an optional interface specification for forwarded ports,
prefixed by %, that can be passed together with an address.
Reported use case: running local services that use ports we want
to have externally forwarded:
https://github.com/containers/podman/issues/14425
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Coverity now noticed we're checking most lseek() return values, but
not this. Not really relevant, but it doesn't hurt to check we can
actually seek before reading lines.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Logging to file is going to add some further complexity that we don't
want to squeeze into util.c.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
clang-tidy complains that we're not checking the result of vfprintf in
logfn(). There's not really anything we can do if this fails here, so just
suppress the error with a cast to void.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
passt/pasta can interact with user namespaces in a number of ways:
1) With --netns-only we'll remain in our original user namespace
2) With --userns or a PID option to pasta we'll join either the given
user namespace or that of the PID
3) When pasta spawns a shell or command we'll start a new user namespace
for the command and then join it
4) With passt we'll create a new user namespace when we sandbox()
ourself
However (3) and (4) turn out to have essentially the same effect. In both
cases we create one new user namespace. The spawned command starts there,
and passt/pasta itself will live there from sandbox() onwards.
Because of this, we can simplify user namespace handling by moving the
userns handling earlier, to the same point we drop root in the original
namespace. Extend the drop_user() function to isolate_user() which does
both.
After switching UID and GID in the original userns, isolate_user() will
either join or create the userns we require. When we spawn a command with
pasta_start_ns()/pasta_setup_ns() we no longer need to create a userns,
because we're already made one. sandbox() likewise no longer needs to
create (or join) an userns because we're already in the one we need.
We no longer need c->pasta_userns_fd, since the fd is only used locally
in isolate_user(). Likewise we can replace c->netns_only with a local
in conf(), since it's not used outside there.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
passt/pasta contains a number of routines designed to isolate passt from
the rest of the system for security. These are spread through util.c and
passt.c. Move them together into a new isolation.c file.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Currently the logic to work out what UID and GID we will run as is spread
across conf(). If --runas is specified it's handled in conf_runas(),
otherwise it's handled by check_root(), which depends on initialization of
the uid and gid variables by either conf() itself or conf_runas().
Make this clearer by putting all the UID and GID logic into a single
conf_ugid() function.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
check_root() both checks to see if we are root (in the init namespace),
and if we are drops to an unprivileged user. To make future cleanups
simpler, split the checking for root (now in check_root()) from the actual
dropping of privilege (now in drop_root()).
Note that this does slightly alter semantics. Previously we would only
setuid() if we were originally root (in the init namespace). Now we will
always setuid() and setgid(), though it won't actually change anything if
we weren't privileged to begin with. This also means that we will now
always attempt to switch to the user specified with --runas, even if we
aren't (init namespace) root to begin with. Obviously this will fail with
an error if we weren't privileged to start with. --help and the man page
are updated accordingly.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
c->uid and c->gid are first set in conf(), and last used in check_root()
itself called from conf(). Therefore these don't need to be fields in the
long lived context structure and can instead be locals in conf().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Commit a951e0b9ef ("conf: Add --runas option, changing to given UID
and GID if started as root") dropped the call to initgroups() that
used to add supplementary groups corresponding to the user we'll
eventually run as -- we don't need those.
However, if the original user belongs to supplementary groups
(usually not the case, if started as root), we don't drop those,
now, and rpmlint says:
passt.x86_64: E: missing-call-to-setgroups-before-setuid /usr/bin/passt
passt.x86_64: E: missing-call-to-setgroups-before-setuid /usr/bin/passt.avx2
Add a call to setgroups() with an empty set, to drop any
supplementary group we might currently have, before changing GID
and UID.
Reported-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
The context structure contains a batch of fields specific to IPv4 and to
IPv6 connectivity. Split those out into a sub-structure.
This allows the conf_ip4() and conf_ip6() functions, which take the
entire context but touch very little of it, to be given more specific
parameters, making it clearer what it affects without stepping through the
code.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
It's quite plausible for a host to have both IPv4 and IPv6 connectivity,
but only via different interfaces. For example, this will happen in the
case that IPv6 connectivity is via a tunnel (e.g. 6in4 or 6rd). It would
also happen in the case that IPv4 access is via a tunnel on an otherwise
IPv6 only local network, which is a setup that might become more common in
the post IPv4 address exhaustion world.
In turns out there's no real need for passt/pasta to get its IPv4 and IPv6
connectivity via the same interface, so we can handle this situation fairly
easily. Change the core to allow eparate external interfaces for IPv4 and
IPv6. We don't actually set these separately for now.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
On some systems, user and group "nobody" might not be available. The
new --runas option allows to override the default "nobody" choice if
started as root.
Now that we allow this, drop the initgroups() call that was used to
add any additional groups for the given user, as that might now
grant unnecessarily broad permissions. For instance, several
distributions have a "kvm" group to allow regular user access to
/dev/kvm, and we don't need that in passt or pasta.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>