c->mac isn't a great name, because it doesn't say whose mac address it is
and it's not necessarily obvious in all the contexts we use it. Since this
is specifically the address that we (passt/pasta) use on the tap interface,
rename it to "our_tap_mac". Rename the "mac_guest" field to "guest_mac"
to be grammatically consistent.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
There are a couple of places where we somewhat messily open-code the
formatting of an Ethernet-like MAC address for display. Add an eth_ntop()
helper for this.
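
As a rough sketch of what such a helper could look like (the signature
and the ETH_ADDRSTRLEN constant are assumptions made for illustration,
not necessarily what the patch adds):

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumed for this sketch: room for "xx:xx:xx:xx:xx:xx" plus the NUL */
#define ETH_ADDRSTRLEN	(sizeof("00:11:22:33:44:55"))

/* Format a 6-byte MAC address into dst, inet_ntop()-style.
 * Returns dst on success, NULL if the buffer is too small.
 */
const char *eth_ntop(const unsigned char *mac, char *dst, size_t size)
{
	int len;

	assert(mac && dst);

	len = snprintf(dst, size, "%02x:%02x:%02x:%02x:%02x:%02x",
		       mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
	if (len < 0 || (size_t)len >= size)
		return NULL;

	return dst;
}
```

Returning the destination buffer lets callers use the helper directly
as a printf()-like argument.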
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The term "forwarding address" to indicate the local-to-passt address was
well-intentioned, but ends up being kinda confusing. As discussed on a
recent call, let's try "our" instead.
(While we're there, correct an error in flow_initiate_af()'s comments,
where we referred to parameters by the wrong name.)
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
As soon as the kernel notifier for IPv6 address configuration
(addrconf_notify()) sees that we bring the target interface up
(NETDEV_UP), it will schedule duplicate address detection, so, by
itself, setting the nodad flag later is useless, because that won't
stop a detection that's already in progress.
However, if we disable neighbour solicitations with IFF_NOARP (which
is a misnomer for IPv6 interfaces, but there's no possibility of
mixing things up), the notifier will not trigger DAD, because it can't
be done, of course, without neighbour solicitations.
Set IFF_NOARP as we bring up the device, and drop it after we had a
chance to set the nodad attribute on the link.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
As soon as we bring up the interface, the Linux kernel will set up a
link-local address for it, so we can fetch it and start using it right
away, if we need a link-local address to communicate with the container
before we see any traffic coming from it.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
It makes no sense for a container or a guest to try and perform
duplicate address detection for their link-local address, as we'll
anyway not relay neighbour solicitations with an unspecified source
address.
While they perform duplicate address detection, the link-local address
is not usable, which prevents us, especially with containers, from
bringing them up and communicating with them right away via IPv6.
This is not enough to prevent DAD and reach the container right away:
we'll need a couple more patches.
As we send NLM_F_REPLACE requests right away, while we still have to
read out other addresses on the same socket, we can't use nl_do():
keep track of the last sequence we sent (last address we changed), and
deal with the answers to those NLM_F_REPLACE requests in a separate
loop, later.
Link: https://github.com/containers/podman/pull/23561#discussion_r1711639663
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
In the next patches, we'll reuse it to set flags other than IFF_UP.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
As we'll use nl_link_up() for more than just bringing up devices, it
will become awkward to carry empty MTU values around whenever we call
it.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
We have a number of delays when we switch to new layouts that were
added to make the tests visually easier to follow, together with
blinking status bars. Shorten the delays and avoid blinking the
status bar if $FAST is set to 1 (no demo mode).
Shorten delays in busy loops to 10ms, instead of 100ms, and skip the
one-second fixed delay when we wait for the status of a command.
Cut the duration of throughput and latency tests to one second, down
from ten. Somewhat surprisingly, the results we get are rather
consistent, and not significantly different from what we'd get with
10 seconds.
This, together with Podman's commit 20f3e8909e3a ("test/system:
pasta_test_do add explicit port check"), cuts the time needed on my
setup for a full test run from approximately 37 minutes to...:
$ time ./run
[exited]
PASS: 165, FAIL: 0
Log at /home/sbrivio/passt/test/test_logs/test.log
real 15m34.253s
user 0m0.011s
sys 0m0.011s
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: David Gibson <david@gibson.dropbear.id.au>
Using a zero port on TCP or UDP is dubious, and we can't really deal with
forwarding such a flow within the constraints of the socket API. Hence
we ASSERT()ed that we had non-zero ports in flow_hash().
The intention was to make sure that the protocol code sanitizes such ports
before completing a flow entry. Unfortunately, flow_hash() is also called
on new packets to see if they have an existing flow, so the unsanitized
guest packet can crash passt with the assert.
Correct this by moving the assert from flow_hash() to flow_sidx_hash()
which is only used on entries already in the table, not on unsanitized
data.
Reported-by: Matt Hamilton <matt@thmail.io>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
f6d5a5239264 moved handling of -D into a later loop. However as a side
effect it moved this from a switch block to an if block. I left a couple
of 'break' statements that don't make sense in the new context. They
should be 'continue' so that we go onto the next option, rather than
leaving the loop entirely.
Fixes: f6d5a5239264 ("conf: Delay handling -D option until after addresses are configured")
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
- Add structs for NA, RA, NS, MTU, prefix info, option header,
link-layer address, RDNSS, DNSSL and link-layer for RA message.
- Turn NA message from purely imperative, going byte by byte,
to declarative by filling its struct.
- Turn part of RA message into declarative.
- Move packet_add() to be before the call of ndp() in tap6_handler()
if the protocol of the packet is ICMPv6.
- Add a pool of packets as an additional parameter to ndp().
- Check the size of NS packet with packet_get() before sending an NA
packet.
- Add documentation for the structs.
- Add an enum for NDP option types.
Link: https://bugs.passt.top/show_bug.cgi?id=21
Signed-off-by: AbdAlRahman Gad <abdobngad@gmail.com>
[sbrivio: Minor coding style fixes]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
add_dns[46]() rely on the gateway address and c->no_map_gw being already
initialised, in order to properly handle DNS servers which need NAT to be
accessed from the guest.
Usually these are called from get_dns() which is well after the addresses
are configured, so that's fine. However, they can also be called earlier
if an explicit -D command line option is given. In this case no_map_gw
and/or c->ip[46].gw may not yet be initialised properly, leading to this
doing the wrong thing.
Luckily we already have a second pass of option parsing for things which
need addresses to already be configured. Move handling of -D to there.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
These fields are described as being an address for an external, routable
interface. That's not necessarily the case when using -a. But, more
importantly, saying where the value comes from is not as useful as what
it's used for. The real purpose of this field is as the address which we
assign to the guest via DHCP or --config-net.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
If we prefix the second part of messages printed through
logmsg_perror() by the timestamp, on debug, we'll have two timestamps
and a weird separator in the result, such as this beauty:
0.0013: Failed to clone process with detached namespaces0.0013: : Operation not permitted
Add a parameter to logmsg() and vlogmsg() which indicates a message
continuation. If that's set, don't print the timestamp in vlogmsg().
Link: https://github.com/moby/moby/issues/48257#issuecomment-2282875092
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Given that pasta supports specifying a command to be executed on the
command line, even without the usual -- separator as long as there's
no ambiguity, we shouldn't eat up options that are not meant for us.
Paul reports, for instance, that with:
pasta --config-net ip -6 route
-6 is taken by pasta to mean --ipv6-only, and we execute 'ip route'.
That's because getopt_long(), by default, shuffles the argument list
to shift non-option arguments at the end.
Avoid that by adding '+' at the beginning of 'optstring'.
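
A minimal sketch of the effect, using a made-up two-entry option table
rather than the real one in conf() (parse_own_options() and the table
are hypothetical):

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <getopt.h>
#include <stddef.h>

/* Hypothetical subset of options, for illustration only */
static const struct option long_opts[] = {
	{ "config-net",	no_argument,	NULL,	1 },
	{ "ipv6-only",	no_argument,	NULL,	'6' },
	{ NULL,		0,		NULL,	0 },
};

/* Parse our own options only. Returns the index of the first argument
 * that isn't ours, that is, where the command to execute starts.
 */
int parse_own_options(int argc, char **argv, int *ipv6_only)
{
	int opt;

	assert(argc > 0 && argv && ipv6_only);

	*ipv6_only = 0;
	optind = 0;	/* restart scanning (accepted by glibc and musl) */
	opterr = 0;

	/* The leading '+' stops parsing at the first non-option argument
	 * instead of permuting argv[] to find further options.
	 */
	while ((opt = getopt_long(argc, argv, "+6", long_opts, NULL)) != -1) {
		if (opt == '6')
			*ipv6_only = 1;
	}

	return optind;
}
```

With the argument list from the report, parsing stops at 'ip', and -6
is left for the command to interpret.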
Reported-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
If a parent accidentally or due to implementation reasons leaks any
open file, we don't want to have access to them, except for the file
passed via --fd, if any.
This is the case for Podman when Podman's parent leaks files into
Podman: it's not practical for Podman to close unrelated files before
starting pasta, as reported by Paul.
Use close_range(2) to close all open files except for standard streams
and the one from --fd.
Given that parts of conf() depend on other files to be already opened,
such as the epoll file descriptor, we can't easily defer this to a
more convenient point, where --fd was already parsed. Introduce a
minimal, duplicate version of --fd parsing to keep this simple.
As we need to check that the passed --fd option doesn't exceed
INT_MAX, because we'll parse it with strtol() but file descriptor
indices are signed ints (regardless of the arguments close_range()
takes), extend the existing check in the actual --fd parsing in conf(),
also rejecting file descriptor numbers that match standard streams,
while at it.
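
The checks and the resulting ranges can be sketched like this. Both
helpers and struct fd_range are hypothetical names; each computed range
would then be passed to close_range(first, last, 0):

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <unistd.h>

struct fd_range {
	unsigned int first, last;
};

/* Parse a --fd argument. Returns the descriptor, or -1 if it's not a
 * number, matches a standard stream, or doesn't fit in an int.
 */
int parse_fd_arg(const char *arg)
{
	char *end;
	long fd;

	assert(arg);

	errno = 0;
	fd = strtol(arg, &end, 0);
	if (errno || end == arg || *end)
		return -1;
	if (fd <= STDERR_FILENO || fd > INT_MAX)
		return -1;

	return (int)fd;
}

/* Fill r[] with the descriptor ranges to close: everything above
 * stderr, except keep_fd (-1 for none). Returns the number of ranges.
 */
int close_ranges(int keep_fd, struct fd_range r[2])
{
	unsigned int first = STDERR_FILENO + 1;
	int n = 0;

	if (keep_fd < (int)first) {
		r[0] = (struct fd_range){ first, ~0U };
		return 1;
	}

	if ((unsigned int)keep_fd > first)
		r[n++] = (struct fd_range){ first, (unsigned int)keep_fd - 1 };
	r[n++] = (struct fd_range){ (unsigned int)keep_fd + 1, ~0U };

	return n;
}
```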
Suggested-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Paul Holzinger <pholzing@redhat.com>
Particularly in shell it's sometimes natural to save the pid from a process
run and later kill it. If doing this with nstool exec, however, it will
kill nstool itself, not the program it is running, which isn't usually what
you want or expect.
Address this by having nstool propagate SIGTERM to its child process. It
may make sense to propagate some other signals, but some introduce extra
complications, so we'll worry about them when and if it seems useful.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We use logtime() to get a timestamp for the log in two places:
- in vlogmsg(), which is used only for debug_print messages
- in logfile_write(), which is only used for messages to the log file
These cases are mutually exclusive, so we don't ever print the same message
with different timestamps, but that's not particularly obvious to see.
It's possible future tweaks to logging logic could mean we log to two
different places with different timestamps, which would be confusing.
Refactor to have a single logtime() call in vlogmsg() and use it for all
the places we need it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
clock_gettime() can, theoretically, fail, although it probably won't until
2038 on old 32-bit systems. Still, it's possible someone could run with
a wildly out of sync clock, or new errors could be added, or it could fail
due to a bug in libc or the kernel.
We don't handle this well. In the debug_print case in vlogmsg we'll just
ignore the failure, and print a timestamp based on uninitialised garbage.
In logfile_write() we exit early and won't log anything at all, which seems
like a good way to make an already weird situation undebuggable.
Add some helpers to instead handle this by using "<error>" in place of a
timestamp if something goes wrong with clock_gettime().
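
Roughly, the idea is the following. This is a simplified sketch that
formats the absolute clock value rather than a difference from the
start time, and the helper names and the choice of clock are
assumptions:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Read the clock. Returns ts on success, NULL on failure, so that
 * callers can't use an uninitialised value by accident.
 */
const struct timespec *logtime(struct timespec *ts)
{
	assert(ts);

	if (clock_gettime(CLOCK_MONOTONIC, ts))
		return NULL;

	return ts;
}

/* Format a timestamp, or a placeholder if we couldn't get one */
int logtime_fmt(char *buf, size_t size, const struct timespec *ts)
{
	if (!ts)
		return snprintf(buf, size, "<error>");

	return snprintf(buf, size, "%lld.%04ld", (long long)ts->tv_sec,
			(long)(ts->tv_nsec / 100000));
}
```

A caller would chain the two: logtime_fmt(buf, sizeof(buf), logtime(&ts)).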
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
logtime_fmt_and_arg() is a rather odd macro, producing both a format
string and an argument, which can only be used in quite specific printf()
like formulations. It also has a significant bug: it tries to display 4
digits after the decimal point (so down to tenths of milliseconds) using
%04i. But the field width in printf() is always a *minimum* not maximum
field width, so this will not truncate the given value, but will redisplay
the entire tenth-of-milliseconds difference again after the decimal point.
Replace the macro with an snprintf() like function which will format the
timestamp, and use an explicit % to correct the display.
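
To illustrate the arithmetic (fmt_elapsed() is a hypothetical name, and
the input is assumed to be a non-negative difference in microseconds):

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Format an elapsed time in microseconds as seconds, with four digits
 * after the decimal point.
 */
int fmt_elapsed(char *buf, size_t size, int64_t delta_us)
{
	int64_t tenths_ms = delta_us / 100;	/* units of 100µs */

	assert(buf && delta_us >= 0);

	/* %04 only pads to at least four digits. Without the explicit
	 * modulo, 1234567µs would be displayed as "1.12345".
	 */
	return snprintf(buf, size, "%" PRIi64 ".%04" PRIi64,
			tenths_ms / 10000, tenths_ms % 10000);
}
```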
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Make logtime_fmt() static]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The comment for timespec_diff_us() claims it will wrap after 2^64µs. This
is incorrect for two reasons:
* It returns a long long, which is probably 64-bits, but might not be
* It returns a signed value, so even if it is 64 bits it will wrap after
2^63µs
Correct the comment and use an explicitly 64-bit type to avoid that
imprecision.
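
A sketch of the function with an explicit 64-bit return type (the body
here is illustrative, and assumes normalised timespec values):

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Difference a - b in microseconds. The result is signed, so it wraps
 * after 2^63µs.
 */
int64_t timespec_diff_us(const struct timespec *a, const struct timespec *b)
{
	assert(a && b);

	if (a->tv_nsec < b->tv_nsec) {
		/* Borrow one second for the nanoseconds part */
		return (a->tv_nsec + 1000000000 - b->tv_nsec) / 1000 +
		       ((int64_t)a->tv_sec - b->tv_sec - 1) * 1000000;
	}

	return (a->tv_nsec - b->tv_nsec) / 1000 +
	       ((int64_t)a->tv_sec - b->tv_sec) * 1000000;
}
```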
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Paul reports that setting IPv4 address and gateway manually, using
--address and --gateway, causes pasta to fail inserting IPv6 routes
in a setup where multiple, inter-dependent IPv6 routes are present
on the host.
That's because, currently, any -g option implies --no-copy-routes
altogether, and any -a implies --no-copy-addrs.
Limit this implication to the matching IP version, instead, by having
two copies of no_copy_routes and no_copy_addrs in the context
structure, separately for IPv4 and IPv6.
While at it, change them to 'bool': we had them as 'int' because
getopt_long() used to set them directly, but it hasn't been the case
for a while already.
Reported-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
There are two cases where we want to stop printing to stderr: if it's
closed, and if pasta spawned a shell (and --debug wasn't given).
But if passt is running in foreground, we currently stop reporting any
message, even error messages, once we're ready, as reported by
Laurent, because we set the log_runtime flag, which we use to indicate
we're ready, regardless of whether we're running in foreground or not.
Turn that flag (back) to log_stderr, and set it only when we really
want to stop printing to stderr.
Reported-by: Laurent Vivier <lvivier@redhat.com>
Fixes: afd9cdc9bb48 ("log, passt: Always print to stderr before initialisation is complete")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
If the "from" (input) side for a given transfer is 0, and we can't
complete the write right away, what we need to be waiting for is for
output readiness on side 1, not 0, and the other way around as well.
This causes random transfer failures for local TCP connections,
depending on whether we ever need to wait for output readiness.
Reported-by: Paul Holzinger <pholzing@redhat.com>
Link: https://github.com/containers/podman/issues/23517
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Tested-by: Paul Holzinger <pholzing@redhat.com>
The "correct" type for the length of an IOV is unclear: writev() and
readv() use an int, but sendmsg() and recvmsg() use a size_t. Using the
unsigned size_t has some advantages, though, and it makes more sense for
the case of write_remainder. Using size_t throughout here means we don't
have a signed vs. unsigned comparison, and we don't have to deal with
the case of iov_skip_bytes() returning a value which becomes negative
when assigned to an integer.
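
For instance, a skip helper along these lines (a sketch, and the
signature is an assumption) never needs a signed value, because "not
found" is simply an index equal to the number of entries:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Find the IO vector entry containing byte number 'skip'.
 * Returns the index of that entry, or n if skip is at or past the end.
 * On return, *offset is the remaining offset within that entry.
 */
size_t iov_skip_bytes(const struct iovec *iov, size_t n,
		      size_t skip, size_t *offset)
{
	size_t off = skip, i;

	assert(iov || !n);

	for (i = 0; i < n; i++) {
		if (off < iov[i].iov_len)
			break;
		off -= iov[i].iov_len;
	}

	if (offset)
		*offset = off;

	return i;
}
```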
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
No code change.
They need to be exported to be available to the vhost-user version of
passt.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
To be used with the vhost-user version of udp.c, we need to export the
udp_flow functions. To avoid also exporting udp_meta_t, which is specific
to the socket version of udp.c, don't pass udp_meta_t to it, but only
the needed field, s_in.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
logfile_write() is not used outside log.c, nor should it be. It should
only be used externally via the general logging functions. Make it static
in log.c. To avoid forward declarations this requires moving a bunch of
functions earlier in the file.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Ed reported this:
# Error: pasta failed with exit code 1:
# Couldn't drop cap 3 from bounding set
# : No child processes
in a Podman CI run with tests being run in parallel. The error message
itself, by the way, is fixed by commit 1cd773081f12 ("log: Drop
newlines in the middle of the perror()-like messages"), but how can we
possibly get ECHILD as failure code for prctl()?
Well, we don't, but if we exit early enough, pasta_child_handler()
might run before we're even done with isolation steps, and it calls
waitid(), which sets errno. We need to restore it before returning
from the signal handler (if we return after calling functions that
might set it), as signal-safety(7) also implies:
Fetching and setting the value of errno is async-signal-safe
provided that the signal handler saves errno on entry and
restores its value before returning.
Eventually, we'll probably need to switch to signalfd(2) the day we
want to implement multithreading, but this will do for the moment.
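
The pattern, as a minimal sketch (child_handler() is a stand-in here,
not the actual pasta_child_handler()):

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/wait.h>

/* SIGCHLD handler: reap exited children, leaving errno untouched for
 * whatever code we interrupted.
 */
void child_handler(int sig)
{
	int errno_save = errno;
	siginfo_t infop;

	if (sig != SIGCHLD)
		return;

	/* waitid() sets errno to ECHILD once there's nothing to wait for */
	for (;;) {
		infop.si_pid = 0;
		if (waitid(P_ALL, 0, &infop, WEXITED | WNOHANG) ||
		    !infop.si_pid)
			break;
	}

	errno = errno_save;
}
```

Without the final assignment, a prctl() failing right before the signal
is delivered could be reported with ECHILD instead of its own error.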
Reported-by: Ed Santiago <santiago@redhat.com>
Link: https://github.com/containers/podman/issues/23478
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Paul Holzinger <pholzing@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
When invoking pasta without any arguments, it's difficult
to tell whether we are in the new namespace or not, leaving
users a bit confused. This change modifies the hostname in
the new namespace, adding a "pasta-" prefix, to make it a bit
more obvious.
Signed-off-by: Danish Prakash <contact@danishpraka.sh>
[sbrivio: coding style fixes]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Because the Unix socket to qemu is a stream socket, we have no guarantee
of where the boundaries between recv() calls will lie. Typically they
will lie on frame boundaries, because that's how qemu will send them, but
we can't rely on it.
Currently we handle this case by detecting when we have received a partial
frame and performing a blocking recv() to get the remainder, and only then
processing the frames. Change it so instead we save the partial frame
persistently and include it as the first thing processed next time we
receive data from the socket. This handles a number of (unlikely) cases
which previously would not be dealt with correctly:
* If qemu sent a partial frame then waited some time before sending the
remainder, previously we could block here for an unacceptably long time
* If qemu sent a tiny partial frame (< 4 bytes) we'd leave the loop without
doing the partial frame handling, which would put us out of sync with
the stream from qemu
* If the blocking recv() only received some of the remainder of the
frame, not all of it, we'd return leaving us out of sync with the
stream again
Caveat: This could memmove() a moderate amount of data (ETH_MAX_MTU). This
is probably acceptable because it's an unlikely case in practice. If
necessary we could mitigate this by using a true ring buffer.
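
The approach can be sketched as follows, with a memcpy() standing in
for recv() and with made-up names and buffer sizes (struct frame_stream,
stream_feed(), FRAME_MAX and STREAM_BUF are all hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

#define FRAME_MAX	65535			/* assumed upper bound */
#define STREAM_BUF	(2 * (4 + FRAME_MAX))	/* assumed buffer size */

struct frame_stream {
	size_t partial;		/* bytes of an incomplete frame at buf[0] */
	size_t payload_bytes;	/* total payload of complete frames seen */
	unsigned char buf[STREAM_BUF];
};

/* Take n new bytes, as if recv() had written them right after the saved
 * partial frame. Returns the number of frames completed, -1 on error.
 */
int stream_feed(struct frame_stream *s, const void *data, size_t n)
{
	unsigned char *p, *end;
	int count = 0;

	assert(s && data);

	if (n > sizeof(s->buf) - s->partial)
		return -1;
	memcpy(s->buf + s->partial, data, n);

	p = s->buf;
	end = s->buf + s->partial + n;

	while ((size_t)(end - p) >= sizeof(uint32_t)) {
		uint32_t l2len;

		memcpy(&l2len, p, sizeof(l2len));	/* may be unaligned */
		l2len = ntohl(l2len);
		if (l2len > FRAME_MAX)
			return -1;			/* out of sync */
		if ((size_t)(end - p) < sizeof(l2len) + l2len)
			break;				/* incomplete frame */

		s->payload_bytes += l2len;		/* "process" it */
		p += sizeof(l2len) + l2len;
		count++;
	}

	/* Keep the incomplete tail as the start of the next input */
	s->partial = end - p;
	memmove(s->buf, p, s->partial);

	return count;
}
```

Note that a partial length header (fewer than 4 bytes) is carried over
in the same way as a partial frame body.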
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
The Qemu socket protocol consists of a 32-bit frame length in network (BE)
order, followed by the Ethernet frame itself. As far as I can tell,
frames can be any length, with no particular alignment requirement. This
means that although pkt_buf itself is aligned, if we have a frame of odd
length, frames after it will have their frame length at an unaligned
address.
Currently we load the frame length by just casting a char pointer to
(uint32_t *) and loading. Some platforms will generate a fatal trap on
such an unaligned load. Even if they don't, casting an incorrectly aligned
pointer to (uint32_t *) is undefined behaviour, strictly speaking.
Introduce a new helper to safely load a possibly unaligned value here. We
assume that the compiler is smart enough to optimize this into nothing on
platforms that provide performant unaligned loads. If that turns out not
to be the case, we can look at improvements then.
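
Such a helper is typically a memcpy() into a properly aligned local
variable. A sketch, with hypothetical names:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* Load a 32-bit value from a possibly unaligned address */
static inline uint32_t load_u32_unaligned(const void *p)
{
	uint32_t val;

	memcpy(&val, p, sizeof(val));

	return val;
}

/* Read a frame length, in network order, at any offset in the buffer */
uint32_t frame_len_at(const void *p)
{
	assert(p);

	return ntohl(load_u32_unaligned(p));
}
```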
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently we set EPOLLET (edge trigger) on the epoll flags for the
connected Qemu Unix socket. It's not clear that there's a reason for
doing this: for TCP sockets we need to use EPOLLET, because we leave data
in the socket buffers for our flow control handling. That consideration
doesn't apply to the way we handle the qemu socket however.
Furthermore, using EPOLLET causes additional complications:
1) We don't set EPOLLET when opening /dev/net/tun for pasta mode, however
we *do* set it when using pasta mode with --fd. This inconsistency
doesn't seem to have broken anything, but it's odd.
2) EPOLLET requires that tap_handler_passt() loop until all data available
is read (otherwise we may have data in the buffer but never get an event
causing us to read it). We do that with a rather ugly goto.
Worse, our condition for that goto appears to be incorrect. We'll only
loop if rem is non-zero, which will only happen if we perform a blocking
recv() for a partially received frame. We'll only perform that second
recv() if the original recv() resulted in a partially read frame. As
far as I can tell the original recv() could end on a frame boundary
(never triggering the second recv()) even if there is additional data in
the socket buffer. In that circumstance we wouldn't goto redo and could
leave unprocessed frames in the qemu socket buffer indefinitely.
This doesn't seem to have caused any problems in practice, but since
there's no obvious reason to use EPOLLET here anyway, we might as well
get rid of it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
If we receive a too-short or too-long frame from the QEMU socket, currently
we try to skip it and carry on. That sounds sensible on first blush, but
probably isn't wise in practice. If this happens, either (a) qemu has done
something seriously unexpected, or (b) we've received corrupt data over a
Unix socket. Or more likely (c), we have a bug elsewhere which has put us
out of sync with the stream, so we're trying to read something that's not a
frame length as a frame length.
Neither (b) nor (c) is really salvageable with the same stream. Case (a)
might be ok, but we can no longer be confident qemu won't do something else
we can't cope with.
So, instead of just skipping the frame and trying to carry on, log an error
and close the socket. As a bonus, establishing firm bounds on l2len early
will allow simplifications to how we deal with the case where a partial
frame is recv()ed.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Change error message: it's not necessarily QEMU, and mention
that we are resetting the connection]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
If we get an error on recv() from the QEMU socket, we currently don't
print any kind of error. Although this can happen in a non-fatal situation
such as a guest restarting, it's unusual enough that we really should report
something for debuggability.
Add an error message in this case. Also always report when the qemu
connection closes for any reason, not just when it will cause us to exit
(--one-off).
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[sbrivio: Change error message: it's not necessarily QEMU, and mention
that we are resetting the connection]
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
We report relative timestamps in logs, so we want to avoid jumps in
the system time.
Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
...not just for debug messages. Otherwise, timestamps in the log file
are consistent but the starting point is not zero.
Do this right away as we enter main(), so that the resulting
timestamps are as closely as possible relative to when we start.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
For some reason, in commit 01efc71ddd25 ("log, conf: Add support for
logging to file"), I added calculations for relative logging
timestamps using the difference for the seconds part only, not
accounting for the fractional part.
Fix that by storing the initial timestamp, log_start, as a timespec
struct, and by calculating the difference from the starting time. Do
this in a macro as we need the same format in a few places.
To calculate the difference, turn the existing timespec_diff_ms() to
microseconds, timespec_diff_us(), and rewrite timespec_diff_ms() to
use that.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
systemd-resolved has the rather strange behaviour of listening on the
non-standard loopback address 127.0.0.53. Various changes we've made in
passt mean that we now usually work fine on a host using systemd-resolved.
However our tests still fail in this case. We have a special case for when
the guest's resolv.conf needs to differ from the host's because the
resolver is on a host loopback address. However, we only consider the case
where the host resolver is on 127.0.0.1, not other loopback addresses.
Correct this with a different test condition.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
passt/pasta has options to redirect DNS requests from the guest to a
different server address on the host side. Currently, however, only UDP
packets to port 53 are considered "DNS requests". This ignores DNS
requests over TCP - less common, but certainly possible. It also ignores
encrypted DNS requests on port 853.
Extend the DNS forwarding logic to handle both of those cases.
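
The resulting classification could look like this sketch. It assumes
that both ports are matched on either protocol, which might be broader
than what the patch actually does:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <netinet/in.h>

/* Does a flow with this protocol and destination port look like DNS? */
bool is_dns_flow(uint8_t proto, in_port_t dport)
{
	return (proto == IPPROTO_UDP || proto == IPPROTO_TCP) &&
	       (dport == 53 || dport == 853);
}
```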
Link: https://github.com/containers/podman/issues/23239
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Currently, we start by handling the common case, where we don't translate
the destination address, then we modify the tgt side for the special cases.
In the process we do comparisons on the tentatively set fields in tgt,
which obscures the fact that tgt should be an essentially pure function of
ini, and risks people examining fields of tgt that are not yet initialized.
To make this clearer, do all our tests on 'ini', constructing tgt from
scratch on that basis.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Even though we don't use : as delimiter for the port, making square
brackets unneeded, RFC 3986, section 3.2.2, mandates them for IPv6
literals. We want IPv6 addresses there, but some users might still
specify them out of habit.
Same for IPv4 addresses: RFC 3986 doesn't specify square brackets for
IPv4 literals, but I had reports of users actually trying to use them
(they're accepted by many tools).
Allow square brackets for both IPv4 and IPv6 addresses, correct or
not, they're harmless anyway.
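
In terms of parsing, this amounts to something like the following
sketch (parse_addr_brackets() is a hypothetical helper, not the code in
conf.c):

```c
#include <assert.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Parse an IPv4 or IPv6 address, optionally enclosed in square brackets.
 * dst must have room for a struct in6_addr.
 * Returns AF_INET, AF_INET6, or -1 if the address is invalid.
 */
int parse_addr_brackets(const char *str, void *dst)
{
	char buf[INET6_ADDRSTRLEN];
	size_t len;

	assert(str && dst);

	len = strlen(str);
	if (len >= 2 && str[0] == '[' && str[len - 1] == ']') {
		str++;		/* skip '[' */
		len -= 2;	/* and drop ']' */
	}

	if (len >= sizeof(buf))
		return -1;
	memcpy(buf, str, len);
	buf[len] = '\0';

	if (inet_pton(AF_INET, buf, dst) == 1)
		return AF_INET;
	if (inet_pton(AF_INET6, buf, dst) == 1)
		return AF_INET6;

	return -1;
}
```

Unbalanced brackets are not stripped, so they still fail address
parsing as before.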
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
In tap_sock_unix_open(), if we have a given path for the socket from
configuration, we don't need to loop over possible paths, so we exit
the loop on the first iteration, unconditionally.
But if we failed to bind() the socket to that explicit path, we should
exit, instead of continuing. Otherwise we'll pretend we're up and
running, but nobody can contact us, and this might be mildly confusing
for users.
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2299474
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>