passt

mirror of https://passt.top/passt synced 2024-12-31 18:15:23 +00:00

Author	SHA1	Message	Date
Laurent Vivier	377b666dc9	udp: rename udp_sock_handler() to udp_buf_sock_handler() We are going to introduce a variant of the function to use vhost-user buffers rather than passt internal buffers. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-13 15:45:34 +02:00
Laurent Vivier	e7ac995217	udp: refactor UDP header update functions This commit refactors the udp_update_hdr4() and udp_update_hdr6() functions to improve code portability by replacing the udp_meta_t parameter with more specific parameters for the IPv4 and IPv6 headers (iphdr/ipv6hdr) and the source socket address (sockaddr_in/sockaddr_in6). It also moves the tap_hdr_update() function call inside the udp_tap_send() function not to have to pass the TAP header to udp_update_hdr4() and udp_update_hdr6() This refactor reduces complexity by making the functions more modular and ensuring that each function operates on more narrowly scoped data structures. This will facilitate future backend introduction like vhost-user. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-13 15:45:32 +02:00
Laurent Vivier	9ecf7fedc5	tap: refactor packets handling functions Consolidate pool_tap4() and pool_tap6() into tap_flush_pools(), and tap4_handler() and tap6_handler() into tap_handler(). Create a generic tap_add_packet() to consolidate packet addition logic and reduce code duplication. The purpose is to ease the export of these functions to use them with the vhost-user backend. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-13 15:45:19 +02:00
Laurent Vivier	fba2b544b6	tcp: move buffers management functions to their own file Move all the TCP parts using internal buffers to tcp_buf.c and keep generic TCP management functions in tcp.c. Add tcp_internal.h to export needed functions from tcp.c and tcp_buf.h from tcp_buf.c With this change we can use existing TCP functions with a different kind of memory storage as for instance the shared memory provided by the guest via vhost-user. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-13 15:45:05 +02:00
Laurent Vivier	ec26fa013a	tcp: extract buffer management from tcp_send_flag() This commit isolates the internal data structure management used for storing data (e.g., tcp4_l2_flags_iov[], tcp6_l2_flags_iov[], tcp4_flags_ip[], tcp4_flags[], ...) from the tcp_send_flag() function. The extracted functionality is relocated to a new function named tcp_fill_flag_header(). tcp_fill_flag_header() is now a generic function that accepts parameters such as struct tcphdr and a data pointer. tcp_send_flag() utilizes this parameter to pass memory pointers from tcp4_l2_flags_iov[] and tcp6_l2_flags_iov[]. This separation sets the stage for utilizing tcp_prepare_flags() to set the memory provided by the guest via vhost-user in future developments. Signed-off-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-13 15:43:35 +02:00
David Gibson	d949667436	cppcheck: Suppress constParameterCallback errors We have several functions which are used as callbacks for NS_CALL() which only read their void * parameter, they don't write it. The constParameterCallback warning in cppcheck 2.14.1 complains that this parameter could be const void , also pointing out that that would require casting the function pointer when used as a callback. Casting the function pointers seems substantially uglier than using a non-const void as the parameter, especially since in each case we cast the void * to a const pointer of specific type immediately. So, suppress these errors. I think it would make logical sense to suppress this globally, but that would cause unmatchedSuppression errors on earlier cppcheck versions. So, instead individually suppress it, along with unmatchedSuppression in the relevant places. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-08 13:06:20 +02:00
Derek Schrock	8a83b530fe	selinux: Allow access to user_devpts Allow access to user_devpts. $ pasta --version pasta 0^20240510.g7288448-1.fc40.x86_64 ... $ awk '' < /dev/null $ pasta --version $ While this might be a awk bug it appears pasta should still have access to devpts. Signed-off-by: Derek Schrock <dereks@lifeofadishwasher.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-07 20:44:44 +02:00
David Gibson	ec416fdcc4	tcp, flow: Fix some error paths which didn't clean up flows properly Flow table entries need to be fully initialised before returning to the main epoll loop. Commit `0060acd1` ("flow: Clarify and enforce flow state transitions") now enforces that: once a flow is allocated we must either cancel it, or activate it before returning to the main loop, or we will hit an ASSERT(). Some error paths in tcp_conn_from_tap() weren't correctly updated for this requirement - we can exit with a flow entry incompletely initialised. Correct that by cancelling the flows in those situations. I don't have enough information to be certain if this is the cause for podman bug 22925, but it plausibly could be. Fixes: `0060acd11b` ("flow: Clarify and enforce flow state transitions") Link: https://github.com/containers/podman/issues/22925 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-07 20:44:44 +02:00
David Gibson	3f63743a65	util: Use 'long' to represent millisecond durations timespec_diff_ms() returns an int representing a duration in milliseconds. This will overflow in about 25 days when an int is 32 bits. The way we use this function, we're probably not going to get a result that long, but it's not outrageously implausible. Use a long for safety. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-07 20:44:44 +02:00
David Gibson	f9e8ee0777	lineread: Use ssize_t for line lengths Functions and structures in lineread.c use plain int to record and report the length of lines we receive. This means we truncate the result from read(2) in some circumstances. Use ssize_t to avoid that. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-07 20:44:44 +02:00
David Gibson	c919bbbdd3	conf: Safer parsing of MAC addresses In conf() we parse a MAC address in two places, for the --ns-mac-addr and the -M options. As well as duplicating code, the logic for this parsing has several bugs: * The most serious is that if the given string is shorter than a MAC address should be, we'll access past the end of it. * We don't check the endptr supplied by strtol() which means we could ignore certain erroneous contents * We never check the separator characters between each octet * We ignore certain sorts of garbage that follow the MAC address Correct all these bugs in a new parse_mac() helper. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-07 20:44:44 +02:00
David Gibson	bda80ef53f	util: Use unsigned indices for bits in bitmaps A negative bit index in a bitmap doesn't make sense. Avoid this by construction by using unsigned indices. While we're there adjust bitmap_isset() to return a bool instead of an int. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-07 20:44:44 +02:00
David Gibson	0e36fe1a43	clang-tidy: Enable the bugprone-macro-parentheses check We globally disabled this, with a justification lumped together with several checks about braces. They don't really go together, the others are essentially a stylistic choice which doesn't match our style. Omitting brackets on macro parameters can lead to real and hard to track down bugs if an expression is ever passed to the macro instead of a plain identifier. We've only gotten away with the macros which trigger the warning, because of other conventions its been unlikely to invoke them with anything other than a simple identifier. Fix the macros, and enable the warning for the future. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-07 20:44:44 +02:00
David Gibson	7094b91d10	Remove pointless macro parameters in CALL_PROTO_HANDLER The 'c' parameter is always passed exactly 'c'. The 'now' parameter is always passed exactly 'now'. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-07 20:44:44 +02:00
David Gibson	c80fa6a6bb	udp: Make rport calculation more local cppcheck 2.14.1 complains about the rport variable not being in as small as scope as it could be. It's also only used once, so we might as well just open code the calculation for it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-07 20:44:44 +02:00
David Gibson	d2afb4b625	tcp: Make pointer const in tcp_revert_seq The th pointer could be const, which causes a cppcheck warning on at least some cppcheck versions (e.g. Cppcheck 2.13.0 in Fedora 40). Fixes: `e84a01e94c` ("tcp: move seq_to_tap update to when frame is queued") Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-07 20:44:44 +02:00
David Gibson	b3aeb004ea	log: Remove log_to_stdout option Now that we've simplified how usage() works, nothing ever sets the log_to_stdout flag. Eliminate it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-05 21:14:09 +02:00
David Gibson	7cb2088835	conf: Don't print usage via the logging subsystem The message from usage() when given invalid options, or the -h / --help option is currently printed by many calls to the info() function, also used for runtime logging of informational messages. That isn't useful: the usage message should always go to the terminal (stdout or stderr), never syslog or a logfile. It should never be filtered by priority. Really the only thing using the common logging functions does is give more opportunities for something to go wrong. Replace all the info() calls with direct fprintf() calls. This does mean manually adding "\n" to each message. A little messy, but worth it for the simplicity in other dimensions. While we're there make much heavier use of single strings containing multiple lines of output text. That reduces the number of fprintf calls, reducing visual clutter and making it easier to see what the output will look like from the source. Link: https://bugs.passt.top/show_bug.cgi?id=90 Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-05 21:14:06 +02:00
David Gibson	e651197b5c	conf: Remove unhelpful usage() wrapper usage() does nothing but call print_usage() with EXIT_FAILURE as a parameter. It's no more complex to just give that parameter at the single call site. Eliminate it and rename print_usage() to just usage(). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-05 21:14:03 +02:00
Jon Maloy	e84a01e94c	tcp: move seq_to_tap update to when frame is queued commit `a469fc393f` ("tcp, tap: Don't increase tap-side sequence counter for dropped frames") delayed update of conn->seq_to_tap until the moment the corresponding frame has been successfully pushed out. This has the advantage that we immediately can make a new attempt to transmit a frame after a failed trasnmit, rather than waiting for the peer to later discover a gap and trigger the fast retransmit mechanism to solve the problem. This approach has turned out to cause a problem with spurious sequence number updates during peer-initiated retransmits, and we have realized it may not be the best way to solve the above issue. We now restore the previous method, by updating the said field at the moment a frame is added to the outqueue. To retain the advantage of having a quick re-attempt based on local failure detection, we now scan through the part of the outqueue that had do be dropped, and restore the sequence counter for each affected connection to the most appropriate value. Signed-off-by: Jon Maloy <jmaloy@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-06-05 21:13:16 +02:00
Stefano Brivio	765eb0bf16	apparmor: Fix comments after PID file and AF_UNIX socket creation refactoring Now: - we don't open the PID file in main() anymore - PID file and AF_UNIX socket are opened by pidfile_open() and tap_sock_unix_open() - write_pidfile() becomes pidfile_write() Reported-by: Richard W.M. Jones <rjones@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Acked-by: Richard W.M. Jones <rjones@redhat.com>	2024-05-23 16:44:21 +02:00
Stefano Brivio	0608ec42f2	conf, passt.h: Rename pid_file in struct ctx to pidfile We have pidfile_fd now, pid_file_fd would be quite ugly. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Richard W.M. Jones <rjones@redhat.com>	2024-05-23 16:44:14 +02:00
Stefano Brivio	c9b2413465	conf, passt, tap: Open socket and PID files before switching UID/GID Otherwise, if the user runs us as root, and gives us paths that are only accessible by root, we'll fail to open them, which might in turn encourage users to change permissions or ownerships: definitely a bad idea in terms of security. Reported-by: Minxi Hou <mhou@redhat.com> Reported-by: Richard W.M. Jones <rjones@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Acked-by: Richard W.M. Jones <rjones@redhat.com>	2024-05-23 16:43:26 +02:00
Stefano Brivio	ba23b05545	passt, util: Move opening of PID file to its own function We won't call it from main() any longer: move it. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Richard W.M. Jones <rjones@redhat.com>	2024-05-23 16:43:13 +02:00
Stefano Brivio	57d8aa8ffe	util: Rename write_pidfile() to pidfile_write() As I'm adding pidfile_open() in the next patch. The function comment didn't match, by the way. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Richard W.M. Jones <rjones@redhat.com>	2024-05-23 16:43:05 +02:00
Stefano Brivio	cbca08cd38	tap: Split tap_sock_unix_init() into opening and listening parts We'll need to open and bind the socket a while before listening to it, so split that into two different functions. No functional changes intended. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Richard W.M. Jones <rjones@redhat.com>	2024-05-23 16:42:43 +02:00
Stefano Brivio	fcfb592adc	passt, tap: Don't use -1 as uninitialised value for fd_tap_listen This is a remnant from the time we kept access to the original filesystem and we could reinitialise the listening AF_UNIX socket. Since commit `0515adceaa` ("passt, pasta: Namespace-based sandboxing, defer seccomp policy application"), however, we can't re-bind the listening socket once we're up and running. Drop the -1 initalisation and the corresponding check. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-05-23 16:42:27 +02:00
Stefano Brivio	d02bb6ca05	tap: Move all-ones initialisation of mac_guest to tap_sock_init() It has nothing to do with tap_sock_unix_init(). It used to be there as that function could be called multiple times per passt instance, but it's not the case anymore. This also takes care of the fact that, with --fd, we wouldn't set the initial MAC address, so we would need to wait for the guest to send us an ARP packet before we could exchange data. Fixes: `6b4e68383c` ("passt, tap: Add --fd option") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Acked-by: Richard W.M. Jones <rjones@redhat.com>	2024-05-23 16:42:06 +02:00
Stefano Brivio	45b8632dcc	conf: Don't lecture user about starting us as root libguestfs tools have a good reason to run as root: if the guest image is owned by root, it would be counterproductive to encourage users to invoke them as non-root, as it would require changing permissions or ownership of the image file. And if they run as root, we'll start as root, too. Warn users we'll switch to 'nobody', but don't tell them what to do. Reported-by: Richard W.M. Jones <rjones@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Richard W.M. Jones <rjones@redhat.com>	2024-05-23 16:40:33 +02:00
David Gibson	3f917b326b	netlink, test: Ignore deprecated addresses When we retrieve or copy host addresses we can include deprecated addresses, which is not what we want. Adjust our logic to exclude them. Similarly our tests can retrieve deprecated addresses, so exclude them there too. I hit this in practice because my router sometimes temporarily advertises an fd00:: prefix before the real delegated IPv6 prefix. The deprecated address can hang around for some time messing up my tests. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-05-22 23:21:09 +02:00
David Gibson	cc801fb38f	tcp: Remove interim 'tapside' field from connection We recently introduced this field to keep track of which side of a TCP flow is the guest/tap facing one. Now that we generically record which pif each side of each flow is connected to, we can easily derive that, and no longer need to keep track of it explicitly. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-05-22 23:21:06 +02:00
David Gibson	8a2accb847	flow: Record the pifs for each side of each flow Currently we have no generic information flows apart from the type and state, everything else is specific to the flow type. Start introducing generic flow information by recording the pifs which the flow connects. To keep track of what information is valid, introduce new flow states: INI for when the initiating side information is complete, and TGT for when both sides information is complete, but we haven't chosen the flow type yet. For now, these states don't do an awful lot, but they'll become more important as we add more generic information. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-05-22 23:21:03 +02:00
David Gibson	43571852e6	flow: Make side 0 always be the initiating side Each flow in the flow table has two sides, 0 and 1, representing the two interfaces between which passt/pasta will forward data for that flow. Which side is which is currently up to the protocol specific code: TCP uses side 0 for the host/"sock" side and 1 for the guest/"tap" side, except for spliced connections where it uses 0 for the initiating side and 1 for the target side. ICMP also uses 0 for the host/"sock" side and 1 for the guest/"tap" side, but in its case the latter is always also the initiating side. Make this generically consistent by always using side 0 for the initiating side and 1 for the target side. This doesn't simplify a lot for now, and arguably makes TCP slightly more complex, since we add an extra field to the connection structure to record which is the guest facing side. This is an interim change, which we'll be able to remove later. Signed-off-by: David Gibson <david@gibson.dropbear.id.au>q Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-05-22 23:21:01 +02:00
David Gibson	0060acd11b	flow: Clarify and enforce flow state transitions Flows move over several different states in their lifetime. The rules for these are documented in comments, but they're pretty complex and a number of the transitions are implicit, which makes this pretty fragile and error prone. Change the code to explicitly track the states in a field. Make all transitions explicit and logged. To the extent that it's practical in C, enforce what can and can't be done in various states with ASSERT()s. While we're at it, tweak the docs to clarify the restrictions on each state a bit. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-05-22 23:20:58 +02:00
David Gibson	a63199832a	inany: Better helpers for using inany and specific family addrs together This adds some extra inany helpers for comparing an inany address to addresses of a specific family (including special addresses), and building an inany from an IPv4 address (either statically or at runtime). Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-05-22 23:20:55 +02:00
David Gibson	7a832a8a0e	flow: Properly type callbacks to protocol specific handlers The flow dispatches deferred and timer handling for flows centrally, but needs to call into protocol specific code for the handling of individual flows. Currently this passes a general union flow *. It makes more sense to pass the specific relevant flow type structure. That brings the check on the flow type adjacent to casting to the union variant which it tags. Arguably, this is a slight abstraction violation since it involves the generic flow code using protocol specific types. It's already calling into protocol specific functions, so I don't think this really makes any difference. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-05-22 23:20:52 +02:00
David Gibson	1a20370b36	util, tcp: Add helper to display socket addresses When reporting errors, we sometimes want to show a relevant socket address. Doing so by extracting the various relevant fields can be pretty awkward, so introduce a sockaddr_ntop() helper to make it simpler. For now we just have one user in tcp.c, but I have further upcoming patches which can make use of it. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-05-22 23:20:37 +02:00
Maxime Bélair	3ff3a8a467	apparmor: Fix passt abstraction Commit `b686afa2` introduced the invalid apparmor rule `mount options=(rw, runbindable) /,` since runbindable mount rules cannot have a source. Therefore running aa-logprof/aa-genprof will trigger errors (see https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/2065685) $ sudo aa-logprof ERROR: Operation {'runbindable'} cannot have a source. Source = AARE('/') This patch fixes it to the intended behavior. Link: https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/2065685 Fixes: `b686afa23e` ("apparmor: Explicitly pass options we use while remounting root filesystem") Signed-off-by: Maxime Bélair <maxime.belair@canonical.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-05-22 23:16:27 +02:00
Paul Holzinger	6cdc9fd51b	apparmor: allow netns paths on /tmp For some unknown reason "owner" makes it impossible to open bind mounted netns references as apparmor denies it. In the kernel denied log entry we see ouid=0 but it is not clear why that is as the actual file is owned by the real (rootless) user id. In abstractions/pasta there is already `@{run}/user/@{uid}/**` without owner set for the same reason as this path contains the netns path by default when running under Podman. Fixes: `72884484b0` ("apparmor: allow read access on /tmp for pasta") Signed-off-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-05-13 23:02:18 +02:00
David Gibson	80f7ff2996	clang-tidy: Suppress macro to enum conversion warnings clang-tidy 18.1.1 in Fedora 40 complains about every #define of an integral value, suggesting it be converted to an enum. Although that's certainly possible, it's of dubious value and results in some awkward arrangements on out codebase. Suppress it globally. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-05-13 23:02:07 +02:00
David Gibson	29bd08ff0f	conf: Fix clang-tidy warning about using an undefined enum value In conf() we temporarily set the forwarding mode variables to 0 - an invalid value, so that we can check later if they've been set by the intervening logic. clang-tidy 18.1.1 in Fedora 40 now complains about this. Satisfy it by giving an name in the enum to the 0 value. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-05-13 23:02:05 +02:00
lemmi	26c71db332	passt.c: explicitly include libgen.h for basename fixes implicit declaration warning on musl Signed-off-by: lemmi <lemmi@nerd2nerd.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-05-13 23:00:52 +02:00
Stefano Brivio	623c2fd621	netlink: Don't duplicate routes referring to unrelated host interfaces We take care of this in nl_addr_dup(): if the interface index associated to an address doesn't match the selected host interface (ifa->ifa_index != ifi_src), we don't copy that address. But for routes, we just unconditionally update the interface index to match the index in the target namespace, even if the source interface didn't match. This might happen in two cases: with a pre-4.20 kernel without support for NETLINK_GET_STRICT_CHK, which won't filter routes based on the interface we pass in the request, as reported by runsisi, and any kernel with support for multipath routes where any of the nexthops refers to an unrelated host interface. In both cases, check the index of the source interface, and avoid copying unrelated routes. Reported-by: runsisi <runsisi@hust.edu.cn> Link: https://bugs.passt.top/show_bug.cgi?id=86 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Tested-by: runsisi <runsisi@hust.edu.cn> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-05-11 00:52:19 +02:00
Paul Holzinger	72884484b0	apparmor: allow read access on /tmp for pasta The podman CI on debian runs tests based on /tmp but pasta is failing there because it is unable to open the netns path as the open for read access is denied. Link: https://github.com/containers/podman/issues/22625 Signed-off-by: Paul Holzinger <pholzing@redhat.com> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-05-10 16:53:35 +02:00
Stefano Brivio	7e6a606c32	tcp_splice: Set OUT_WAIT_ flag whenever pipe isn't emptied In tcp_splice_sock_handler(), if we get EAGAIN on the second splice(), from pipe to receiving socket, that doesn't necessarily mean that the pipe is empty: the receiver buffer might be full instead. Hence, we can't use the 'never_read' flag to decide that there's nothing to wait for: even if we didn't read anything from the sending side in a given iteration, we might still have data to send in the pipe. Use read/written counters, instead. This fixes an issue where large bulk transfers would occasionally hang. From a corresponding strace: 0.000061 epoll_wait(4, [{events=EPOLLOUT, data={u32=29442, u64=12884931330}}], 8, 1000) = 1 0.005003 epoll_ctl(4, EPOLL_CTL_MOD, 211, {events=EPOLLIN\|EPOLLRDHUP, data={u32=54018, u64=8589988610}}) = 0 0.000089 epoll_ctl(4, EPOLL_CTL_MOD, 115, {events=EPOLLIN\|EPOLLRDHUP, data={u32=29442, u64=12884931330}}) = 0 0.000081 splice(211, NULL, 151, NULL, 1048576, SPLICE_F_MOVE\|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable) 0.000073 splice(150, NULL, 115, NULL, 1048576, SPLICE_F_MOVE\|SPLICE_F_NONBLOCK) = 1048576 0.000087 splice(211, NULL, 151, NULL, 1048576, SPLICE_F_MOVE\|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable) 0.000045 splice(150, NULL, 115, NULL, 1048576, SPLICE_F_MOVE\|SPLICE_F_NONBLOCK) = 520415 0.000060 splice(211, NULL, 151, NULL, 1048576, SPLICE_F_MOVE\|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable) 0.000044 splice(150, NULL, 115, NULL, 1048576, SPLICE_F_MOVE\|SPLICE_F_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable) 0.000044 epoll_wait(4, [], 8, 1000) = 0 we're reading from socket 211 into to the pipe end numbered 151, which connects to pipe end 150, and from there we're writing into socket 115. We initially drop EPOLLOUT from the set of monitored flags for socket 115, because it already signaled it's ready for output. Then we read nothing from socket 211 (the sender had nothing to send), and we keep emptying the pipe into socket 115 (first 1048576 bytes, then 520415 bytes). This call of tcp_splice_sock_handler() ends with EAGAIN on the writing side, and we just exit this function without setting the OUT_WAIT_1 flag (and, in turn, EPOLLOUT for socket 115). However, it turns out, the pipe wasn't actually emptied, and while socket 211 had nothing more to send, we should have waited on socket 115 to be ready for output again. As a further step, we could consider not clearing EPOLLOUT at all, unless the read/written counters match, but I'm first trying to fix this ugly issue with a minimal patch. Link: https://github.com/containers/podman/issues/22575 Link: https://github.com/containers/podman/issues/22593 Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>	2024-05-10 16:52:29 +02:00
David Gibson	1ba76c9e8c	udp: Single buffer for IPv4, IPv6 headers and metadata Currently we have separate arrays for IPv4 and IPv6 which contain the headers for guest-bound packets, and also the originating socket address. We can combine these into a single array of "metadata" structures with space for both pre-cooked IPv4 and IPv6 headers, as well as shared space for the tap specific header and socket address (using sockaddr_inany). Because we're using IOVs to separately address the pieces of each frame, these structures don't need to be packed to keep the headers contiguous so we can more naturally arrange for the alignment we want. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-05-02 16:13:52 +02:00
David Gibson	d4598e1d18	udp: Use the same buffer for the L2 header for all frames Currently each tap-bound frame buffer has room for its own ethernet header. However the ethernet header is always the same for such frames, so now that we're indirectly referencing the ethernet header via iov, we can use a single buffer for all of them. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-05-02 16:13:49 +02:00
David Gibson	6170688616	udp: Share payload buffers between IPv4 and IPv6 Currently the IPv4 and IPv6 paths unnecessarily use different buffers for the UDP payload. Now that we're handling the various pieces of the UDP packets with an iov, we can split the payload part of the buffers off into its own array shared between IPv4 and IPv6. As well as saving a little memory, this allows the payload buffers to be neatly page aligned. With the buffers merged, udp[46]_l2_iov_sock contain exactly the same thing as each other and can also be merged. Likewise udp[46]_iov_splice can be merged together. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-05-02 16:13:46 +02:00
David Gibson	2d16946bac	udp: Explicitly set checksum in guest-bound UDP headers For IPv4, UDP checksums are optional and can just be set to 0. udp_update_hdr4() ignores the checksum field entirely. Since these are set to 0 during startup, this works as intended for now. However, we'd like to share payload and UDP header buffers betweem IPv4 and IPv6, which does calculate UDP checksums. Therefore, for robustness, we should explicitly set the checksum field to 0 for guest-bound UDP packets. In the tap_udp4_send() slow path, however, we do allow IPv4 UDP checksums to be calculated as a compile time option. For consistency, use the same thing in the udp_update_hdr4() path, which will typically initialize to 0, but calculate a real checksum if configured to do so. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-05-02 16:13:44 +02:00
David Gibson	6c4d26a364	udp: Combine initialisation of IPv4 and IPv6 iovs We're going to introduce more sharing between the IPv4 and IPv6 buffer structures. Prepare for this by combinng the initialisation functions. While we're at it remove the misleading "sock" from the name since these initialise both tap side and sock side structures. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>	2024-05-02 16:13:41 +02:00

1 2 3 4 5 ...

1485 Commits