17 Commits

Author SHA1 Message Date
Laine Stump
e9e5ebe6a6 network: add tc filter rule to nftables backend to fix checksum of DHCP responses
Please see the commit log for commit v10.9.0-rc1-1-g42ab0148dd for the
history and explanation of the problem that this patch is fixing.

A shorter explanation is that when a guest is connected to a libvirt
virtual network using a virtio-net adapter with in-kernel "vhost-net"
packet processing enabled, it will fail to acquire an IP address from
a DHCP server running on the host.

In commit v10.9.0-rc1-1-g42ab0148dd we tried fixing this by *zeroing
out* the checksums of these packets with an nftables rule (nftables
can't recompute the checksum, but it can set it to 0). This
*appeared* to work initially, but it turned out that zeroing the
checksum ends up breaking dhcp packets on *non* virtio/vhost-net guest
interfaces. That attempt was reverted in commit v10.9.0-rc2.

Fortunately, there is an existing way to recompute the checksum of a
packet as it leaves an interface - the "tc" (traffic control) utility
that libvirt already uses for bandwidth management. This patch uses a
tc filter rule to match dhcp response packets on the bridge and
recompute their checksum.

The filter rule must be attached to a tc qdisc, which may also have a
filter attached for bandwidth management (in the <bandwidth> element
of the network config). Not only must we add the qdisc only once
(which was already handled by the patch two prior to this one), but
also the filter rule for checksum fixing and the filter rule for
bandwidth management must be different priorities so they don't clash;
this is solved by adding the checksum-fix filter with "priority 2",
while the bandwidth management filter remains "priority 1" (both will
always be evaluated anyway, it's just a matter of which is evaluated
first).
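
As a rough illustration of the paragraph above, the sketch below only
builds the argument list for such a filter and never runs it. The
helper name, the "1:" qdisc handle, and the exact u32/csum argument
layout are assumptions based on general tc(8) usage, not text taken
from this patch; only the two priorities and the DHCP client port come
from the description.

```python
from typing import List

# Priorities as described in the text; everything else is illustrative.
BANDWIDTH_FILTER_PRIO = 1   # existing <bandwidth> filter
CHECKSUM_FILTER_PRIO = 2    # checksum-fix filter
DHCP_CLIENT_PORT = 68       # DHCP responses are sent to UDP port 68


def checksum_filter_argv(bridge: str, qdisc_handle: str = "1:") -> List[str]:
    """Build (but do not execute) a tc command line for the checksum fix.

    Hypothetical helper: the argument layout is an assumption.
    """
    return [
        "tc", "filter", "add", "dev", bridge,
        "prio", str(CHECKSUM_FILTER_PRIO),
        "protocol", "ip",
        "parent", qdisc_handle,
        # u32 match on destination port 68 with a full 16-bit mask
        "u32", "match", "ip", "dport", str(DHCP_CLIENT_PORT), "0xffff",
        # csum action: recompute the IPv4 header and UDP checksums
        "action", "csum", "ip", "and", "udp",
    ]


if __name__ == "__main__":
    print(" ".join(checksum_filter_argv("virbr0")))
```

Keeping the two priorities as distinct constants makes the "must not
clash" requirement visible in one place.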

So far this method has worked with every different guest we could
throw at it, including several that failed with the previous method.

Fixes: b89c4991da
Reported-by: Rich Jones <rjones@redhat.com>
Reported-by: Andrea Bolognani <abologna@redhat.com>
Fix-Suggested-by: Eric Garver <egarver@redhat.com>
Fix-Suggested-by: Phil Sutter <psutter@redhat.com>
Signed-off-by: Laine Stump <laine@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
2024-11-26 14:36:14 +01:00
Laine Stump
7581e3b6d5 Revert "network: add rule to nftables backend that zeroes checksum of DHCP responses"
This reverts commit 42ab0148dd.

This patch was supposed to fix the checksum of dhcp response packets
by setting it to 0 (because having a non-0 but incorrect checksum was
causing the packets to be dropped on FreeBSD guests).

Early testing was positive, but after the patch was pushed upstream
and more people could test it, it turned out that while it fixed the
dhcp checksum problem for virtio-net interfaces on FreeBSD and
OpenBSD, it also *broke* dhcp checksums for the e1000 emulated NIC on
*all* guests (but not e1000e).

So we're reverting this fix and looking for something more universal
to be included in the next release.

Signed-off-by: Laine Stump <laine@redhat.com>
Reviewed-by: Andrea Bolognani <abologna@redhat.com>
2024-10-30 11:39:58 +01:00
Laine Stump
42ab0148dd network: add rule to nftables backend that zeroes checksum of DHCP responses
Many years ago (April 2010), soon after "vhost" in-kernel packet
processing was added to the virtio-net driver, people running RHEL5
virtual machines with a virtio-net interface connected via a libvirt
virtual network noticed that when vhost packet processing was enabled,
their VMs could no longer get an IP address via DHCP - the guest was
ignoring the DHCP response packets sent by the host.

(I've been informed by danpb that the same issue had been encountered,
and "fixed" even earlier than that, in 2006, with Xen as the
hypervisor.)

The "gory details" of the 2010 discussion are chronicled here:

  https://lists.isc.org/pipermail/dhcp-hackers/2010-April/001835.html

but basically it was because packet checksums weren't being fully
computed on the host side (because QEMU on the host and the NIC driver
in the guest had agreed between themselves to turn off checksums
because they were unnecessary due to the "link" between the two being
entirely in local memory rather than an error-prone physical cable),
but

1) a partial checksum was being put into the packets at some point by
   "someone"

2) the "don't use checksums" info was known by the guest kernel (which
   would properly ignore the "bad" checksum), and

3) the packets were being read by the dhclient application on the
   guest side with a "raw" socket (thus bypassing the guest kernel UDP
   processing that would have known the checksum was irrelevant and
   ignored it).

The "fix" for this ended up being two-tiered:

1) The ISC DHCP package (which contains the aforementioned dhclient
program) made a fix to their dhclient code which caused it to accept
packets anyway even if they didn't have a proper checksum (NB: that's
not a full explanation, and possibly not accurate). This remedied the
problem for guests with an updated dhclient. Here is the code with the
fix to ISC DHCP:

  https://github.com/isc-projects/dhcp/blob/master/common/packet.c#L365

This eliminated the issue for any new/updated guests that had the
fixed dhclient, but it didn't solve the problem for existing/old guest
images that didn't/couldn't get their dhclient updated. This brings us
to:

2) iptables added a new "CHECKSUM" target and "--checksum-fill"
action:

  http://patchwork.ozlabs.org/patch/58525/

and libvirt added an iptables rule for each virtual network to match
DHCP response packets and perform --checksum-fill. This way by the
time dhclient on the guest read the raw packet, the checksum would be
corrected, and the packet would be accepted. This was pushed upstream
in libvirt commit v0.8.2-142-gfd5b15ff1a.

The word at the time from those more knowledgeable than me was that
the bad checksum problem was really specific to ISC's dhclient running
on Linux, and so once their fix was in use everywhere dhclient was
used, bad checksums would be a thing of the past and the
--checksum-fill iptables rules would no longer be needed (but would
otherwise be harmless if they were still there).

(Plot twist: the dhclient code in fix (1) above apparently is on a
Linux-only code path - this is very important later!)

Based on this information (and also due to the opinion that fixing it
by having iptables modify the packet checksum was really the wrong way
to permanently fix things, i.e. an "ugly hack"), the nftables
developers made the decision to not implement an equivalent to
--checksum-fill in nftables. As a result, when I wrote the nftables
firewall backend for libvirt virtual networks earlier this year, it
didn't add in any rule to "fix" broken UDP checksums (since there was
apparently no equivalent in nftables and, after all, that was fixed
somewhere else 14 years ago, right???)

But last week, when Rich Jones was doing routine testing using a Fedora
40 host (the first Fedora release to use the nftables backend of libvirt's
network driver by default) and a FreeBSD guest, for "some strange
reason", the FreeBSD guest was unable to get an IP address from DHCP!!

  https://www.spinics.net/linux/fedora/libvirt-users/msg14356.html

A few quick tests proved that it was the same old "bad checksum"
problem from 2010 come back to haunt us - it wasn't a Linux-only issue
after all.

Phil Sutter and Eric Garver (nftables people) pointed out that, while
nftables doesn't have an action that will *compute* the checksum of a
packet, it *does* have an action that will set the checksum to 0, and
suggested we try adding a "zero the checksum" rule for dhcp response
packets to our nftables ruleset. (Why? Because a checksum value of 0
in an IPv4 UDP packet is defined by RFC768 to mean "no checksum
generated", implying "checksum not needed").  It turns out that this
works - dhclient properly recognizes that a 0 checksum means "don't
bother with the checksum", and accepts the packet as valid.
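
To make the "0 means no checksum" rule concrete, here is a minimal
sketch of the RFC768 checksum over the IPv4 pseudo-header, plus a
validator that treats a stored value of 0 as "nothing to verify". It is
a simplified model of what a raw-socket reader might do, not dhclient's
actual code; the function names, addresses, and payload are made up for
illustration.

```python
import socket
import struct


def ones_complement_sum(data: bytes) -> int:
    """16-bit one's-complement sum of data, padded to an even length."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return total


def udp_checksum(src_ip: str, dst_ip: str, udp_segment: bytes) -> int:
    """Checksum over the pseudo-header plus the UDP header and payload.

    The checksum field inside udp_segment must be zero when computing.
    """
    pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
              + struct.pack("!BBH", 0, socket.IPPROTO_UDP, len(udp_segment)))
    checksum = ~ones_complement_sum(pseudo + udp_segment) & 0xFFFF
    # A computed value of 0 is transmitted as 0xFFFF, since 0 is reserved.
    return checksum or 0xFFFF


def build_segment(payload: bytes, checksum: int) -> bytes:
    """UDP header (source port 67, destination port 68) plus payload."""
    return struct.pack("!HHHH", 67, 68, 8 + len(payload), checksum) + payload


def receiver_accepts(src_ip: str, dst_ip: str, udp_segment: bytes) -> bool:
    """Accept if the checksum is 0 (not generated) or verifies correctly."""
    (stored,) = struct.unpack("!H", udp_segment[6:8])
    if stored == 0:
        return True  # sender generated no checksum, nothing to verify
    zeroed = udp_segment[:6] + b"\x00\x00" + udp_segment[8:]
    return stored == udp_checksum(src_ip, dst_ip, zeroed)


if __name__ == "__main__":
    src, dst = "192.168.122.1", "192.168.122.50"  # illustrative addresses
    payload = b"example DHCP reply"
    good = udp_checksum(src, dst, build_segment(payload, 0))
    partial = 0x1234 if good != 0x1234 else 0x4321  # non-zero but wrong
    print("correct :", receiver_accepts(src, dst, build_segment(payload, good)))
    print("zeroed  :", receiver_accepts(src, dst, build_segment(payload, 0)))
    print("partial :", receiver_accepts(src, dst, build_segment(payload, partial)))
```

In this model a non-zero but incorrect value is the only case that gets
rejected, which is the situation the zeroing rule tries to avoid.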

So to once again fix this timeless bug, this patch adds such a
checksum zeroing rule to the nftables rules setup for each virtual
network.

This has been verified (on a Fedora 40 host) to fix DHCP with FreeBSD
and OpenBSD guests, while not breaking it for Fedora or Windows (10)
guests.

Fixes: b89c4991da
Reported-by: Rich Jones <rjones@redhat.com>
Fix-Suggested-by: Eric Garver <egarver@redhat.com>
Fix-Suggested-by: Phil Sutter <psutter@redhat.com>
Signed-off-by: Laine Stump <laine@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
2024-10-25 12:00:52 -04:00
Laine Stump
397c0f4b01 network: add more firewall test cases
This patch adds some previously missing test cases that test for
proper firewall rule creation when the following are included in the
network definition:

* <forward dev='blah'>
* no forward element (an "isolated" network)
* nat port range when only ipv4 is nat-ed
* nat port range when both ipv4 & ipv6 are nat-ed

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Laine Stump <laine@redhat.com>
2024-06-24 13:51:04 +01:00
Laine Stump
aabf279ca0 tests: fix broken nftables test data so that individual tests are successful
When the chain names and table name used by the nftables firewall
backend were changed in commit
958aa7f274, I forgot to change the test
data file base.nftables, which has the extra "list" and "add
chain/table" commands that are generated for the first test case of
networkxml2firewalltest.c. When the full set of tests is run, the
first test will be an iptables test case, so those extra commands
won't be added to any of the nftables cases, and so the data in
base.nftables is never compared, and the tests are all successful.

However, if the tests are limited with, e.g. VIR_TEST_RANGE=2 (test #2
will be the nftables version of the 1st test case), then the commands
to add nftables table/chains *will* be generated in the test output,
and so the test will fail. Because I was only running the entire test
series after the initial commits of nftables tests, I didn't notice
this. Until now.

base.nftables has now been updated to reflect the current names for
chains/table, and running individual test cases is once again
successful.

Fixes: 958aa7f274
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Laine Stump <laine@redhat.com>
2024-06-24 13:49:26 +01:00
Laine Stump
a4f38f6ffe network: use iif/oif instead of iifname/oifname in nftables rules
iifname/oifname need to look up the string that contains the name of
the interface each time a packet is checked, while iif/oif compare the
ifindex of the interface, which is included directly in the
packet. Conveniently, the rule is created using the *name* of the
interface (which gets converted to ifindex as the rule is added), so
no extra work is required other than changing the commandline option.

If it was the case that the interface could be deleted and re-added
during the life of the rule, we would have to use Xifname (since
deleting and re-adding the interface would result in ifindex
changing), but for our uses this never happens, so Xif works for us,
and undoubtedly improves performance by at least 0.0000001%.
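
The tradeoff can be shown with a toy model (all class names here are
hypothetical, and this is not how the kernel implements either match):
an "iif"-style rule resolves the name to an ifindex once when the rule
is created, while an "iifname"-style rule does a name lookup for every
packet. The sketch also shows the failure mode that does not apply to
our use: a rule bound to an ifindex stops matching if the interface is
deleted and re-added.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Packet:
    ifindex: int  # index of the interface the packet arrived on


class Host:
    """Toy stand-in for the kernel's interface table."""

    def __init__(self) -> None:
        self._next_index = 1
        self.by_name: Dict[str, int] = {}

    def add_interface(self, name: str) -> int:
        index = self._next_index
        self._next_index += 1  # a re-added interface gets a new index
        self.by_name[name] = index
        return index

    def delete_interface(self, name: str) -> None:
        del self.by_name[name]

    def name_of(self, ifindex: int) -> str:
        for name, index in self.by_name.items():
            if index == ifindex:
                return name
        return ""


class IifRule:
    """Like 'iif': name resolved to an ifindex once, at rule creation."""

    def __init__(self, host: Host, name: str) -> None:
        self.ifindex = host.by_name[name]

    def matches(self, host: Host, pkt: Packet) -> bool:
        return pkt.ifindex == self.ifindex  # plain integer comparison


class IifnameRule:
    """Like 'iifname': the name is looked up for every packet."""

    def __init__(self, name: str) -> None:
        self.name = name

    def matches(self, host: Host, pkt: Packet) -> bool:
        return host.name_of(pkt.ifindex) == self.name  # per-packet lookup


if __name__ == "__main__":
    host = Host()
    first = host.add_interface("virbr0")
    by_index, by_name = IifRule(host, "virbr0"), IifnameRule("virbr0")
    print(by_index.matches(host, Packet(first)),
          by_name.matches(host, Packet(first)))
    host.delete_interface("virbr0")
    second = host.add_interface("virbr0")
    print(by_index.matches(host, Packet(second)),
          by_name.matches(host, Packet(second)))
```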

Signed-off-by: Laine Stump <laine@redhat.com>
Reviewed-by: Jiri Denemark <jdenemar@redhat.com>
2024-05-27 23:53:58 +02:00
Laine Stump
afbd1bb89e network: eliminate pointless host input/output rules from nftables backend
The iptables backend (which was used as the model for the nftables
backend) used the same "filter" and "nat" tables used by other
services on the system (e.g. firewalld or any other host firewall
management application), so it was possible that one of those other
services would be blocking DNS, DHCP, or TFTP from guests to the host;
we added our own rules at the beginning of the chain to allow this
traffic no matter if someone else rejected it later.

But with nftables, each service uses their own table, and all traffic
must be accepted by all tables no matter what - it's not possible for
us to just insert a higher priority/earlier rule that will override
some reject rule put in by, e.g., firewalld. Instead the firewalld (or
other) table must be set up by that service to allow the traffic. That,
along with the fact that our table is already "accept by default",
makes it possible to eliminate the individual accept rules for DHCP,
DNS, and TFTP. And once those rules are eliminated, there is no longer
any need for the guest_to_host or host_to_guest chains.

Signed-off-by: Laine Stump <laine@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
2024-05-22 23:20:49 -04:00
Laine Stump
958aa7f274 network: rename chains used by network driver nftables backend
Because the chains added by the network driver nftables backend will
go into a table used only by libvirt, we don't need to have "libvirt"
in the chain names. Instead, we can make them more descriptive and
less abrasive (by using lower case, and using full words rather than
abbreviations).

Also (again because nobody else is using the private "libvirt_network"
table) we can directly put our rules into the input ("guest_to_host"),
output ("host_to_guest"), and postrouting ("guest_nat") chains rather
than creating a subordinate chain as done in the iptables backend.

Signed-off-by: Laine Stump <laine@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
2024-05-22 23:20:49 -04:00
Laine Stump
0bd7a47356 network: name the nftables table "libvirt_network" rather than "libvirt"
This way when we implement nftables for the nwfilter driver, we can
create a separate table called "libvirt_nwfilter" and everything will
look all symmetrical and stuff.

Signed-off-by: Laine Stump <laine@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
2024-05-22 23:20:49 -04:00
Laine Stump
f341bdee8d tests: test cases for nftables backend
Run all the networkxml2firewall tests twice - once with iptables
backend, and once with the nftables backend.

The results files for the existing iptables tests were previously
named *.args. That has been changed to *.iptables, and the results
files for the new nftables tests are named *.nftables.

Signed-off-by: Laine Stump <laine@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
2024-05-22 23:20:37 -04:00
Laine Stump
e66451f685 util/tests: enable locking on iptables/ebtables commandlines in unit tests
All the unit tests that use iptables/ip6tables/ebtables have been
written to omit the locking/exclusive use primitive on the generated
commandlines. Even though none of the tests actually execute those
commands (and so it doesn't matter for purposes of the test whether or
not the commands support these options), it still made sense when some
systems had these locking options and some didn't.

We are now at a point where every supported Linux distro has supported
the locking options on these commands for quite a long time, and we are
going to make their use non-optional. As a first step, this patch uses
the virFirewallSetLockOverride() function, which is called at the
beginning of all firewall-related tests, to set all the bools
controlling whether or not the locking options are used to true. This
means that all the test cases must be updated to include the proper
locking option in their commandlines.

The change to make actual execs of the commands unconditionally use
the locking option will be in an upcoming patch - this one affects
only the unit tests.

Signed-off-by: Laine Stump <laine@redhat.com>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
2020-11-24 14:21:08 -05:00
Daniel P. Berrangé
8a4f331e8c network: wire up support for IPv6 NAT rules
Now that we have support for IPv6 in the iptables helpers, and a new
option in the XML schema, we can wire up support for it in the network
driver.

Reviewed-by: Laine Stump <laine@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
2020-06-15 17:10:15 +01:00
Malina Salina
313a71ee7b network: allow DHCP/DNS/TFTP explicitly in OUTPUT rules
While the default iptables setup used by Fedora/RHEL distros
only restricts traffic on the INPUT and/or FORWARD rules,
some users might have custom firewalls that restrict the
OUTPUT rules too.

These can prevent DHCP/DNS/TFTP responses from dnsmasq
from reaching the guest VMs. We should thus whitelist
these protocols in the OUTPUT chain, as well as the
INPUT chain.

Signed-off-by: Malina Salina <malina.salina@protonmail.com>

Initial patch then modified to add unit tests and IPv6
support

Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
2019-10-18 18:49:54 +01:00
Daniel P. Berrangé
c6cbe18771 network: delay global firewall setup if no networks are running
Creating firewall rules for the virtual networks causes the kernel to
load the conntrack module. This imposes a significant performance
penalty on Linux network traffic. Thus we want to only take that hit if
we actually have virtual networks running.

We need to create global firewall rules during startup in order to
"upgrade" rules for any running networks created by older libvirt.
If no running networks are present though, we can safely delay setup
until the time we actually start a network.
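
The control flow described above could be sketched roughly as follows.
This is a toy model with hypothetical class and method names, and the
logged strings merely stand in for real firewall operations; it is not
the driver's actual code.

```python
from typing import List


class NetworkDriver:
    """Toy model of deferred global firewall setup."""

    def __init__(self, running_networks: List[str]) -> None:
        self.global_rules_ready = False
        self.log: List[str] = []
        self.running = list(running_networks)
        # At startup, only pay the conntrack cost if a network is
        # already running and needs its rules upgraded.
        if self.running:
            self._ensure_global_rules()
            for name in self.running:
                self.log.append(f"reload rules for {name}")

    def _ensure_global_rules(self) -> None:
        # Idempotent: the global setup happens at most once.
        if not self.global_rules_ready:
            self.log.append("create global chains")
            self.global_rules_ready = True

    def start_network(self, name: str) -> None:
        # Deferred path: the first network start triggers global setup.
        self._ensure_global_rules()
        self.log.append(f"add rules for {name}")
        self.running.append(name)


if __name__ == "__main__":
    idle = NetworkDriver([])
    print(idle.log)            # nothing created yet
    idle.start_network("default")
    print(idle.log)
```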

Reviewed-by: Jim Fehlig <jfehlig@suse.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
2019-05-23 16:29:48 +01:00
Daniel P. Berrangé
7431b3eb9a util: move virtual network firewall rules into private chains
The previous commit created new chains to hold the firewall rules. This
commit changes the code that creates rules to place them in the new
private chains instead of the builtin top level chains.

With two networks running, the rules in the filter table now look like

  -N LIBVIRT_FWI
  -N LIBVIRT_FWO
  -N LIBVIRT_FWX
  -N LIBVIRT_INP
  -N LIBVIRT_OUT
  -A INPUT -j LIBVIRT_INP
  -A FORWARD -j LIBVIRT_FWX
  -A FORWARD -j LIBVIRT_FWI
  -A FORWARD -j LIBVIRT_FWO
  -A OUTPUT -j LIBVIRT_OUT
  -A LIBVIRT_FWI -d 192.168.0.0/24 -o virbr0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
  -A LIBVIRT_FWI -o virbr0 -j REJECT --reject-with icmp-port-unreachable
  -A LIBVIRT_FWI -d 192.168.1.0/24 -o virbr1 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
  -A LIBVIRT_FWI -o virbr1 -j REJECT --reject-with icmp-port-unreachable
  -A LIBVIRT_FWO -s 192.168.0.0/24 -i virbr0 -j ACCEPT
  -A LIBVIRT_FWO -i virbr0 -j REJECT --reject-with icmp-port-unreachable
  -A LIBVIRT_FWO -s 192.168.1.0/24 -i virbr1 -j ACCEPT
  -A LIBVIRT_FWO -i virbr1 -j REJECT --reject-with icmp-port-unreachable
  -A LIBVIRT_FWX -i virbr0 -o virbr0 -j ACCEPT
  -A LIBVIRT_FWX -i virbr1 -o virbr1 -j ACCEPT
  -A LIBVIRT_INP -i virbr0 -p udp -m udp --dport 53 -j ACCEPT
  -A LIBVIRT_INP -i virbr0 -p tcp -m tcp --dport 53 -j ACCEPT
  -A LIBVIRT_INP -i virbr0 -p udp -m udp --dport 67 -j ACCEPT
  -A LIBVIRT_INP -i virbr0 -p tcp -m tcp --dport 67 -j ACCEPT
  -A LIBVIRT_INP -i virbr1 -p udp -m udp --dport 53 -j ACCEPT
  -A LIBVIRT_INP -i virbr1 -p tcp -m tcp --dport 53 -j ACCEPT
  -A LIBVIRT_INP -i virbr1 -p udp -m udp --dport 67 -j ACCEPT
  -A LIBVIRT_INP -i virbr1 -p tcp -m tcp --dport 67 -j ACCEPT
  -A LIBVIRT_OUT -o virbr0 -p udp -m udp --dport 68 -j ACCEPT
  -A LIBVIRT_OUT -o virbr1 -p udp -m udp --dport 68 -j ACCEPT

While in the nat table:

  -N LIBVIRT_PRT
  -A POSTROUTING -j LIBVIRT_PRT
  -A LIBVIRT_PRT -s 192.168.0.0/24 -d 224.0.0.0/24 -j RETURN
  -A LIBVIRT_PRT -s 192.168.0.0/24 -d 255.255.255.255/32 -j RETURN
  -A LIBVIRT_PRT -s 192.168.0.0/24 ! -d 192.168.0.0/24 -p tcp -j MASQUERADE --to-ports 1024-65535
  -A LIBVIRT_PRT -s 192.168.0.0/24 ! -d 192.168.0.0/24 -p udp -j MASQUERADE --to-ports 1024-65535
  -A LIBVIRT_PRT -s 192.168.0.0/24 ! -d 192.168.0.0/24 -j MASQUERADE
  -A LIBVIRT_PRT -s 192.168.1.0/24 -d 224.0.0.0/24 -j RETURN
  -A LIBVIRT_PRT -s 192.168.1.0/24 -d 255.255.255.255/32 -j RETURN
  -A LIBVIRT_PRT -s 192.168.1.0/24 ! -d 192.168.1.0/24 -p tcp -j MASQUERADE --to-ports 1024-65535
  -A LIBVIRT_PRT -s 192.168.1.0/24 ! -d 192.168.1.0/24 -p udp -j MASQUERADE --to-ports 1024-65535
  -A LIBVIRT_PRT -s 192.168.1.0/24 ! -d 192.168.1.0/24 -j MASQUERADE

And finally the mangle table:

  -N LIBVIRT_PRT
  -A POSTROUTING -j LIBVIRT_PRT
  -A LIBVIRT_PRT -o virbr0 -p udp -m udp --dport 68 -j CHECKSUM --checksum-fill
  -A LIBVIRT_PRT -o virbr1 -p udp -m udp --dport 68 -j CHECKSUM --checksum-fill

Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
2019-01-29 13:37:11 +00:00
Daniel P. Berrange
8afd34f2d8 tests: redo test argv file line wrapping
Back in

  commit bd6c46fa0c
  Author: Juerg Haefliger <juerg.haefliger@hp.com>
  Date:   Mon Jan 31 06:42:57 2011 -0500

    tests: handle backspace-newline pairs in test input files

all the test argv files were line wrapped so that the args
were less than 80 characters.

The way the line wrapping was done turns out to be quite
undesirable, because it often leaves multiple parameters
on the same line. If we later need to add or remove
individual parameters, then it leaves us having to redo
line wrapping.

This commit changes the line wrapping so that every
single "-param value" is on its own new line. If the
"value" is still too long, then we break on ',' or ':'
or ' ' as needed.

This means that when we come to add / remove parameters
from the test files, the patch diffs will only
ever show a single line added/removed which will greatly
simplify review work.

Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
2015-11-09 15:50:39 +00:00
Daniel P. Berrange
20512b8436 Add test for converting network XML to iptables rules
Using the virCommand dry run capability, capture iptables rules
created by various network XML documents.

Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
2014-04-25 15:44:09 +01:00