=============================
QEMU command-line passthrough
=============================

.. contents::

Libvirt aims to provide explicit modelling of virtualization features in
the domain XML document schema. QEMU has a very broad range of features
and not all of these can be mapped to elements in the domain XML. Libvirt
would like to reduce the gap to QEMU, however, with finite resources there
will always be cases which aren't covered by the domain XML schema.

XML document additions
======================

To deal with the problem, libvirt introduced support for command-line
passthrough of QEMU arguments. This is achieved by supporting a custom
XML namespace, under which some QEMU driver specific elements are defined.
The canonical place to declare the namespace is on the top level ``<domain>``
element. At the very end of the document, arbitrary command-line arguments
can now be added, using the namespace prefix ``qemu:``

::

   <domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
     <name>QEMUGuest1</name>
     <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid>
     ...
     <qemu:commandline>
       <qemu:arg value='-newarg'/>
       <qemu:arg value='parameter'/>
       <qemu:env name='ID' value='wibble'/>
       <qemu:env name='BAR'/>
     </qemu:commandline>
   </domain>

Note that when an argument takes a value, e.g. ``-newarg parameter``, the
argument and the value must be passed as separate ``<qemu:arg>`` entries.

Instead of declaring the XML namespace on the top level ``<domain>`` it is also
possible to declare it at time of use, which is more convenient for humans
writing the XML documents manually. So the following example is functionally
identical:

::

   <domain type='kvm'>
     <name>QEMUGuest1</name>
     <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid>
     ...
     <commandline xmlns="http://libvirt.org/schemas/domain/qemu/1.0">
       <arg value='-newarg'/>
       <arg value='parameter'/>
       <env name='ID' value='wibble'/>
       <env name='BAR'/>
     </commandline>
   </domain>

Note that when querying the XML from libvirt, it will have been translated into
the canonical syntax once more with the namespace on the top level element.
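
For example, fetching the XML back with ``virsh dumpxml`` (assuming the
domain name ``QEMUGuest1`` used throughout these examples) shows the
namespace declaration back on the top level ``<domain>`` element; the output
below is abbreviated:

::

   $ virsh dumpxml QEMUGuest1
   <domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
     <name>QEMUGuest1</name>
     ...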

Security confinement / sandboxing
=================================

When libvirt launches a QEMU process it makes use of a number of security
technologies to confine QEMU and thus protect the host from malicious VM
breakouts.

When configuring security protection, however, libvirt generally needs to
know exactly which host resources the VM is permitted to access. It gets this
information from the domain XML document. This only works for elements in the
regular schema; the arguments used with command-line passthrough are
completely opaque to libvirt.

As a result, if command-line passthrough is used to expose a file on the host
to QEMU, the security protections will activate and either kill QEMU or deny
it access.

There are two strategies for dealing with this problem: either figure out
what steps are needed to grant QEMU access to the device, or disable the
security protections. The former is harder, but more secure, while the latter
is simple.

Granting access per VM
----------------------

* SELinux - the file on the host needs an SELinux label that will grant
  access to QEMU's ``svirt_t`` policy (see the sketch after this list):

  - Read-only access - use the ``virt_content_t`` label
  - Shared, write access - use the ``svirt_image_t:s0`` label (i.e. no
    Multi-Category Security (MCS) value appended)
  - Exclusive, write access - use the ``svirt_image_t:s0:MCS`` label for the
    VM. The MCS is auto-generated at boot time, so this may require
    re-configuring the VM to have a fixed MCS label

* Discretionary Access Control (DAC) - the file on the host needs to be
  readable/writable to the ``qemu`` user or ``qemu`` group. This can be done
  by changing the file ownership to ``qemu``, or relaxing the permissions to
  allow world read, or adding file ACLs to allow access to ``qemu`` (again,
  see the sketch after this list).

* Namespaces - a private ``mount`` namespace is used for QEMU by default
  which populates a new ``/dev`` with only the device nodes needed by QEMU.
  There is no way to augment the set of device nodes ahead of time.

* Seccomp - libvirt launches QEMU with its built-in seccomp policy enabled
  with the ``obsolete=deny``, ``elevateprivileges=deny``, ``spawn=deny`` and
  ``resourcecontrol=deny`` settings active. There is no way to change this
  policy on a per-VM basis.

* Cgroups - a custom cgroup is created per VM and this will either use the
  ``devices`` controller or a ``BPF`` rule to define an access control list
  for the set of device nodes. There is no way to change this policy on a
  per-VM basis.
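
For instance, labelling a file for the SELinux access modes above, or
granting DAC access through ownership or a file ACL, might look like the
following; the file paths are purely illustrative:

::

   # Read-only access for any VM:
   chcon -t virt_content_t /srv/isos/extra.iso

   # Shared, writable access (plain s0 range, no MCS category):
   chcon -t svirt_image_t -l s0 /srv/images/shared.raw

   # DAC alternatives: change ownership, or add a file ACL:
   chown qemu:qemu /srv/images/extra.raw
   setfacl -m u:qemu:rw /srv/images/extra.raw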

Disabling security protection per VM
------------------------------------

Some of the security protections can be disabled per VM:

* SELinux - in the domain XML the ``<seclabel>`` model can be changed to
  ``none`` instead of ``selinux``, which will make the VM run unconfined.

* DAC - in the domain XML a ``<seclabel>`` element with the ``dac`` model can
  be added, configured with a user / group account of ``root`` to make QEMU
  run with full privileges (see the sketch after this list).

* Namespaces - there is no way to disable this per VM.

* Seccomp - there is no way to disable this per VM.

* Cgroups - there is no way to disable this per VM.
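
For illustration, the two relaxations above could be expressed in the domain
XML roughly as follows; this is a sketch, so consult the ``<seclabel>``
documentation for the exact semantics in your libvirt version:

::

   <!-- run the VM without SELinux confinement -->
   <seclabel type='none'/>

   <!-- run QEMU as root:root instead of the default qemu account -->
   <seclabel type='static' model='dac' relabel='no'>
     <label>root:root</label>
   </seclabel>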

Disabling security protection host-wide
---------------------------------------

As a last resort it is possible to disable security protection host-wide,
which will affect all virtual machines. These settings are all made in
``/etc/libvirt/qemu.conf``:

* SELinux - set ``security_default_confined = 0`` to make QEMU run unconfined
  by default, while still allowing explicit opt-in to SELinux for VMs.

* DAC - set ``user = root`` and ``group = root`` to make QEMU run as the root
  account.

* SELinux, DAC - set ``security_driver = []`` to entirely disable both the
  SELinux and DAC security drivers.

* Namespaces - set ``namespaces = []`` to disable use of the ``mount``
  namespaces, causing QEMU to see the normal fully populated ``/dev``.

* Seccomp - set ``seccomp_sandbox = 0`` to disable use of the Seccomp
  sandboxing in QEMU.

* Cgroups - set ``cgroup_device_acl`` to include the desired device node, or
  ``cgroup_controllers = [...]`` to exclude the ``devices`` controller.
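
Putting several of these together, a deliberately relaxed
``/etc/libvirt/qemu.conf`` might contain something like the following sketch.
Here ``/dev/mydevice`` is a placeholder for whatever node the passthrough
arguments need, and ``libvirtd`` must be restarted for changes to take
effect:

::

   # Run QEMU unconfined by SELinux unless a VM opts in explicitly
   security_default_confined = 0

   # Keep the normal /dev visible to QEMU
   namespaces = []

   # Turn off the QEMU seccomp sandbox
   seccomp_sandbox = 0

   # Extend the device ACL with an extra node (listing it replaces the
   # built-in default, so the usual nodes must be repeated)
   cgroup_device_acl = [
       "/dev/null", "/dev/full", "/dev/zero",
       "/dev/random", "/dev/urandom", "/dev/ptmx",
       "/dev/kvm", "/dev/mydevice"
   ]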

Private mount namespace
-----------------------

As mentioned above, libvirt launches each QEMU process in its own ``mount``
namespace. It's recommended that all mount points are set up prior to
starting any guest. For cases when that can't be assured, mount points in the
namespace are marked as slave so that mount events happening in the parent
namespace are propagated into this child namespace. But this may require an
additional step: mounts in the parent namespace need to be marked as shared
(if the distribution doesn't do that by default). This can be achieved by
running the following command before any guest is started:

::

   # mount --make-rshared /
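
Whether the change took effect can be checked with ``findmnt``, which can
print the propagation flags of a mount point; the output shown is
illustrative:

::

   $ findmnt -o TARGET,PROPAGATION /
   TARGET PROPAGATION
   /      shared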