QEMU/KVM hypervisor driver

The libvirt QEMU driver can manage any QEMU emulator of version 0.8.1 or later. It can also manage anything that provides the same QEMU command line syntax and monitor interaction. This includes KVM and Xenner.

Deployment pre-requisites

Connections to QEMU driver

The libvirt QEMU driver is a multi-instance driver, providing a single system-wide privileged driver (the "system" instance) and per-user unprivileged drivers (the "session" instance). The URI driver protocol is "qemu". Some example connection URIs for the libvirt driver are:

    qemu:///session                      (local access to per-user instance)
    qemu+unix:///session                 (local access to per-user instance)

    qemu:///system                       (local access to system instance)
    qemu+unix:///system                  (local access to system instance)
    qemu://example.com/system            (remote access, TLS/x509)
    qemu+tcp://example.com/system        (remote access, SASL/Kerberos)
    qemu+ssh://root@example.com/system   (remote access, SSH tunnelled)
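
The URIs above all share the same shape: a "qemu" driver name, an optional "+transport" suffix, an optional host, and a path naming the instance. A minimal sketch of that decomposition follows, using only the Python standard library. The function name describe_uri is hypothetical and this is not how libvirt itself parses URIs; it only illustrates which part of the URI carries which meaning.

```python
from urllib.parse import urlsplit


def describe_uri(uri):
    """Split a libvirt QEMU connection URI into its visible parts."""
    parts = urlsplit(uri)
    # The scheme is either "driver" or "driver+transport", e.g. "qemu+ssh".
    driver, _, transport = parts.scheme.partition("+")
    return {
        "driver": driver,
        "transport": transport or None,      # None: no explicit transport given
        "host": parts.hostname,              # None for local URIs (qemu:///system)
        "user": parts.username,              # None unless "user@" is present
        "instance": parts.path.lstrip("/"),  # "system" or "session"
    }


if __name__ == "__main__":
    for example in ("qemu:///session", "qemu+ssh://root@example.com/system"):
        print(example, describe_uri(example))
```

A URI with an empty host part, such as qemu:///system, is a local connection; the remote forms differ only in the transport suffix and host.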
    

Driver security architecture

There are multiple layers to security in the QEMU driver, allowing for flexibility in the use of QEMU based virtual machines.

Driver instances

As explained above, there are two ways to access the QEMU driver in libvirt. The "qemu:///session" family of URIs connect to a libvirtd instance running as the same user/group ID as the client application. Thus the QEMU instances spawned from this driver will share the same privileges as the client application. The intended use case for this driver is desktop virtualization, with virtual machines storing their disk images in the user's home directory and being managed from the local desktop login session.

The "qemu:///system" family of URIs connect to a libvirtd instance running as the privileged system account 'root'. Thus the QEMU instances spawned from this driver may have much higher privileges than the client application managing them. The intended use case for this driver is server virtualization, where the virtual machines may need to be connected to host resources (block, PCI, USB, network devices) whose access requires elevated privileges.

POSIX users/groups

In the "session" instance, the POSIX users/groups model restricts QEMU virtual machines (and libvirtd in general) to only have access to resources with the same user/group ID as the client application. There is no finer level of configuration possible for the "session" instances.

In the "system" instance, libvirt releases from 0.7.0 onwards allow control over the user/group that the QEMU virtual machines are run as. A build of libvirt with no configuration parameters set will still run QEMU processes as root:root. It is possible to change this default by using the --with-qemu-user=$USERNAME and --with-qemu-group=$GROUPNAME arguments to 'configure' during build. It is strongly recommended that vendors build with both of these arguments set to 'qemu'. Regardless of this build time default, administrators can set a per-host default setting in the /etc/libvirt/qemu.conf configuration file via the user=$USERNAME and group=$GROUPNAME parameters. When a non-root user or group is configured, the libvirt QEMU driver will change uid/gid to match immediately before executing the QEMU binary for a virtual machine.
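
As a rough illustration of the per-host setting, the sketch below reads user and group values from text in the style of /etc/libvirt/qemu.conf. The excerpt is hypothetical, the helper name read_qemu_conf is an assumption, and the parser only understands simple key = "value" lines; it is not libvirt's own configuration parser and ignores list-valued settings.

```python
def read_qemu_conf(text):
    """Collect simple key = "value" settings, skipping comments and blanks."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not an assignment; ignored in this sketch
        settings[key.strip()] = value.strip().strip('"')
    return settings


# Hypothetical excerpt, not a complete qemu.conf.
SAMPLE_CONF = '''
# Run QEMU processes as an unprivileged account.
user = "qemu"
group = "qemu"
'''

conf = read_qemu_conf(SAMPLE_CONF)
# Fall back to root:root, the default described above for a build
# with no --with-qemu-user / --with-qemu-group arguments.
run_as = (conf.get("user", "root"), conf.get("group", "root"))
```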

If QEMU virtual machines from the "system" instance are being run as non-root, there will be greater restrictions on what host resources the QEMU process will be able to access. The libvirtd daemon will attempt to manage permissions on resources to minimise the likelihood of unintentional security denials, but the administrator / application developer must be aware of some of the consequences / restrictions.

Linux process capabilities

The libvirt QEMU driver has a build time option allowing it to use the libcap-ng library to manage process capabilities. If this build option is enabled, then the QEMU driver will use this to ensure that all process capabilities are dropped before executing a QEMU virtual machine. Process capabilities are what give the 'root' account its high power; in particular, the CAP_DAC_OVERRIDE capability is what allows a process running as 'root' to access files owned by any user.

If the QEMU driver is configured to run virtual machines as non-root, then they will already lose all their process capabilities at startup. The Linux capability feature is thus aimed primarily at the scenario where the QEMU processes are running as root. In this case, before launching a QEMU virtual machine, libvirtd will use libcap-ng APIs to drop all process capabilities. It is important for administrators to note that this implies the QEMU process will only be able to access files owned by root, and not files owned by any other user.
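
One way to see whether a process still holds CAP_DAC_OVERRIDE is to inspect the CapEff bitmask that Linux exposes in /proc/PID/status. The sketch below is an illustration under that assumption, not something libvirt provides; the helper names are hypothetical, and the bit number 1 for CAP_DAC_OVERRIDE is taken from the kernel's capability header.

```python
CAP_DAC_OVERRIDE = 1  # bit number of CAP_DAC_OVERRIDE in <linux/capability.h>


def has_capability(mask_hex, cap):
    """Return True if bit number `cap` is set in a hex mask such as CapEff."""
    return bool(int(mask_hex, 16) & (1 << cap))


def effective_caps(pid="self"):
    """Return the CapEff hex string for a process, or None if unavailable."""
    try:
        with open(f"/proc/{pid}/status") as status:
            for line in status:
                if line.startswith("CapEff:"):
                    return line.split()[1]
    except FileNotFoundError:
        pass  # not Linux, or the process has gone away
    return None


if __name__ == "__main__":
    mask = effective_caps()
    if mask is not None:
        print("CAP_DAC_OVERRIDE effective:", has_capability(mask, CAP_DAC_OVERRIDE))
```

Passing the PID of a running QEMU process instead of "self" would show the mask for that guest; an all-zero mask corresponds to the "all capabilities dropped" state described above.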

Thus, if a vendor / distributor has configured their libvirt package to run as 'qemu' by default, a number of changes will be required before an administrator can change a host to run guests as root. In particular it will be necessary to change ownership on the directories /var/run/libvirt/qemu/, /var/lib/libvirt/qemu/ and /var/cache/libvirt/qemu/ back to root, in addition to changing the /etc/libvirt/qemu.conf settings.

SELinux basic confinement

The basic SELinux protection for QEMU virtual machines is intended to protect the host OS from a compromised virtual machine process. There is no protection between guests.

In the basic model, all QEMU virtual machines run under the confined domain root:system_r:qemu_t. It is required that any disk image assigned to a QEMU virtual machine is labelled with system_u:object_r:virt_image_t. In a default deployment, package vendors/distributors will typically ensure that the directory /var/lib/libvirt/images has this label, such that any disk images created in this directory will automatically inherit the correct labelling. If attempting to use disk images in another location, the user/administrator must ensure the directory has been given this requisite label. Likewise physical block devices must be labelled system_u:object_r:virt_image_t.

Not all filesystems allow for labelling of individual files. In particular NFS, VFAT and NTFS have no support for labelling. In these cases administrators must use the 'context' option when mounting the filesystem to set the default label to system_u:object_r:virt_image_t. In the case of NFS, there is an alternative: enabling the virt_use_nfs SELinux boolean.

SELinux sVirt confinement

The SELinux sVirt protection for QEMU virtual machines builds on the basic level of protection, to also allow individual guests to be protected from each other.

In the sVirt model, each QEMU virtual machine runs under its own confined domain, which is based on system_u:system_r:svirt_t:s0 with a unique category appended, e.g. system_u:system_r:svirt_t:s0:c34,c44. The rules are set up such that a domain can only access files which are labelled with the matching category level, e.g. system_u:object_r:svirt_image_t:s0:c34,c44. This prevents one QEMU process from accessing any file resources that belong to another QEMU process.
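
The category matching can be pictured with a small model. The sketch below splits a label into user, role, type, sensitivity and categories, then checks that every category on a file is also held by the process. This is a deliberately simplified illustration: real SELinux policy also enforces type rules and supports category ranges, neither of which is modelled here, and both function names are hypothetical.

```python
def parse_context(label):
    """Split 'user:role:type:sensitivity[:categories]' into its fields."""
    user, role, type_, level = label.split(":", 3)
    sensitivity, _, categories = level.partition(":")
    return {
        "user": user,
        "role": role,
        "type": type_,
        "sensitivity": sensitivity,
        # Only comma-separated categories are handled, not ranges.
        "categories": frozenset(categories.split(",")) if categories else frozenset(),
    }


def categories_allow(process_label, file_label):
    """True if the process holds every category attached to the file."""
    process = parse_context(process_label)["categories"]
    target = parse_context(file_label)["categories"]
    return target <= process


guest = "system_u:system_r:svirt_t:s0:c34,c44"
own_disk = "system_u:object_r:svirt_image_t:s0:c34,c44"
other_disk = "system_u:object_r:svirt_image_t:s0:c12,c99"  # hypothetical second guest
```

In this model the guest may use own_disk but not other_disk, which is the isolation property the paragraph above describes.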

There are two ways of assigning labels to virtual machines under sVirt. In the default setup, if sVirt is enabled, guests will get an automatically assigned unique label each time they are booted. The libvirtd daemon will also automatically relabel exclusive access disk images to match this label. Disks that are marked as <shareable/> will get a generic label system_u:object_r:svirt_image_t:s0 allowing all guests read/write access to them, while disks marked as <readonly/> will get a generic label system_u:object_r:svirt_content_t:s0 which allows all guests read-only access.

With statically assigned labels, the application should include the desired guest and file labels in the XML at the time of creating the guest with libvirt. In this scenario the application is responsible for ensuring the disk images and similar resources are suitably labelled to match; libvirtd will not attempt any relabelling.

If the sVirt security model is active, then the node capabilities XML will include its details. If a virtual machine is currently protected by the security model, then the guest XML will include its assigned labels. If enabled at compile time, the sVirt security model will always be activated if SELinux is available on the host OS. To disable sVirt, and revert to the basic level of SELinux protection (host protection only), the /etc/libvirt/qemu.conf file can be used to change the setting to security_driver="none".
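
To make the "guest XML will include its assigned labels" point concrete, the sketch below reads a seclabel element from a guest description. The sample XML is illustrative rather than captured from a real host, the guest name and label values are assumptions, and read_seclabel is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Illustrative guest XML; names and label values are assumptions.
GUEST_XML = """
<domain type='kvm'>
  <name>demo</name>
  <seclabel type='dynamic' model='selinux'>
    <label>system_u:system_r:svirt_t:s0:c34,c44</label>
    <imagelabel>system_u:object_r:svirt_image_t:s0:c34,c44</imagelabel>
  </seclabel>
</domain>
"""


def read_seclabel(domain_xml):
    """Return the security label details of a guest, or None if absent."""
    seclabel = ET.fromstring(domain_xml).find("seclabel")
    if seclabel is None:
        return None  # guest is not described as protected by a security model
    return {
        "type": seclabel.get("type"),
        "model": seclabel.get("model"),
        "label": seclabel.findtext("label"),
        "imagelabel": seclabel.findtext("imagelabel"),
    }
```

A guest description without a seclabel element yields None here, which would correspond to a guest that is not currently protected by the security model.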

Cgroups device ACLs

Recent Linux kernels have a capability known as "cgroups" which is used for resource management. It is implemented via a number of "controllers", each controller covering a specific task/functional area. One of the available controllers is the "devices" controller, which is able to set up whitelists of block/character devices that a cgroup should be allowed to access. If the "devices" controller is mounted on a host, then libvirt will automatically create a dedicated cgroup for each QEMU virtual machine and set up the device whitelist so that the QEMU process can only access shared devices, plus the block devices that explicitly back its disk images.

The list of shared devices a guest is allowed access to is:

      /dev/null, /dev/full, /dev/zero,
      /dev/random, /dev/urandom,
      /dev/ptmx, /dev/kvm, /dev/kqemu,
      /dev/rtc, /dev/hpet, /dev/net/tun
    

In the event of unanticipated needs arising, this can be customized via the /etc/libvirt/qemu.conf file. To mount the cgroups device controller, the following commands should be run as root, prior to starting libvirtd:

      mkdir /dev/cgroup
      mount -t cgroup none /dev/cgroup -o devices
    

libvirt will then place each virtual machine in a cgroup at /dev/cgroup/libvirt/qemu/$VMNAME/.
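
A short sketch of how the resulting whitelist could be inspected follows. It assumes the mount point used in the commands above and the devices controller's "type major:minor access" line format, as found in that controller's devices.list file. The helper names and the sample entries are hypothetical, not output from a real guest.

```python
import os


def cgroup_dir(vm_name, mount_point="/dev/cgroup"):
    """Build the per-guest cgroup directory described above."""
    return os.path.join(mount_point, "libvirt", "qemu", vm_name)


def parse_devices_list(text):
    """Parse whitelist lines of the form 'c 1:3 rwm' into tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        dev_type, numbers, access = line.split()
        major, minor = numbers.split(":")
        entries.append((dev_type, major, minor, access))
    return entries


# Hypothetical content: character devices 1:3 (/dev/null) and 1:5 (/dev/zero).
SAMPLE_LIST = "c 1:3 rwm\nc 1:5 rwm\n"

if __name__ == "__main__":
    path = os.path.join(cgroup_dir("demo2"), "devices.list")
    print("would read:", path)
    print(parse_devices_list(SAMPLE_LIST))
```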

Import and export of libvirt domain XML configs

The QEMU driver currently supports a single native config format known as qemu-argv. The data for this format is expected to be a single line containing first a list of environment variables, then the QEMU binary name, and finally the QEMU command line arguments.
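
The three-part layout of a qemu-argv line can be sketched as follows. This is only an illustration of the format as described above, with a hypothetical helper name; it is not the parser libvirt uses, and it treats any leading NAME=value token as an environment variable.

```python
import re
import shlex

# A leading token counts as an environment assignment if it looks like NAME=...
ENV_ASSIGNMENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*=")


def split_qemu_argv(line):
    """Split a qemu-argv line into (environment, binary, arguments)."""
    tokens = shlex.split(line)
    env = {}
    index = 0
    while index < len(tokens) and ENV_ASSIGNMENT.match(tokens[index]):
        name, _, value = tokens[index].partition("=")
        env[name] = value
        index += 1
    if index == len(tokens):
        raise ValueError("no QEMU binary found after the environment variables")
    return env, tokens[index], tokens[index + 1:]


if __name__ == "__main__":
    sample = "LC_ALL=C PATH=/bin /usr/bin/qemu -S -M pc -m 214"
    print(split_qemu_argv(sample))
```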

Converting from QEMU args to domain XML

The virsh domxml-from-native command provides a way to convert an existing set of QEMU args into a guest description using libvirt Domain XML that can then be used by libvirt.

$ cat > demo.args <<EOF
LC_ALL=C PATH=/bin HOME=/home/test USER=test \
LOGNAME=test /usr/bin/qemu -S -M pc -m 214 -smp 1 \
-nographic -monitor pty -no-acpi -boot c -hda \
/dev/HostVG/QEMUGuest1 -net none -serial none \
-parallel none -usb
EOF
$ virsh domxml-from-native qemu-argv demo.args
<domain type='qemu'>
  <uuid>00000000-0000-0000-0000-000000000000</uuid>
  <memory>219136</memory>
  <currentMemory>219136</currentMemory>
  <vcpu>1</vcpu>
  <os>
    <type arch='i686' machine='pc'>hvm</type>
    <boot dev='hd'/>
  </os>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <devices>
    <emulator>/usr/bin/qemu</emulator>
    <disk type='block' device='disk'>
      <source dev='/dev/HostVG/QEMUGuest1'/>
      <target dev='hda' bus='ide'/>
    </disk>
  </devices>
</domain>
    

NB, don't include the literal \ in the args; put everything on one line.

Converting from domain XML to QEMU args

The virsh domxml-to-native command provides a way to convert a guest description using libvirt Domain XML into a set of QEMU args that can be run manually.

$ cat > demo.xml <<EOF
<domain type='qemu'>
  <name>QEMUGuest1</name>
  <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid>
  <memory>219200</memory>
  <currentMemory>219200</currentMemory>
  <vcpu>1</vcpu>
  <os>
    <type arch='i686' machine='pc'>hvm</type>
    <boot dev='hd'/>
  </os>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <devices>
    <emulator>/usr/bin/qemu</emulator>
    <disk type='block' device='disk'>
      <source dev='/dev/HostVG/QEMUGuest1'/>
      <target dev='hda' bus='ide'/>
    </disk>
  </devices>
</domain>
EOF
$ virsh domxml-to-native qemu-argv demo.xml
  LC_ALL=C PATH=/usr/bin:/bin HOME=/home/test \
  USER=test LOGNAME=test /usr/bin/qemu -S -M pc \
  -no-kqemu -m 214 -smp 1 -name QEMUGuest1 -nographic \
  -monitor pty -no-acpi -boot c -drive \
  file=/dev/HostVG/QEMUGuest1,if=ide,index=0 -net none \
  -serial none -parallel none -usb
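
The memory figures in the two conversions are related by a unit change: the QEMU -m argument is in MiB, while the memory element in the domain XML is in KiB. The sketch below reproduces the numbers from the examples (214 and 219136 in one direction, 219200 and 214 in the other). The helper names are hypothetical, and the use of integer division when going back to MiB is an assumption that happens to match the example output.

```python
KIB_PER_MIB = 1024


def qemu_m_to_xml_memory(mib):
    """Convert a QEMU '-m' value (MiB) to a domain XML memory value (KiB)."""
    return mib * KIB_PER_MIB


def xml_memory_to_qemu_m(kib):
    """Convert a domain XML memory value (KiB) to a QEMU '-m' value (MiB).

    Integer division is an assumption; it matches 219200 KiB -> -m 214.
    """
    return kib // KIB_PER_MIB
```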
    

Example domain XML config

QEMU emulated guest on x86_64

<domain type='qemu'>
  <name>QEmu-fedora-i686</name>
  <uuid>c7a5fdbd-cdaf-9455-926a-d65c16db1809</uuid>
  <memory>219200</memory>
  <currentMemory>219200</currentMemory>
  <vcpu>2</vcpu>
  <os>
    <type arch='i686' machine='pc'>hvm</type>
    <boot dev='cdrom'/>
  </os>
  <devices>
    <emulator>/usr/bin/qemu-system-x86_64</emulator>
    <disk type='file' device='cdrom'>
      <source file='/home/user/boot.iso'/>
      <target dev='hdc'/>
      <readonly/>
    </disk>
    <disk type='file' device='disk'>
      <source file='/home/user/fedora.img'/>
      <target dev='hda'/>
    </disk>
    <interface type='network'>
      <source network='default'/>
    </interface>
    <graphics type='vnc' port='-1'/>
  </devices>
</domain>
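
For applications that need to inspect a configuration like the one above, a minimal sketch using the Python standard library is shown below. It lists each disk with its source path, target device and read-only flag. The XML is an abbreviated copy of the example, and list_disks is a hypothetical helper rather than part of any libvirt API.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the emulated guest example above.
DOMAIN_XML = """
<domain type='qemu'>
  <name>QEmu-fedora-i686</name>
  <devices>
    <disk type='file' device='cdrom'>
      <source file='/home/user/boot.iso'/>
      <target dev='hdc'/>
      <readonly/>
    </disk>
    <disk type='file' device='disk'>
      <source file='/home/user/fedora.img'/>
      <target dev='hda'/>
    </disk>
  </devices>
</domain>
"""


def list_disks(domain_xml):
    """Return one dict per <disk> element in a domain description."""
    root = ET.fromstring(domain_xml)
    disks = []
    for disk in root.findall("./devices/disk"):
        source = disk.find("source")
        target = disk.find("target")
        path = None
        if source is not None:
            # File-backed disks use 'file', block-backed disks use 'dev'.
            path = source.get("file") or source.get("dev")
        disks.append({
            "device": disk.get("device"),
            "path": path,
            "target": target.get("dev") if target is not None else None,
            "readonly": disk.find("readonly") is not None,
        })
    return disks
```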

KVM hardware accelerated guest on i686

<domain type='kvm'>
  <name>demo2</name>
  <uuid>4dea24b3-1d52-d8f3-2516-782e98a23fa0</uuid>
  <memory>131072</memory>
  <vcpu>1</vcpu>
  <os>
    <type arch="i686">hvm</type>
  </os>
  <clock offset="localtime"/>
  <devices>
    <emulator>/usr/bin/qemu-kvm</emulator>
    <disk type='file' device='disk'>
      <source file='/var/lib/libvirt/images/demo2.img'/>
      <target dev='hda'/>
    </disk>
    <interface type='network'>
      <source network='default'/>
      <mac address='24:42:53:21:52:45'/>
    </interface>
    <graphics type='vnc' port='-1' keymap='de'/>
  </devices>
</domain>

Xen paravirtualized guests with hardware acceleration