2013-05-03 16:58:26 +01:00
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<h1>Control Groups Resource Management</h1>

<ul id="toc"></ul>

<p>
The QEMU and LXC drivers make use of the Linux "Control Groups" facility
for applying resource management to their virtual machines and containers.
</p>

<h2><a name="requiredControllers">Required controllers</a></h2>

<p>
The control groups filesystem supports multiple "controllers". By default
the init system (such as systemd) should mount all controllers compiled
into the kernel at <code>/sys/fs/cgroup/$CONTROLLER-NAME</code>. Libvirt
will never attempt to mount any controllers itself; it merely detects where
they are mounted.
</p>
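
<p>
As a rough illustration of the detection step, the following shell sketch
lists where each controller is mounted by reading a mounts table. It only
looks at cgroup v1 style mounts, and the function name
<code>list_cgroup_mounts</code> and its optional file argument are
hypothetical names chosen for this example. It is not a description of how
libvirt itself performs the detection.
</p>

```shell
# Print "<controllers> <mount point>" for every cgroup (v1) mount listed in
# a mounts table. The table defaults to /proc/mounts; accepting another file
# as the first argument is a convenience of this sketch (hypothetical).
list_cgroup_mounts() (
    mounts_file=${1:-/proc/mounts}
    # Fields per line: device, mount point, fs type, options, dump, pass
    while read -r _dev mnt fstype _opts _rest; do
        if [ "$fstype" = "cgroup" ]; then
            # By convention the last path component names the controller(s)
            printf '%s %s\n' "${mnt##*/}" "$mnt"
        fi
    done < "$mounts_file"
)
```

<p>
Calling <code>list_cgroup_mounts</code> without arguments reads the live
mounts table of the host.
</p>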

<p>
The QEMU driver is capable of using the <code>cpuset</code>,
<code>cpu</code>, <code>memory</code>, <code>blkio</code> and
<code>devices</code> controllers. None of them are compulsory.
If any controller is not mounted, the resource management APIs
which use it will cease to operate. It is possible to explicitly
turn off use of a controller, even when mounted, via the
<code>/etc/libvirt/qemu.conf</code> configuration file.
</p>

<p>
The LXC driver is capable of using the <code>cpuset</code>,
<code>cpu</code>, <code>cpuacct</code>, <code>freezer</code>,
<code>memory</code>, <code>blkio</code> and <code>devices</code>
controllers. The <code>cpuacct</code>, <code>devices</code>
and <code>memory</code> controllers are compulsory. Without
them mounted, no containers can be started. If any of the
other controllers are not mounted, the resource management APIs
which use them will cease to operate.
</p>

<h2><a name="currentLayout">Current cgroups layout</a></h2>

<p>
As of libvirt 1.0.5, the cgroups layout created by libvirt has been
simplified, in order to facilitate the setup of resource control policies by
administrators / management applications. The new layout is based on the concepts
of "partitions" and "consumers". A "consumer" is a cgroup which holds the
processes for a single virtual machine or container. A "partition" is a cgroup
which does not contain any processes, but can have resource controls applied.
A "partition" will have zero or more child directories, each of which may be
either a "consumer" or a "partition".
</p>

<p>
As of libvirt 1.1.1, the cgroups layout has some slight
differences when running on a host with systemd 205 or later. The overall
tree structure is the same, but there are some differences in the naming
conventions for the cgroup directories. Thus the following docs are split
in two, one part describing systemd hosts and the other non-systemd hosts.
</p>

<h3><a name="currentLayoutSystemd">Systemd cgroups integration</a></h3>

<p>
On hosts which use systemd, each consumer maps to a systemd scope unit,
while each partition maps to a systemd slice unit.
</p>

<h4><a name="systemdScope">Systemd scope naming</a></h4>

<p>
The systemd convention is for the scope name of virtual machines / containers
to be of the general format <code>machine-$NAME.scope</code>. Libvirt forms the
<code>$NAME</code> part of this by concatenating the driver type with the name
of the guest, and then escaping any systemd reserved characters.
So for a guest <code>demo</code> running under the <code>lxc</code> driver,
we get a <code>$NAME</code> of <code>lxc-demo</code> which when escaped is
<code>lxc\x2ddemo</code>. So the complete scope name is
<code>machine-lxc\x2ddemo.scope</code>.
The scope names map directly to the cgroup directory names.
</p>
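
<p>
The naming rule can be sketched in shell as follows. This is a simplified
illustration only: it escapes the hyphen and nothing else, whereas real
systemd escaping covers further characters. The function names
<code>escape_hyphens</code> and <code>scope_name</code> are hypothetical.
</p>

```shell
# Replace every "-" in the argument with the escape sequence "\x2d".
# Simplification: real systemd escaping handles more characters than this.
escape_hyphens() (
    rest=$1
    out=
    while :; do
        case $rest in
            *-*)
                # Copy the text before the first hyphen, then the escape
                out="$out${rest%%-*}\\x2d"
                rest=${rest#*-}
                ;;
            *)
                out="$out$rest"
                break
                ;;
        esac
    done
    printf '%s\n' "$out"
)

# Build the scope name from a driver type ($1) and a guest name ($2).
scope_name() (
    printf 'machine-%s.scope\n' "$(escape_hyphens "$1-$2")"
)

scope_name lxc demo   # prints machine-lxc\x2ddemo.scope
```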

<h4><a name="systemdSlice">Systemd slice naming</a></h4>

<p>
The systemd convention for slice naming is that a slice should include the
names of all of its parents prepended to its own name. So for a libvirt
partition <code>/machine/engineering/testing</code>, the slice name will
be <code>machine-engineering-testing.slice</code>. Again the slice names
map directly to the cgroup directory names. Systemd creates three top level
slices by default, <code>system.slice</code>, <code>user.slice</code> and
<code>machine.slice</code>. All virtual machines or containers created
by libvirt will be associated with <code>machine.slice</code> by default.
</p>
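
<p>
As a minimal sketch of this convention, the shell function below turns a
generic partition path into the nested slice directory path. The name
<code>partition_to_slice_path</code> is hypothetical, and the sketch assumes
that no component of the path needs escaping.
</p>

```shell
# Translate a generic partition path (e.g. /machine/engineering/testing)
# into the nested systemd slice directory path.
# Assumption: component names contain no characters that need escaping.
partition_to_slice_path() (
    part=${1#/}      # drop the leading slash
    prefix=
    path=
    IFS=/            # split the partition path on "/"
    set -f           # no filename expansion while splitting
    for comp in $part; do
        # Each slice name carries the names of all of its parents
        prefix=${prefix:+$prefix-}$comp
        path="$path/$prefix.slice"
    done
    printf '%s\n' "$path"
)

partition_to_slice_path /machine/engineering/testing
```

<p>
For the partition used in the text this prints
<code>/machine.slice/machine-engineering.slice/machine-engineering-testing.slice</code>.
</p>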

<h4><a name="systemdLayout">Systemd cgroup layout</a></h4>

<p>
Given this, a possible systemd cgroups layout involving 3 qemu guests,
3 lxc containers and 4 custom child slices, would be:
</p>

<pre>
$ROOT
  |
  +- system.slice
  |   |
  |   +- libvirtd.service
  |
  +- machine.slice
      |
      +- machine-qemu\x2dvm1.scope
      |   |
      |   +- emulator
      |   +- vcpu0
      |   +- vcpu1
      |
      +- machine-qemu\x2dvm2.scope
      |   |
      |   +- emulator
      |   +- vcpu0
      |   +- vcpu1
      |
      +- machine-qemu\x2dvm3.scope
      |   |
      |   +- emulator
      |   +- vcpu0
      |   +- vcpu1
      |
      +- machine-engineering.slice
      |   |
      |   +- machine-engineering-testing.slice
      |   |   |
      |   |   +- machine-lxc\x2dcontainer1.scope
      |   |
      |   +- machine-engineering-production.slice
      |       |
      |       +- machine-lxc\x2dcontainer2.scope
      |
      +- machine-marketing.slice
          |
          +- machine-lxc\x2dcontainer3.scope
</pre>

<h3><a name="currentLayoutGeneric">Non-systemd cgroups layout</a></h3>

<p>
On hosts which do not use systemd, each consumer has a corresponding cgroup
named <code>$VMNAME.libvirt-{qemu,lxc}</code>. Each consumer is associated
with exactly one partition, which also has a corresponding cgroup usually
named <code>$PARTNAME.partition</code>. The exceptions to this naming rule
are the three top level default partitions, named <code>/system</code> (for
system services), <code>/user</code> (for user login sessions) and
<code>/machine</code> (for virtual machines and containers). By default
every consumer will be associated with the <code>/machine</code>
partition.
</p>
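
<p>
The naming rules for non-systemd hosts can be sketched in shell as shown
below. The function names <code>partition_to_cgroup_dir</code> and
<code>consumer_cgroup_dir</code> are hypothetical, and the sketch assumes
that no name needs escaping. The resulting paths are relative to the mount
point of each controller.
</p>

```shell
# Translate a generic partition path into the cgroup directory path used on
# non-systemd hosts. The three top level default partitions keep their plain
# names; every other component gets a ".partition" suffix.
partition_to_cgroup_dir() (
    part=${1#/}      # drop the leading slash
    path=
    first=1
    IFS=/            # split the partition path on "/"
    set -f           # no filename expansion while splitting
    for comp in $part; do
        if [ "$first" = 1 ]; then
            case $comp in
                machine|system|user) path="/$comp" ;;
                *) path="/$comp.partition" ;;
            esac
            first=0
        else
            path="$path/$comp.partition"
        fi
    done
    printf '%s\n' "$path"
)

# Directory of a consumer: $1 = partition, $2 = guest name,
# $3 = driver type (qemu or lxc).
consumer_cgroup_dir() (
    printf '%s/%s.libvirt-%s\n' "$(partition_to_cgroup_dir "$1")" "$2" "$3"
)

consumer_cgroup_dir /machine vm1 qemu
consumer_cgroup_dir /machine/engineering/testing container1 lxc
```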

<p>
Given this, a possible non-systemd cgroups layout involving 3 qemu guests,
3 lxc containers and 4 custom child partitions, would be:
</p>

<pre>
$ROOT
  |
  +- system
  |   |
  |   +- libvirtd.service
  |
  +- machine
      |
      +- vm1.libvirt-qemu
      |   |
      |   +- emulator
      |   +- vcpu0
      |   +- vcpu1
      |
      +- vm2.libvirt-qemu
      |   |
      |   +- emulator
      |   +- vcpu0
      |   +- vcpu1
      |
      +- vm3.libvirt-qemu
      |   |
      |   +- emulator
      |   +- vcpu0
      |   +- vcpu1
      |
      +- engineering.partition
      |   |
      |   +- testing.partition
      |   |   |
      |   |   +- container1.libvirt-lxc
      |   |
      |   +- production.partition
      |       |
      |       +- container2.libvirt-lxc
      |
      +- marketing.partition
          |
          +- container3.libvirt-lxc
</pre>

<h2><a name="customPartiton">Using custom partitions</a></h2>

<p>
If there is a need to apply resource constraints to groups of
virtual machines or containers, then the single default
partition <code>/machine</code> may not be sufficiently
flexible. The administrator may wish to sub-divide the
default partition, for example into "testing" and "production"
partitions, and then assign each guest to a specific
sub-partition. This is achieved by adding a small element
to the guest domain XML config, just below the main <code>domain</code>
element:
</p>

<pre>
...
&lt;resource&gt;
  &lt;partition&gt;/machine/production&lt;/partition&gt;
&lt;/resource&gt;
...
</pre>

<p>
Note that the partition names in the guest XML use a
generic naming format, not the low level naming convention
required by the underlying host OS. That is, you should not include
any of the <code>.partition</code> or <code>.slice</code>
suffixes in the XML config. Given a partition name
<code>/machine/production</code>, libvirt will automatically
apply the platform specific translation required to get
<code>/machine/production.partition</code> (non-systemd)
or <code>/machine.slice/machine-production.slice</code>
(systemd) as the underlying cgroup name.
</p>

<p>
Libvirt will not auto-create the cgroups directory to back
this partition. In the future, libvirt / virsh will provide
APIs / commands to create custom partitions, but currently
this is left as an exercise for the administrator.
</p>

<p>
<strong>Note:</strong> the ability to place guests in custom
partitions is only available with libvirt &gt;= 1.0.5, using
the new cgroup layout. The legacy cgroups layout described
later in this document did not support customization per guest.
</p>

<h3><a name="createSystemd">Creating custom partitions (systemd)</a></h3>

<p>
Given the XML config above, the admin on a systemd based host would
need to create a unit file <code>/etc/systemd/system/machine-production.slice</code>:
</p>

<pre>
# cat &gt; /etc/systemd/system/machine-production.slice &lt;&lt;EOF
[Unit]
Description=VM production slice
Before=slices.target
Wants=machine.slice
EOF
# systemctl start machine-production.slice
</pre>
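
<p>
For nested partitions, one unit file is needed per custom level. The
following shell sketch generates them from a generic partition path. The
function name <code>write_slice_units</code> and its target directory
argument are hypothetical, and pointing <code>Wants=</code> at the parent
slice for deeper levels is an assumption that simply mirrors the single
level example above.
</p>

```shell
# Write one slice unit file per custom level of a partition path.
# $1 = generic partition path, $2 = target directory (hypothetical argument;
# a real host would use /etc/systemd/system).
# The top level machine.slice is provided by systemd and is skipped.
write_slice_units() (
    part=${1#/}
    dir=$2
    prefix=
    IFS=/            # split the partition path on "/"
    set -f           # no filename expansion while splitting
    for comp in $part; do
        parent=$prefix
        prefix=${prefix:+$prefix-}$comp
        # The first component has no parent slice, so nothing is written
        [ -n "$parent" ] || continue
        # Assumption: each slice wants its parent, as in the example above
        cat > "$dir/$prefix.slice" <<EOF
[Unit]
Description=Slice for partition $prefix
Before=slices.target
Wants=$parent.slice
EOF
    done
)
```

<p>
Running <code>write_slice_units /machine/engineering/testing DIR</code>
would create <code>machine-engineering.slice</code> and
<code>machine-engineering-testing.slice</code> in <code>DIR</code>. Each
slice would still need to be started as shown above.
</p>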

<h3><a name="createNonSystemd">Creating custom partitions (non-systemd)</a></h3>

<p>
Given the XML config above, the admin on a non-systemd based host
would need to create a cgroup named <code>/machine/production.partition</code>:
</p>

<pre>
# cd /sys/fs/cgroup
# for i in blkio cpu,cpuacct cpuset devices freezer memory net_cls perf_event
  do
    mkdir $i/machine/production.partition
  done
# for i in cpuset.cpus cpuset.mems
  do
    cat cpuset/machine/$i &gt; cpuset/machine/production.partition/$i
  done
</pre>

<h2><a name="resourceAPIs">Resource management APIs/commands</a></h2>

<p>
Since libvirt aims to provide an API which is portable across
hypervisors, the concept of cgroups is not exposed directly
in the API or XML configuration. It is considered to be an
internal implementation detail. Instead libvirt provides a
set of APIs for applying resource controls, which are then
mapped to corresponding cgroup tunables.
</p>

<h3>Scheduler tuning</h3>

<p>
Parameters from the "cpu" controller are exposed via the
<code>schedinfo</code> command in virsh.
</p>

<pre>
# virsh schedinfo demo
Scheduler      : posix
cpu_shares     : 1024
vcpu_period    : 100000
vcpu_quota     : -1
emulator_period: 100000
emulator_quota : -1</pre>
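
<p>
The period and quota values are expressed in microseconds: within each
period, the group may run for at most the quota, and a negative quota means
no limit. As a small worked example, the shell sketch below converts such a
pair into the cap it implies as a percentage of one CPU. The function name
<code>cpu_cap_percent</code> is hypothetical and the result is rounded down
to a whole percent.
</p>

```shell
# Convert a quota/period pair (both in microseconds) into the CPU time cap
# it implies, as a percentage of one CPU (integer, rounded down).
# A negative quota means "no limit".
cpu_cap_percent() (
    quota=$1
    period=$2
    if [ "$quota" -lt 0 ]; then
        printf 'unlimited\n'
    else
        printf '%d\n' $(( $quota * 100 / $period ))
    fi
)

cpu_cap_percent -1 100000      # the values shown above: prints unlimited
cpu_cap_percent 50000 100000   # half of each period: prints 50
```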

<h3>Block I/O tuning</h3>

<p>
Parameters from the "blkio" controller are exposed via the
<code>blkiotune</code> command in virsh.
</p>

<pre>
# virsh blkiotune demo
weight         : 500
device_weight  : </pre>

<h3>Memory tuning</h3>

<p>
Parameters from the "memory" controller are exposed via the
<code>memtune</code> command in virsh.
</p>

<pre>
# virsh memtune demo
hard_limit     : 580192
soft_limit     : unlimited
swap_hard_limit: unlimited
</pre>

<h3>Network tuning</h3>

<p>
The <code>net_cls</code> controller is not currently used. Instead traffic
filter policies are set directly against individual virtual
network interfaces.
</p>

<h2><a name="legacyLayout">Legacy cgroups layout</a></h2>

<p>
Prior to libvirt 1.0.5, the cgroups layout created by libvirt was different
from that described above, and did not allow for administrator customization.
Libvirt used a fixed, 3-level hierarchy <code>libvirt/{qemu,lxc}/$VMNAME</code>
which was rooted at the point in the hierarchy where libvirtd itself was
located. So if libvirtd was placed at <code>/system/libvirtd.service</code>
by systemd, the groups for each virtual machine / container would be located
at <code>/system/libvirtd.service/libvirt/{qemu,lxc}/$VMNAME</code>. In addition
to this, the QEMU driver created further child groups for each vCPU thread and the
emulator thread(s). This led to a hierarchy that looked like:
</p>

<pre>
$ROOT
  |
  +- system
      |
      +- libvirtd.service
           |
           +- libvirt
               |
               +- qemu
               |   |
               |   +- vm1
               |   |   |
               |   |   +- emulator
               |   |   +- vcpu0
               |   |   +- vcpu1
               |   |
               |   +- vm2
               |   |   |
               |   |   +- emulator
               |   |   +- vcpu0
               |   |   +- vcpu1
               |   |
               |   +- vm3
               |       |
               |       +- emulator
               |       +- vcpu0
               |       +- vcpu1
               |
               +- lxc
                   |
                   +- container1
                   |
                   +- container2
                   |
                   +- container3
</pre>

<p>
Although current releases are much improved, historically the use of deep
hierarchies has had a significant negative impact on kernel scalability.
The legacy libvirt cgroups layout highlighted these problems, to the detriment
of the performance of virtual machines and containers.
</p>
</body>
</html>