mirror of
https://gitlab.com/libvirt/libvirt.git
synced 2024-12-22 13:45:38 +00:00
docs: Convert 'drvlxc' page to rST
Signed-off-by: Peter Krempa <pkrempa@redhat.com> Reviewed-by: Erik Skultety <eskultet@redhat.com>
This commit is contained in:
parent
c4611b327e
commit
05a514b0b3
@ -1,822 +0,0 @@
|
||||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE html>
|
||||
<html xmlns="http://www.w3.org/1999/xhtml">
|
||||
<body>
|
||||
<h1>LXC container driver</h1>
|
||||
|
||||
<ul id="toc"></ul>
|
||||
|
||||
<p>
|
||||
The libvirt LXC driver manages "Linux Containers". At their simplest, containers
|
||||
can just be thought of as a collection of processes, separated from the main
|
||||
host processes via a set of resource namespaces and constrained via control
|
||||
groups resource tunables. The libvirt LXC driver has no dependency on the LXC
|
||||
userspace tools hosted on sourceforge.net. It directly utilizes the relevant
|
||||
kernel features to build the container environment. This allows for sharing
|
||||
of many libvirt technologies across both the QEMU/KVM and LXC drivers. In
|
||||
particular sVirt for mandatory access control, auditing of operations,
|
||||
integration with control groups and many other features.
|
||||
</p>
|
||||
|
||||
<h2><a id="cgroups">Control groups Requirements</a></h2>
|
||||
|
||||
<p>
|
||||
In order to control the resource usage of processes inside containers, the
|
||||
libvirt LXC driver requires that certain cgroups controllers are mounted on
|
||||
the host OS. The minimum required controllers are 'cpuacct', 'memory' and
|
||||
'devices', while recommended extra controllers are 'cpu', 'freezer' and
|
||||
'blkio'. Libvirt will not mount the cgroups filesystem itself, leaving
|
||||
this up to the init system to take care of. Systemd will do the right thing
|
||||
in this respect, while for other init systems the <code>cgconfig</code>
|
||||
init service will be required. For further information, consult the general
|
||||
libvirt <a href="cgroups.html">cgroups documentation</a>.
|
||||
</p>
|
||||
|
||||
<h2><a id="namespaces">Namespace requirements</a></h2>
|
||||
|
||||
<p>
|
||||
In order to separate processes inside a container from those in the
|
||||
primary "host" OS environment, the libvirt LXC driver requires that
|
||||
certain kernel namespaces are compiled in. Libvirt currently requires
|
||||
the 'mount', 'ipc', 'pid', and 'uts' namespaces to be available. If
|
||||
separate network interfaces are desired, then the 'net' namespace is
|
||||
required. If the guest configuration declares a
|
||||
<a href="formatdomain.html#elementsOSContainer">UID or GID mapping</a>,
|
||||
the 'user' namespace will be enabled to apply these. <strong>A suitably
|
||||
configured UID/GID mapping is a pre-requisite to making containers
|
||||
secure, in the absence of sVirt confinement.</strong>
|
||||
</p>
|
||||
|
||||
<h2><a id="init">Default container setup</a></h2>
|
||||
|
||||
<h3><a id="cliargs">Command line arguments</a></h3>
|
||||
|
||||
<p>
|
||||
When the container "init" process is started, it will typically
|
||||
not be given any command line arguments (eg the equivalent of
|
||||
the bootloader args visible in <code>/proc/cmdline</code>). If
|
||||
any arguments are desired, then must be explicitly set in the
|
||||
container XML configuration via one or more <code>initarg</code>
|
||||
elements. For example, to run <code>systemd --unit emergency.service</code>
|
||||
would use the following XML
|
||||
</p>
|
||||
|
||||
<pre>
|
||||
<os>
|
||||
<type arch='x86_64'>exe</type>
|
||||
<init>/bin/systemd</init>
|
||||
<initarg>--unit</initarg>
|
||||
<initarg>emergency.service</initarg>
|
||||
</os>
|
||||
</pre>
|
||||
|
||||
<h3><a id="envvars">Environment variables</a></h3>
|
||||
|
||||
<p>
|
||||
When the container "init" process is started, it will be given several useful
|
||||
environment variables. The following standard environment variables are mandated
|
||||
by <a href="https://www.freedesktop.org/wiki/Software/systemd/ContainerInterface">systemd container interface</a>
|
||||
to be provided by all container technologies on Linux.
|
||||
</p>
|
||||
|
||||
<dl>
|
||||
<dt><code>container</code></dt>
|
||||
<dd>The fixed string <code>libvirt-lxc</code> to identify libvirt as the creator</dd>
|
||||
<dt><code>container_uuid</code></dt>
|
||||
<dd>The UUID assigned to the container by libvirt</dd>
|
||||
<dt><code>PATH</code></dt>
|
||||
<dd>The fixed string <code>/bin:/usr/bin</code></dd>
|
||||
<dt><code>TERM</code></dt>
|
||||
<dd>The fixed string <code>linux</code></dd>
|
||||
<dt><code>HOME</code></dt>
|
||||
<dd>The fixed string <code>/</code></dd>
|
||||
</dl>
|
||||
|
||||
<p>
|
||||
In addition to the standard variables, the following libvirt specific
|
||||
environment variables are also provided
|
||||
</p>
|
||||
|
||||
<dl>
|
||||
<dt><code>LIBVIRT_LXC_NAME</code></dt>
|
||||
<dd>The name assigned to the container by libvirt</dd>
|
||||
<dt><code>LIBVIRT_LXC_UUID</code></dt>
|
||||
<dd>The UUID assigned to the container by libvirt</dd>
|
||||
<dt><code>LIBVIRT_LXC_CMDLINE</code></dt>
|
||||
<dd>The unparsed command line arguments specified in the container configuration.
|
||||
Use of this is discouraged, in favour of passing arguments directly to the
|
||||
container init process via the <code>initarg</code> config element.</dd>
|
||||
</dl>
|
||||
|
||||
<h3><a id="fsmounts">Filesystem mounts</a></h3>
|
||||
|
||||
<p>
|
||||
In the absence of any explicit configuration, the container will
|
||||
inherit the host OS filesystem mounts. A number of mount points will
|
||||
be made read only, or re-mounted with new instances to provide
|
||||
container specific data. The following special mounts are setup
|
||||
by libvirt
|
||||
</p>
|
||||
|
||||
<ul>
|
||||
<li><code>/dev</code> a new "tmpfs" pre-populated with authorized device nodes</li>
|
||||
<li><code>/dev/pts</code> a new private "devpts" instance for console devices</li>
|
||||
<li><code>/sys</code> the host "sysfs" instance remounted read-only</li>
|
||||
<li><code>/proc</code> a new instance of the "proc" filesystem</li>
|
||||
<li><code>/proc/sys</code> the host "/proc/sys" bind-mounted read-only</li>
|
||||
<li><code>/sys/fs/selinux</code> the host "selinux" instance remounted read-only</li>
|
||||
<li><code>/sys/fs/cgroup/NNNN</code> the host cgroups controllers bind-mounted to
|
||||
only expose the sub-tree associated with the container</li>
|
||||
<li><code>/proc/meminfo</code> a FUSE backed file reflecting memory limits of the container</li>
|
||||
</ul>
|
||||
|
||||
|
||||
<h3><a id="devnodes">Device nodes</a></h3>
|
||||
|
||||
<p>
|
||||
The container init process will be started with <code>CAP_MKNOD</code>
|
||||
capability removed and blocked from re-acquiring it. As such it will
|
||||
not be able to create any device nodes in <code>/dev</code> or anywhere
|
||||
else in its filesystems. Libvirt itself will take care of pre-populating
|
||||
the <code>/dev</code> filesystem with any devices that the container
|
||||
is authorized to use. The current devices that will be made available
|
||||
to all containers are
|
||||
</p>
|
||||
|
||||
<ul>
|
||||
<li><code>/dev/zero</code></li>
|
||||
<li><code>/dev/null</code></li>
|
||||
<li><code>/dev/full</code></li>
|
||||
<li><code>/dev/random</code></li>
|
||||
<li><code>/dev/urandom</code></li>
|
||||
<li><code>/dev/stdin</code> symlinked to <code>/proc/self/fd/0</code></li>
|
||||
<li><code>/dev/stdout</code> symlinked to <code>/proc/self/fd/1</code></li>
|
||||
<li><code>/dev/stderr</code> symlinked to <code>/proc/self/fd/2</code></li>
|
||||
<li><code>/dev/fd</code> symlinked to <code>/proc/self/fd</code></li>
|
||||
<li><code>/dev/ptmx</code> symlinked to <code>/dev/pts/ptmx</code></li>
|
||||
<li><code>/dev/console</code> symlinked to <code>/dev/pts/0</code></li>
|
||||
</ul>
|
||||
|
||||
<p>
|
||||
In addition, for every console defined in the guest configuration,
|
||||
a symlink will be created from <code>/dev/ttyN</code> symlinked to
|
||||
the corresponding <code>/dev/pts/M</code> pseudo TTY device. The
|
||||
first console will be <code>/dev/tty1</code>, with further consoles
|
||||
numbered incrementally from there.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
Since /dev/ttyN and /dev/console are linked to the pts devices. The
|
||||
tty device of login program is pts device. The pam module securetty
|
||||
may prevent root user from logging in container. If you want root
|
||||
user to log in container successfully, add the pts device to the file
|
||||
/etc/securetty of container.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
Further block or character devices will be made available to containers
|
||||
depending on their configuration.
|
||||
</p>
|
||||
|
||||
<h2><a id="security">Security considerations</a></h2>
|
||||
|
||||
<p>
|
||||
The libvirt LXC driver is fairly flexible in how it can be configured,
|
||||
and as such does not enforce a requirement for strict security
|
||||
separation between a container and the host. This allows it to be used
|
||||
in scenarios where only resource control capabilities are important,
|
||||
and resource sharing is desired. Applications wishing to ensure secure
|
||||
isolation between a container and the host must ensure that they are
|
||||
writing a suitable configuration.
|
||||
</p>
|
||||
|
||||
<h3><a id="securenetworking">Network isolation</a></h3>
|
||||
|
||||
<p>
|
||||
If the guest configuration does not list any network interfaces,
|
||||
the <code>network</code> namespace will not be activated, and thus
|
||||
the container will see all the host's network interfaces. This will
|
||||
allow apps in the container to bind to/connect from TCP/UDP addresses
|
||||
and ports from the host OS. It also allows applications to access
|
||||
UNIX domain sockets associated with the host OS, which are in the
|
||||
abstract namespace. If access to UNIX domains sockets in the abstract
|
||||
namespace is not wanted, then applications should set the
|
||||
<code><privnet/></code> flag in the
|
||||
<code><features>....</features></code> element.
|
||||
</p>
|
||||
|
||||
<h3><a id="securefs">Filesystem isolation</a></h3>
|
||||
|
||||
<p>
|
||||
If the guest configuration does not list any filesystems, then
|
||||
the container will be set up with a root filesystem that matches
|
||||
the host's root filesystem. As noted earlier, only a few locations
|
||||
such as <code>/dev</code>, <code>/proc</code> and <code>/sys</code>
|
||||
will be altered. This means that, in the absence of restrictions
|
||||
from sVirt, a process running as user/group N:M inside the container
|
||||
will be able to access almost exactly the same files as a process
|
||||
running as user/group N:M in the host.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
There are multiple options for restricting this. It is possible to
|
||||
simply map the existing root filesystem through to the container in
|
||||
read-only mode. Alternatively a completely separate root filesystem
|
||||
can be configured for the guest. In both cases, further sub-mounts
|
||||
can be applied to customize the content that is made visible. Note
|
||||
that in the absence of sVirt controls, it is still possible for the
|
||||
root user in a container to unmount any sub-mounts applied. The user
|
||||
namespace feature can also be used to restrict access to files based
|
||||
on the UID/GID mappings.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
Sharing the host filesystem tree, also allows applications to access
|
||||
UNIX domains sockets associated with the host OS, which are in the
|
||||
filesystem namespaces. It should be noted that a number of init
|
||||
systems including at least <code>systemd</code> and <code>upstart</code>
|
||||
have UNIX domain socket which are used to control their operation.
|
||||
Thus, if the directory/filesystem holding their UNIX domain socket is
|
||||
exposed to the container, it will be possible for a user in the container
|
||||
to invoke operations on the init service in the same way it could if
|
||||
outside the container. This also applies to other applications in the
|
||||
host which use UNIX domain sockets in the filesystem, such as DBus,
|
||||
Libvirtd, and many more. If this is not desired, then applications
|
||||
should either specify the UID/GID mapping in the configuration to
|
||||
enable user namespaces and thus block access to the UNIX domain socket
|
||||
based on permissions, or should ensure the relevant directories have
|
||||
a bind mount to hide them. This is particularly important for the
|
||||
<code>/run</code> or <code>/var/run</code> directories.
|
||||
</p>
|
||||
|
||||
|
||||
<h3><a id="secureusers">User and group isolation</a></h3>
|
||||
|
||||
<p>
|
||||
If the guest configuration does not list any ID mapping, then the
|
||||
user and group IDs used inside the container will match those used
|
||||
outside the container. In addition, the capabilities associated with
|
||||
a process in the container will infer the same privileges they would
|
||||
for a process in the host. This has obvious implications for security,
|
||||
since a root user inside the container will be able to access any
|
||||
file owned by root that is visible to the container, and perform more
|
||||
or less any privileged kernel operation. In the absence of additional
|
||||
protection from sVirt, this means that the root user inside a container
|
||||
is effectively as powerful as the root user in the host. There is no
|
||||
security isolation of the root user.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
The ID mapping facility was introduced to allow for stricter control
|
||||
over the privileges of users inside the container. It allows apps to
|
||||
define rules such as "user ID 0 in the container maps to user ID 1000
|
||||
in the host". In addition the privileges associated with capabilities
|
||||
are somewhat reduced so that they cannot be used to escape from the
|
||||
container environment. A full description of user namespaces is outside
|
||||
the scope of this document, however LWN has
|
||||
<a href="https://lwn.net/Articles/532593/">a good write-up on the topic</a>.
|
||||
From the libvirt point of view, the key thing to remember is that defining
|
||||
an ID mapping for users and groups in the container XML configuration
|
||||
causes libvirt to activate the user namespace feature.
|
||||
</p>
|
||||
|
||||
|
||||
<h2><a id="configFiles">Location of configuration files</a></h2>
|
||||
|
||||
<p>
|
||||
The LXC driver comes with sane default values. However, during its
|
||||
initialization it reads a configuration file which offers system
|
||||
administrator to override some of that default. The file is located
|
||||
under <code>/etc/libvirt/lxc.conf</code>
|
||||
</p>
|
||||
|
||||
|
||||
<h2><a id="activation">Systemd Socket Activation Integration</a></h2>
|
||||
|
||||
<p>
|
||||
The libvirt LXC driver provides the ability to pass across pre-opened file
|
||||
descriptors when starting LXC guests. This allows for libvirt LXC to support
|
||||
systemd's <a href="http://0pointer.de/blog/projects/socket-activated-containers.html">socket
|
||||
activation capability</a>, where an incoming client connection
|
||||
in the host OS will trigger the startup of a container, which runs another
|
||||
copy of systemd which gets passed the server socket, and then activates the
|
||||
actual service handler in the container.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
Let us assume that you already have a LXC guest created, running
|
||||
a systemd instance as PID 1 inside the container, which has an
|
||||
SSHD service configured. The goal is to automatically activate
|
||||
the container when the first SSH connection is made. The first
|
||||
step is to create a couple of unit files for the host OS systemd
|
||||
instance. The <code>/etc/systemd/system/mycontainer.service</code>
|
||||
unit file specifies how systemd will start the libvirt LXC container
|
||||
</p>
|
||||
|
||||
<pre>
|
||||
[Unit]
|
||||
Description=My little container
|
||||
|
||||
[Service]
|
||||
ExecStart=/usr/bin/virsh -c lxc:///system start --pass-fds 3 mycontainer
|
||||
ExecStop=/usr/bin/virsh -c lxc:///system destroy mycontainer
|
||||
Type=oneshot
|
||||
RemainAfterExit=yes
|
||||
KillMode=none
|
||||
</pre>
|
||||
|
||||
<p>
|
||||
The <code>--pass-fds 3</code> argument specifies that the file
|
||||
descriptor number 3 that <code>virsh</code> inherits from systemd,
|
||||
is to be passed into the container. Since <code>virsh</code> will
|
||||
exit immediately after starting the container, the <code>RemainAfterExit</code>
|
||||
and <code>KillMode</code> settings must be altered from their defaults.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
Next, the <code>/etc/systemd/system/mycontainer.socket</code> unit
|
||||
file is created to get the host systemd to listen on port 23 for
|
||||
TCP connections. When this unit file is activated by the first
|
||||
incoming connection, it will cause the <code>mycontainer.service</code>
|
||||
unit to be activated with the FD corresponding to the listening TCP
|
||||
socket passed in as FD 3.
|
||||
</p>
|
||||
|
||||
<pre>
|
||||
[Unit]
|
||||
Description=The SSH socket of my little container
|
||||
|
||||
[Socket]
|
||||
ListenStream=23
|
||||
</pre>
|
||||
|
||||
<p>
|
||||
Port 23 was picked here so that the container doesn't conflict
|
||||
with the host's SSH which is on the normal port 22. That's it
|
||||
in terms of host side configuration.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
Inside the container, the <code>/etc/systemd/system/sshd.socket</code>
|
||||
unit file must be created
|
||||
</p>
|
||||
|
||||
<pre>
|
||||
[Unit]
|
||||
Description=SSH Socket for Per-Connection Servers
|
||||
|
||||
[Socket]
|
||||
ListenStream=23
|
||||
Accept=yes
|
||||
</pre>
|
||||
|
||||
<p>
|
||||
The <code>ListenStream</code> value listed in this unit file, must
|
||||
match the value used in the host file. When systemd in the container
|
||||
receives the pre-opened FD from libvirt during container startup, it
|
||||
looks at the <code>ListenStream</code> values to figure out which
|
||||
FD to give to which service. The actual service to start is defined
|
||||
by a correspondingly named <code>/etc/systemd/system/sshd@.service</code>
|
||||
</p>
|
||||
|
||||
<pre>
|
||||
[Unit]
|
||||
Description=SSH Per-Connection Server for %I
|
||||
|
||||
[Service]
|
||||
ExecStart=-/usr/sbin/sshd -i
|
||||
StandardInput=socket
|
||||
</pre>
|
||||
|
||||
<p>
|
||||
Finally, make sure this SSH service is set to start on boot of the container,
|
||||
by running the following command inside the container:
|
||||
</p>
|
||||
|
||||
<pre>
|
||||
# mkdir -p /etc/systemd/system/sockets.target.wants/
|
||||
# ln -s /etc/systemd/system/sshd.socket /etc/systemd/system/sockets.target.wants/
|
||||
</pre>
|
||||
|
||||
<p>
|
||||
This example shows how to activate the container based on an incoming
|
||||
SSH connection. If the container was also configured to have an httpd
|
||||
service, it may be desirable to activate it upon either an httpd or a
|
||||
sshd connection attempt. In this case, the <code>mycontainer.socket</code>
|
||||
file in the host would simply list multiple socket ports. Inside the
|
||||
container a separate <code>xxxxx.socket</code> file would need to be
|
||||
created for each service, with a corresponding <code>ListenStream</code>
|
||||
value set.
|
||||
</p>
|
||||
|
||||
<!--
|
||||
<h2>Container configuration</h2>
|
||||
|
||||
<h3>Init process</h3>
|
||||
|
||||
<h3>Console devices</h3>
|
||||
|
||||
<h3>Filesystem devices</h3>
|
||||
|
||||
<h3>Disk devices</h3>
|
||||
|
||||
<h3>Block devices</h3>
|
||||
|
||||
<h3>USB devices</h3>
|
||||
|
||||
<h3>Character devices</h3>
|
||||
|
||||
<h3>Network devices</h3>
|
||||
-->
|
||||
|
||||
<h2>Container security</h2>
|
||||
|
||||
<h3>sVirt SELinux</h3>
|
||||
|
||||
<p>
|
||||
In the absence of the "user" namespace being used, containers cannot
|
||||
be considered secure against exploits of the host OS. The sVirt SELinux
|
||||
driver provides a way to secure containers even when the "user" namespace
|
||||
is not used. The cost is that writing a policy to allow execution of
|
||||
arbitrary OS is not practical. The SELinux sVirt policy is typically
|
||||
tailored to work with a simpler application confinement use case,
|
||||
as provided by the "libvirt-sandbox" project.
|
||||
</p>
|
||||
|
||||
<h3>Auditing</h3>
|
||||
|
||||
<p>
|
||||
The LXC driver is integrated with libvirt's auditing subsystem, which
|
||||
causes audit messages to be logged whenever there is an operation
|
||||
performed against a container which has impact on host resources.
|
||||
So for example, start/stop, device hotplug will all log audit messages
|
||||
providing details about what action occurred and any resources
|
||||
associated with it. There are the following 3 types of audit messages
|
||||
</p>
|
||||
|
||||
<ul>
|
||||
<li><code>VIRT_MACHINE_ID</code> - details of the SELinux process and
|
||||
image security labels assigned to the container.</li>
|
||||
<li><code>VIRT_CONTROL</code> - details of an action / operation
|
||||
performed against a container. There are the following types of
|
||||
operation
|
||||
<ul>
|
||||
<li><code>op=start</code> - a container has been started. Provides
|
||||
the machine name, uuid and PID of the <code>libvirt_lxc</code>
|
||||
controller process</li>
|
||||
<li><code>op=init</code> - the init PID of the container has been
|
||||
started. Provides the machine name, uuid and PID of the
|
||||
<code>libvirt_lxc</code> controller process and PID of the
|
||||
init process (in the host PID namespace)</li>
|
||||
<li><code>op=stop</code> - a container has been stopped. Provides
|
||||
the machine name, uuid</li>
|
||||
</ul>
|
||||
</li>
|
||||
<li><code>VIRT_RESOURCE</code> - details of a host resource
|
||||
associated with a container action.</li>
|
||||
</ul>
|
||||
|
||||
<h3>Device access</h3>
|
||||
|
||||
<p>
|
||||
All containers are launched with the CAP_MKNOD capability cleared
|
||||
and removed from the bounding set. Libvirt will ensure that the
|
||||
/dev filesystem is pre-populated with all devices that a container
|
||||
is allowed to use. In addition, the cgroup "device" controller is
|
||||
configured to block read/write/mknod from all devices except those
|
||||
that a container is authorized to use.
|
||||
</p>
|
||||
|
||||
<h2><a id="exconfig">Example configurations</a></h2>
|
||||
|
||||
<h3>Example config version 1</h3>
|
||||
<p></p>
|
||||
<pre>
|
||||
<domain type='lxc'>
|
||||
<name>vm1</name>
|
||||
<memory>500000</memory>
|
||||
<os>
|
||||
<type>exe</type>
|
||||
<init>/bin/sh</init>
|
||||
</os>
|
||||
<vcpu>1</vcpu>
|
||||
<clock offset='utc'/>
|
||||
<on_poweroff>destroy</on_poweroff>
|
||||
<on_reboot>restart</on_reboot>
|
||||
<on_crash>destroy</on_crash>
|
||||
<devices>
|
||||
<emulator>/usr/libexec/libvirt_lxc</emulator>
|
||||
<interface type='network'>
|
||||
<source network='default'/>
|
||||
</interface>
|
||||
<console type='pty' />
|
||||
</devices>
|
||||
</domain>
|
||||
</pre>
|
||||
|
||||
<p>
|
||||
In the <emulator> element, be sure you specify the correct path
|
||||
to libvirt_lxc, if it does not live in /usr/libexec on your system.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
The next example assumes there is a private root filesystem
|
||||
(perhaps hand-crafted using busybox, or installed from media,
|
||||
debootstrap, whatever) under /opt/vm-1-root:
|
||||
</p>
|
||||
<p></p>
|
||||
<pre>
|
||||
<domain type='lxc'>
|
||||
<name>vm1</name>
|
||||
<memory>32768</memory>
|
||||
<os>
|
||||
<type>exe</type>
|
||||
<init>/init</init>
|
||||
</os>
|
||||
<vcpu>1</vcpu>
|
||||
<clock offset='utc'/>
|
||||
<on_poweroff>destroy</on_poweroff>
|
||||
<on_reboot>restart</on_reboot>
|
||||
<on_crash>destroy</on_crash>
|
||||
<devices>
|
||||
<emulator>/usr/libexec/libvirt_lxc</emulator>
|
||||
<filesystem type='mount'>
|
||||
<source dir='/opt/vm-1-root'/>
|
||||
<target dir='/'/>
|
||||
</filesystem>
|
||||
<interface type='network'>
|
||||
<source network='default'/>
|
||||
</interface>
|
||||
<console type='pty' />
|
||||
</devices>
|
||||
</domain>
|
||||
</pre>
|
||||
|
||||
<h2><a id="capabilities">Altering the available capabilities</a></h2>
|
||||
|
||||
<p>
|
||||
By default the libvirt LXC driver drops some capabilities among which CAP_MKNOD.
|
||||
However <span class="since">since 1.2.6</span> libvirt can be told to keep or
|
||||
drop some capabilities using a domain configuration like the following:
|
||||
</p>
|
||||
<pre>
|
||||
...
|
||||
<features>
|
||||
<capabilities policy='default'>
|
||||
<mknod state='on'/>
|
||||
<sys_chroot state='off'/>
|
||||
</capabilities>
|
||||
</features>
|
||||
...
|
||||
</pre>
|
||||
<p>
|
||||
The capabilities children elements are named after the capabilities as defined in
|
||||
<code>man 7 capabilities</code>. An <code>off</code> state tells libvirt to drop the
|
||||
capability, while an <code>on</code> state will force to keep the capability even though
|
||||
this one is dropped by default.
|
||||
</p>
|
||||
<p>
|
||||
The <code>policy</code> attribute can be one of <code>default</code>, <code>allow</code>
|
||||
or <code>deny</code>. It defines the default rules for capabilities: either keep the
|
||||
default behavior that is dropping a few selected capabilities, or keep all capabilities
|
||||
or drop all capabilities. The interest of <code>allow</code> and <code>deny</code> is that
|
||||
they guarantee that all capabilities will be kept (or removed) even if new ones are added
|
||||
later.
|
||||
</p>
|
||||
<p>
|
||||
The following example, drops all capabilities but CAP_MKNOD:
|
||||
</p>
|
||||
<pre>
|
||||
...
|
||||
<features>
|
||||
<capabilities policy='deny'>
|
||||
<mknod state='on'/>
|
||||
</capabilities>
|
||||
</features>
|
||||
...
|
||||
</pre>
|
||||
<p>
|
||||
Note that allowing capabilities that are normally dropped by default can seriously
|
||||
affect the security of the container and the host.
|
||||
</p>
|
||||
|
||||
<h2><a id="share">Inherit namespaces</a></h2>
|
||||
|
||||
<p>
|
||||
Libvirt allows you to inherit the namespace from container/process just like lxc tools
|
||||
or docker provides to share the network namespace. The following can be used to share
|
||||
required namespaces. If we want to share only one then the other namespaces can be ignored.
|
||||
The netns option is specific to sharenet. It can be used in cases we want to use existing network namespace
|
||||
rather than creating new network namespace for the container. In this case privnet option will be
|
||||
ignored.
|
||||
</p>
|
||||
<pre>
|
||||
<domain type='lxc' xmlns:lxc='http://libvirt.org/schemas/domain/lxc/1.0'>
|
||||
...
|
||||
<lxc:namespace>
|
||||
<lxc:sharenet type='netns' value='red'/>
|
||||
<lxc:shareuts type='name' value='container1'/>
|
||||
<lxc:shareipc type='pid' value='12345'/>
|
||||
</lxc:namespace>
|
||||
</domain>
|
||||
</pre>
|
||||
|
||||
<p>
|
||||
The use of namespace passthrough requires libvirt >= 1.2.19
|
||||
</p>
|
||||
|
||||
<h2><a id="usage">Container usage / management</a></h2>
|
||||
|
||||
<p>
|
||||
As with any libvirt virtualization driver, LXC containers can be
|
||||
managed via a wide variety of libvirt based tools. At the lowest
|
||||
level the <code>virsh</code> command can be used to perform many
|
||||
tasks, by passing the <code>-c lxc:///system</code> argument. As an
|
||||
alternative to repeating the URI with every command, the <code>LIBVIRT_DEFAULT_URI</code>
|
||||
environment variable can be set to <code>lxc:///system</code>. The
|
||||
examples that follow outline some common operations with virsh
|
||||
and LXC. For further details about usage of virsh consult its
|
||||
manual page.
|
||||
</p>
|
||||
|
||||
<h3><a id="usageSave">Defining (saving) container configuration</a></h3>
|
||||
|
||||
<p>
|
||||
The <code>virsh define</code> command takes an XML configuration
|
||||
document and loads it into libvirt, saving the configuration on disk
|
||||
</p>
|
||||
|
||||
<pre>
|
||||
# virsh -c lxc:///system define myguest.xml
|
||||
</pre>
|
||||
|
||||
<h3><a id="usageView">Viewing container configuration</a></h3>
|
||||
|
||||
<p>
|
||||
The <code>virsh dumpxml</code> command can be used to view the
|
||||
current XML configuration of a container. By default the XML
|
||||
output reflects the current state of the container. If the
|
||||
container is running, it is possible to explicitly request the
|
||||
persistent configuration, instead of the current live configuration
|
||||
using the <code>--inactive</code> flag
|
||||
</p>
|
||||
|
||||
<pre>
|
||||
# virsh -c lxc:///system dumpxml myguest
|
||||
</pre>
|
||||
|
||||
<h3><a id="usageStart">Starting containers</a></h3>
|
||||
|
||||
<p>
|
||||
The <code>virsh start</code> command can be used to start a
|
||||
container from a previously defined persistent configuration
|
||||
</p>
|
||||
|
||||
<pre>
|
||||
# virsh -c lxc:///system start myguest
|
||||
</pre>
|
||||
|
||||
<p>
|
||||
It is also possible to start so called "transient" containers,
|
||||
which do not require a persistent configuration to be saved
|
||||
by libvirt, using the <code>virsh create</code> command.
|
||||
</p>
|
||||
|
||||
<pre>
|
||||
# virsh -c lxc:///system create myguest.xml
|
||||
</pre>
|
||||
|
||||
|
||||
<h3><a id="usageStop">Stopping containers</a></h3>
|
||||
|
||||
<p>
|
||||
The <code>virsh shutdown</code> command can be used
|
||||
to request a graceful shutdown of the container. By default
|
||||
this command will first attempt to send a message to the
|
||||
init process via the <code>/dev/initctl</code> device node.
|
||||
If no such device node exists, then it will send SIGTERM
|
||||
to PID 1 inside the container.
|
||||
</p>
|
||||
|
||||
<pre>
|
||||
# virsh -c lxc:///system shutdown myguest
|
||||
</pre>
|
||||
|
||||
<p>
|
||||
If the container does not respond to the graceful shutdown
|
||||
request, it can be forcibly stopped using the <code>virsh destroy</code>
|
||||
</p>
|
||||
|
||||
<pre>
|
||||
# virsh -c lxc:///system destroy myguest
|
||||
</pre>
|
||||
|
||||
|
||||
<h3><a id="usageReboot">Rebooting a container</a></h3>
|
||||
|
||||
<p>
|
||||
The <code>virsh reboot</code> command can be used
|
||||
to request a graceful shutdown of the container. By default
|
||||
this command will first attempt to send a message to the
|
||||
init process via the <code>/dev/initctl</code> device node.
|
||||
If no such device node exists, then it will send SIGHUP
|
||||
to PID 1 inside the container.
|
||||
</p>
|
||||
|
||||
<pre>
|
||||
# virsh -c lxc:///system reboot myguest
|
||||
</pre>
|
||||
|
||||
<h3><a id="usageDelete">Undefining (deleting) a container configuration</a></h3>
|
||||
|
||||
<p>
|
||||
The <code>virsh undefine</code> command can be used to delete the
|
||||
persistent configuration of a container. If the guest is currently
|
||||
running, this will turn it into a "transient" guest.
|
||||
</p>
|
||||
|
||||
<pre>
|
||||
# virsh -c lxc:///system undefine myguest
|
||||
</pre>
|
||||
|
||||
<h3><a id="usageConnect">Connecting to a container console</a></h3>
|
||||
|
||||
<p>
|
||||
The <code>virsh console</code> command can be used to connect
|
||||
to the text console associated with a container.
|
||||
</p>
|
||||
|
||||
<pre>
|
||||
# virsh -c lxc:///system console myguest
|
||||
</pre>
|
||||
|
||||
<p>
|
||||
If the container has been configured with multiple console devices,
|
||||
then the <code>--devname</code> argument can be used to choose the
|
||||
console to connect to.
|
||||
In LXC, multiple consoles will be named
|
||||
as 'console0', 'console1', 'console2', etc.
|
||||
</p>
|
||||
|
||||
<pre>
|
||||
# virsh -c lxc:///system console myguest --devname console1
|
||||
</pre>
|
||||
|
||||
<h3><a id="usageEnter">Running commands in a container</a></h3>
|
||||
|
||||
<p>
|
||||
The <code>virsh lxc-enter-namespace</code> command can be used
|
||||
to enter the namespaces and security context of a container
|
||||
and then execute an arbitrary command.
|
||||
</p>
|
||||
|
||||
<pre>
|
||||
# virsh -c lxc:///system lxc-enter-namespace myguest -- /bin/ls -al /dev
|
||||
</pre>
|
||||
|
||||
<h3><a id="usageTop">Monitoring container utilization</a></h3>
|
||||
|
||||
<p>
|
||||
The <code>virt-top</code> command can be used to monitor the
|
||||
activity and resource utilization of all containers on a
|
||||
host
|
||||
</p>
|
||||
|
||||
<pre>
|
||||
# virt-top -c lxc:///system
|
||||
</pre>
|
||||
|
||||
<h3><a id="usageConvert">Converting LXC container configuration</a></h3>
|
||||
|
||||
<p>
|
||||
The <code>virsh domxml-from-native</code> command can be used to convert
|
||||
most of the LXC container configuration into a domain XML fragment
|
||||
</p>
|
||||
|
||||
<pre>
|
||||
# virsh -c lxc:///system domxml-from-native lxc-tools /var/lib/lxc/myguest/config
|
||||
</pre>
|
||||
|
||||
<p>
|
||||
This conversion has some limitations due to the fact that the
|
||||
domxml-from-native command output has to be independent of the host. Here
|
||||
are a few things to take care of before converting:
|
||||
</p>
|
||||
|
||||
<ul>
|
||||
<li>
|
||||
Replace the fstab file referenced by <tt>lxc.mount</tt> by the corresponding
|
||||
lxc.mount.entry lines.
|
||||
</li>
|
||||
<li>
|
||||
Replace all relative sizes of tmpfs mount entries to absolute sizes. Also
|
||||
make sure that tmpfs entries all have a size option (default is 50%).
|
||||
</li>
|
||||
<li>
|
||||
Define <tt>lxc.cgroup.memory.limit_in_bytes</tt> to properly limit the memory
|
||||
available to the container. The conversion will use 64MiB as the default.
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
</body>
|
||||
</html>
|
670
docs/drvlxc.rst
Normal file
670
docs/drvlxc.rst
Normal file
@ -0,0 +1,670 @@
|
||||
.. role:: since
|
||||
|
||||
====================
|
||||
LXC container driver
|
||||
====================
|
||||
|
||||
.. contents::
|
||||
|
||||
The libvirt LXC driver manages "Linux Containers". At their simplest, containers
|
||||
can just be thought of as a collection of processes, separated from the main
|
||||
host processes via a set of resource namespaces and constrained via control
|
||||
groups resource tunables. The libvirt LXC driver has no dependency on the LXC
|
||||
userspace tools hosted on sourceforge.net. It directly utilizes the relevant
|
||||
kernel features to build the container environment. This allows for sharing of
|
||||
many libvirt technologies across both the QEMU/KVM and LXC drivers. In
|
||||
particular sVirt for mandatory access control, auditing of operations,
|
||||
integration with control groups and many other features.
|
||||
|
||||
Control groups Requirements
|
||||
---------------------------
|
||||
|
||||
In order to control the resource usage of processes inside containers, the
|
||||
libvirt LXC driver requires that certain cgroups controllers are mounted on the
|
||||
host OS. The minimum required controllers are 'cpuacct', 'memory' and 'devices',
|
||||
while recommended extra controllers are 'cpu', 'freezer' and 'blkio'. Libvirt
|
||||
will not mount the cgroups filesystem itself, leaving this up to the init system
|
||||
to take care of. Systemd will do the right thing in this respect, while for
|
||||
other init systems the ``cgconfig`` init service will be required. For further
|
||||
information, consult the general libvirt `cgroups
|
||||
documentation <cgroups.html>`__.
|
||||
|
||||
Namespace requirements
|
||||
----------------------
|
||||
|
||||
In order to separate processes inside a container from those in the primary
|
||||
"host" OS environment, the libvirt LXC driver requires that certain kernel
|
||||
namespaces are compiled in. Libvirt currently requires the 'mount', 'ipc',
|
||||
'pid', and 'uts' namespaces to be available. If separate network interfaces are
|
||||
desired, then the 'net' namespace is required. If the guest configuration
|
||||
declares a `UID or GID mapping <formatdomain.html#elementsOSContainer>`__, the
|
||||
'user' namespace will be enabled to apply these. **A suitably configured UID/GID
|
||||
mapping is a pre-requisite to making containers secure, in the absence of sVirt
|
||||
confinement.**
|
||||
|
||||
Default container setup
|
||||
-----------------------
|
||||
|
||||
Command line arguments
|
||||
~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
When the container "init" process is started, it will typically not be given any
|
||||
command line arguments (eg the equivalent of the bootloader args visible in
|
||||
``/proc/cmdline``). If any arguments are desired, then must be explicitly set in
|
||||
the container XML configuration via one or more ``initarg`` elements. For
|
||||
example, to run ``systemd --unit emergency.service`` would use the following XML
|
||||
|
||||
::
|
||||
|
||||
<os>
|
||||
<type arch='x86_64'>exe</type>
|
||||
<init>/bin/systemd</init>
|
||||
<initarg>--unit</initarg>
|
||||
<initarg>emergency.service</initarg>
|
||||
</os>
|
||||
|
||||
Environment variables
|
||||
~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
When the container "init" process is started, it will be given several useful
|
||||
environment variables. The following standard environment variables are mandated
|
||||
by `systemd container
|
||||
interface <https://www.freedesktop.org/wiki/Software/systemd/ContainerInterface>`__
|
||||
to be provided by all container technologies on Linux.
|
||||
|
||||
``container``
|
||||
The fixed string ``libvirt-lxc`` to identify libvirt as the creator
|
||||
``container_uuid``
|
||||
The UUID assigned to the container by libvirt
|
||||
``PATH``
|
||||
The fixed string ``/bin:/usr/bin``
|
||||
``TERM``
|
||||
The fixed string ``linux``
|
||||
``HOME``
|
||||
The fixed string ``/``
|
||||
|
||||
In addition to the standard variables, the following libvirt specific
|
||||
environment variables are also provided
|
||||
|
||||
``LIBVIRT_LXC_NAME``
|
||||
The name assigned to the container by libvirt
|
||||
``LIBVIRT_LXC_UUID``
|
||||
The UUID assigned to the container by libvirt
|
||||
``LIBVIRT_LXC_CMDLINE``
|
||||
The unparsed command line arguments specified in the container configuration.
|
||||
Use of this is discouraged, in favour of passing arguments directly to the
|
||||
container init process via the ``initarg`` config element.
|
||||
|
||||
Filesystem mounts
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
In the absence of any explicit configuration, the container will inherit the
|
||||
host OS filesystem mounts. A number of mount points will be made read only, or
|
||||
re-mounted with new instances to provide container specific data. The following
|
||||
special mounts are setup by libvirt
|
||||
|
||||
- ``/dev`` a new "tmpfs" pre-populated with authorized device nodes
|
||||
- ``/dev/pts`` a new private "devpts" instance for console devices
|
||||
- ``/sys`` the host "sysfs" instance remounted read-only
|
||||
- ``/proc`` a new instance of the "proc" filesystem
|
||||
- ``/proc/sys`` the host "/proc/sys" bind-mounted read-only
|
||||
- ``/sys/fs/selinux`` the host "selinux" instance remounted read-only
|
||||
- ``/sys/fs/cgroup/NNNN`` the host cgroups controllers bind-mounted to only
|
||||
expose the sub-tree associated with the container
|
||||
- ``/proc/meminfo`` a FUSE backed file reflecting memory limits of the
|
||||
container
|
||||
|
||||
Device nodes
|
||||
~~~~~~~~~~~~
|
||||
|
||||
The container init process will be started with ``CAP_MKNOD`` capability removed
|
||||
and blocked from re-acquiring it. As such it will not be able to create any
|
||||
device nodes in ``/dev`` or anywhere else in its filesystems. Libvirt itself
|
||||
will take care of pre-populating the ``/dev`` filesystem with any devices that
|
||||
the container is authorized to use. The current devices that will be made
|
||||
available to all containers are
|
||||
|
||||
- ``/dev/zero``
|
||||
- ``/dev/null``
|
||||
- ``/dev/full``
|
||||
- ``/dev/random``
|
||||
- ``/dev/urandom``
|
||||
- ``/dev/stdin`` symlinked to ``/proc/self/fd/0``
|
||||
- ``/dev/stdout`` symlinked to ``/proc/self/fd/1``
|
||||
- ``/dev/stderr`` symlinked to ``/proc/self/fd/2``
|
||||
- ``/dev/fd`` symlinked to ``/proc/self/fd``
|
||||
- ``/dev/ptmx`` symlinked to ``/dev/pts/ptmx``
|
||||
- ``/dev/console`` symlinked to ``/dev/pts/0``
|
||||
|
||||
In addition, for every console defined in the guest configuration, a symlink
|
||||
will be created from ``/dev/ttyN`` symlinked to the corresponding ``/dev/pts/M``
|
||||
pseudo TTY device. The first console will be ``/dev/tty1``, with further
|
||||
consoles numbered incrementally from there.
|
||||
|
||||
Since /dev/ttyN and /dev/console are linked to the pts devices. The tty device
|
||||
of login program is pts device. The pam module securetty may prevent root user
|
||||
from logging in container. If you want root user to log in container
|
||||
successfully, add the pts device to the file /etc/securetty of container.
|
||||
|
||||
Further block or character devices will be made available to containers
|
||||
depending on their configuration.
|
||||
|
||||
Security considerations
|
||||
-----------------------
|
||||
|
||||
The libvirt LXC driver is fairly flexible in how it can be configured, and as
|
||||
such does not enforce a requirement for strict security separation between a
|
||||
container and the host. This allows it to be used in scenarios where only
|
||||
resource control capabilities are important, and resource sharing is desired.
|
||||
Applications wishing to ensure secure isolation between a container and the host
|
||||
must ensure that they are writing a suitable configuration.
|
||||
|
||||
Network isolation
|
||||
~~~~~~~~~~~~~~~~~
|
||||
|
||||
If the guest configuration does not list any network interfaces, the ``network``
|
||||
namespace will not be activated, and thus the container will see all the host's
|
||||
network interfaces. This will allow apps in the container to bind to/connect
|
||||
from TCP/UDP addresses and ports from the host OS. It also allows applications
|
||||
to access UNIX domain sockets associated with the host OS, which are in the
|
||||
abstract namespace. If access to UNIX domains sockets in the abstract namespace
|
||||
is not wanted, then applications should set the ``<privnet/>`` flag in the
|
||||
``<features>....</features>`` element.
|
||||
|
||||
Filesystem isolation
|
||||
~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
If the guest configuration does not list any filesystems, then the container
|
||||
will be set up with a root filesystem that matches the host's root filesystem.
|
||||
As noted earlier, only a few locations such as ``/dev``, ``/proc`` and ``/sys``
|
||||
will be altered. This means that, in the absence of restrictions from sVirt, a
|
||||
process running as user/group N:M inside the container will be able to access
|
||||
almost exactly the same files as a process running as user/group N:M in the
|
||||
host.
|
||||
|
||||
There are multiple options for restricting this. It is possible to simply map
|
||||
the existing root filesystem through to the container in read-only mode.
|
||||
Alternatively a completely separate root filesystem can be configured for the
|
||||
guest. In both cases, further sub-mounts can be applied to customize the content
|
||||
that is made visible. Note that in the absence of sVirt controls, it is still
|
||||
possible for the root user in a container to unmount any sub-mounts applied. The
|
||||
user namespace feature can also be used to restrict access to files based on the
|
||||
UID/GID mappings.
|
||||
|
||||
Sharing the host filesystem tree, also allows applications to access UNIX
|
||||
domains sockets associated with the host OS, which are in the filesystem
|
||||
namespaces. It should be noted that a number of init systems including at least
|
||||
``systemd`` and ``upstart`` have UNIX domain socket which are used to control
|
||||
their operation. Thus, if the directory/filesystem holding their UNIX domain
|
||||
socket is exposed to the container, it will be possible for a user in the
|
||||
container to invoke operations on the init service in the same way it could if
|
||||
outside the container. This also applies to other applications in the host which
|
||||
use UNIX domain sockets in the filesystem, such as DBus, Libvirtd, and many
|
||||
more. If this is not desired, then applications should either specify the
|
||||
UID/GID mapping in the configuration to enable user namespaces and thus block
|
||||
access to the UNIX domain socket based on permissions, or should ensure the
|
||||
relevant directories have a bind mount to hide them. This is particularly
|
||||
important for the ``/run`` or ``/var/run`` directories.
|
||||
|
||||
User and group isolation
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
If the guest configuration does not list any ID mapping, then the user and group
|
||||
IDs used inside the container will match those used outside the container. In
|
||||
addition, the capabilities associated with a process in the container will infer
|
||||
the same privileges they would for a process in the host. This has obvious
|
||||
implications for security, since a root user inside the container will be able
|
||||
to access any file owned by root that is visible to the container, and perform
|
||||
more or less any privileged kernel operation. In the absence of additional
|
||||
protection from sVirt, this means that the root user inside a container is
|
||||
effectively as powerful as the root user in the host. There is no security
|
||||
isolation of the root user.
|
||||
|
||||
The ID mapping facility was introduced to allow for stricter control over the
|
||||
privileges of users inside the container. It allows apps to define rules such as
|
||||
"user ID 0 in the container maps to user ID 1000 in the host". In addition the
|
||||
privileges associated with capabilities are somewhat reduced so that they cannot
|
||||
be used to escape from the container environment. A full description of user
|
||||
namespaces is outside the scope of this document, however LWN has `a good
|
||||
write-up on the topic <https://lwn.net/Articles/532593/>`__. From the libvirt
|
||||
point of view, the key thing to remember is that defining an ID mapping for
|
||||
users and groups in the container XML configuration causes libvirt to activate
|
||||
the user namespace feature.
|
||||
|
||||
Location of configuration files
|
||||
-------------------------------
|
||||
|
||||
The LXC driver comes with sane default values. However, during its
|
||||
initialization it reads a configuration file which offers system administrator
|
||||
to override some of that default. The file is located under
|
||||
``/etc/libvirt/lxc.conf``
|
||||
|
||||
Systemd Socket Activation Integration
|
||||
-------------------------------------
|
||||
|
||||
The libvirt LXC driver provides the ability to pass across pre-opened file
|
||||
descriptors when starting LXC guests. This allows for libvirt LXC to support
|
||||
systemd's `socket activation
|
||||
capability <http://0pointer.de/blog/projects/socket-activated-containers.html>`__,
|
||||
where an incoming client connection in the host OS will trigger the startup of a
|
||||
container, which runs another copy of systemd which gets passed the server
|
||||
socket, and then activates the actual service handler in the container.
|
||||
|
||||
Let us assume that you already have a LXC guest created, running a systemd
|
||||
instance as PID 1 inside the container, which has an SSHD service configured.
|
||||
The goal is to automatically activate the container when the first SSH
|
||||
connection is made. The first step is to create a couple of unit files for the
|
||||
host OS systemd instance. The ``/etc/systemd/system/mycontainer.service`` unit
|
||||
file specifies how systemd will start the libvirt LXC container
|
||||
|
||||
::
|
||||
|
||||
[Unit]
|
||||
Description=My little container
|
||||
|
||||
[Service]
|
||||
ExecStart=/usr/bin/virsh -c lxc:///system start --pass-fds 3 mycontainer
|
||||
ExecStop=/usr/bin/virsh -c lxc:///system destroy mycontainer
|
||||
Type=oneshot
|
||||
RemainAfterExit=yes
|
||||
KillMode=none
|
||||
|
||||
The ``--pass-fds 3`` argument specifies that the file descriptor number 3 that
|
||||
``virsh`` inherits from systemd, is to be passed into the container. Since
|
||||
``virsh`` will exit immediately after starting the container, the
|
||||
``RemainAfterExit`` and ``KillMode`` settings must be altered from their
|
||||
defaults.
|
||||
|
||||
Next, the ``/etc/systemd/system/mycontainer.socket`` unit file is created to get
|
||||
the host systemd to listen on port 23 for TCP connections. When this unit file
|
||||
is activated by the first incoming connection, it will cause the
|
||||
``mycontainer.service`` unit to be activated with the FD corresponding to the
|
||||
listening TCP socket passed in as FD 3.
|
||||
|
||||
::
|
||||
|
||||
[Unit]
|
||||
Description=The SSH socket of my little container
|
||||
|
||||
[Socket]
|
||||
ListenStream=23
|
||||
|
||||
Port 23 was picked here so that the container doesn't conflict with the host's
|
||||
SSH which is on the normal port 22. That's it in terms of host side
|
||||
configuration.
|
||||
|
||||
Inside the container, the ``/etc/systemd/system/sshd.socket`` unit file must be
|
||||
created
|
||||
|
||||
::
|
||||
|
||||
[Unit]
|
||||
Description=SSH Socket for Per-Connection Servers
|
||||
|
||||
[Socket]
|
||||
ListenStream=23
|
||||
Accept=yes
|
||||
|
||||
The ``ListenStream`` value listed in this unit file, must match the value used
|
||||
in the host file. When systemd in the container receives the pre-opened FD from
|
||||
libvirt during container startup, it looks at the ``ListenStream`` values to
|
||||
figure out which FD to give to which service. The actual service to start is
|
||||
defined by a correspondingly named ``/etc/systemd/system/sshd@.service``
|
||||
|
||||
::
|
||||
|
||||
[Unit]
|
||||
Description=SSH Per-Connection Server for %I
|
||||
|
||||
[Service]
|
||||
ExecStart=-/usr/sbin/sshd -i
|
||||
StandardInput=socket
|
||||
|
||||
Finally, make sure this SSH service is set to start on boot of the container, by
|
||||
running the following command inside the container:
|
||||
|
||||
::
|
||||
|
||||
# mkdir -p /etc/systemd/system/sockets.target.wants/
|
||||
# ln -s /etc/systemd/system/sshd.socket /etc/systemd/system/sockets.target.wants/
|
||||
|
||||
This example shows how to activate the container based on an incoming SSH
|
||||
connection. If the container was also configured to have an httpd service, it
|
||||
may be desirable to activate it upon either an httpd or a sshd connection
|
||||
attempt. In this case, the ``mycontainer.socket`` file in the host would simply
|
||||
list multiple socket ports. Inside the container a separate ``xxxxx.socket``
|
||||
file would need to be created for each service, with a corresponding
|
||||
``ListenStream`` value set.
|
||||
|
||||
Container security
|
||||
------------------
|
||||
|
||||
sVirt SELinux
|
||||
~~~~~~~~~~~~~
|
||||
|
||||
In the absence of the "user" namespace being used, containers cannot be
|
||||
considered secure against exploits of the host OS. The sVirt SELinux driver
|
||||
provides a way to secure containers even when the "user" namespace is not used.
|
||||
The cost is that writing a policy to allow execution of arbitrary OS is not
|
||||
practical. The SELinux sVirt policy is typically tailored to work with a simpler
|
||||
application confinement use case, as provided by the "libvirt-sandbox" project.
|
||||
|
||||
Auditing
|
||||
~~~~~~~~
|
||||
|
||||
The LXC driver is integrated with libvirt's auditing subsystem, which causes
|
||||
audit messages to be logged whenever there is an operation performed against a
|
||||
container which has impact on host resources. So for example, start/stop, device
|
||||
hotplug will all log audit messages providing details about what action occurred
|
||||
and any resources associated with it. There are the following 3 types of audit
|
||||
messages
|
||||
|
||||
- ``VIRT_MACHINE_ID`` - details of the SELinux process and image security
|
||||
labels assigned to the container.
|
||||
- ``VIRT_CONTROL`` - details of an action / operation performed against a
|
||||
container. There are the following types of operation
|
||||
|
||||
- ``op=start`` - a container has been started. Provides the machine name,
|
||||
uuid and PID of the ``libvirt_lxc`` controller process
|
||||
- ``op=init`` - the init PID of the container has been started. Provides the
|
||||
machine name, uuid and PID of the ``libvirt_lxc`` controller process and
|
||||
PID of the init process (in the host PID namespace)
|
||||
- ``op=stop`` - a container has been stopped. Provides the machine name,
|
||||
uuid
|
||||
|
||||
- ``VIRT_RESOURCE`` - details of a host resource associated with a container
|
||||
action.
|
||||
|
||||
Device access
|
||||
~~~~~~~~~~~~~
|
||||
|
||||
All containers are launched with the CAP_MKNOD capability cleared and removed
|
||||
from the bounding set. Libvirt will ensure that the /dev filesystem is
|
||||
pre-populated with all devices that a container is allowed to use. In addition,
|
||||
the cgroup "device" controller is configured to block read/write/mknod from all
|
||||
devices except those that a container is authorized to use.
|
||||
|
||||
Example configurations
|
||||
----------------------
|
||||
|
||||
Example config version 1
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
::
|
||||
|
||||
<domain type='lxc'>
|
||||
<name>vm1</name>
|
||||
<memory>500000</memory>
|
||||
<os>
|
||||
<type>exe</type>
|
||||
<init>/bin/sh</init>
|
||||
</os>
|
||||
<vcpu>1</vcpu>
|
||||
<clock offset='utc'/>
|
||||
<on_poweroff>destroy</on_poweroff>
|
||||
<on_reboot>restart</on_reboot>
|
||||
<on_crash>destroy</on_crash>
|
||||
<devices>
|
||||
<emulator>/usr/libexec/libvirt_lxc</emulator>
|
||||
<interface type='network'>
|
||||
<source network='default'/>
|
||||
</interface>
|
||||
<console type='pty' />
|
||||
</devices>
|
||||
</domain>
|
||||
|
||||
In the <emulator> element, be sure you specify the correct path to libvirt_lxc,
|
||||
if it does not live in /usr/libexec on your system.
|
||||
|
||||
The next example assumes there is a private root filesystem (perhaps
|
||||
hand-crafted using busybox, or installed from media, debootstrap, whatever)
|
||||
under /opt/vm-1-root:
|
||||
|
||||
::
|
||||
|
||||
<domain type='lxc'>
|
||||
<name>vm1</name>
|
||||
<memory>32768</memory>
|
||||
<os>
|
||||
<type>exe</type>
|
||||
<init>/init</init>
|
||||
</os>
|
||||
<vcpu>1</vcpu>
|
||||
<clock offset='utc'/>
|
||||
<on_poweroff>destroy</on_poweroff>
|
||||
<on_reboot>restart</on_reboot>
|
||||
<on_crash>destroy</on_crash>
|
||||
<devices>
|
||||
<emulator>/usr/libexec/libvirt_lxc</emulator>
|
||||
<filesystem type='mount'>
|
||||
<source dir='/opt/vm-1-root'/>
|
||||
<target dir='/'/>
|
||||
</filesystem>
|
||||
<interface type='network'>
|
||||
<source network='default'/>
|
||||
</interface>
|
||||
<console type='pty' />
|
||||
</devices>
|
||||
</domain>
|
||||
|
||||
Altering the available capabilities
|
||||
-----------------------------------
|
||||
|
||||
By default the libvirt LXC driver drops some capabilities among which CAP_MKNOD.
|
||||
However :since:`since 1.2.6` libvirt can be told to keep or drop some
|
||||
capabilities using a domain configuration like the following:
|
||||
|
||||
::
|
||||
|
||||
...
|
||||
<features>
|
||||
<capabilities policy='default'>
|
||||
<mknod state='on'/>
|
||||
<sys_chroot state='off'/>
|
||||
</capabilities>
|
||||
</features>
|
||||
...
|
||||
|
||||
The capabilities children elements are named after the capabilities as defined
|
||||
in ``man 7 capabilities``. An ``off`` state tells libvirt to drop the
|
||||
capability, while an ``on`` state will force to keep the capability even though
|
||||
this one is dropped by default.
|
||||
|
||||
The ``policy`` attribute can be one of ``default``, ``allow`` or ``deny``. It
|
||||
defines the default rules for capabilities: either keep the default behavior
|
||||
that is dropping a few selected capabilities, or keep all capabilities or drop
|
||||
all capabilities. The interest of ``allow`` and ``deny`` is that they guarantee
|
||||
that all capabilities will be kept (or removed) even if new ones are added
|
||||
later.
|
||||
|
||||
The following example, drops all capabilities but CAP_MKNOD:
|
||||
|
||||
::
|
||||
|
||||
...
|
||||
<features>
|
||||
<capabilities policy='deny'>
|
||||
<mknod state='on'/>
|
||||
</capabilities>
|
||||
</features>
|
||||
...
|
||||
|
||||
Note that allowing capabilities that are normally dropped by default can
|
||||
seriously affect the security of the container and the host.
|
||||
|
||||
Inherit namespaces
|
||||
------------------
|
||||
|
||||
Libvirt allows you to inherit the namespace from container/process just like lxc
|
||||
tools or docker provides to share the network namespace. The following can be
|
||||
used to share required namespaces. If we want to share only one then the other
|
||||
namespaces can be ignored. The netns option is specific to sharenet. It can be
|
||||
used in cases we want to use existing network namespace rather than creating new
|
||||
network namespace for the container. In this case privnet option will be
|
||||
ignored.
|
||||
|
||||
::
|
||||
|
||||
<domain type='lxc' xmlns:lxc='http://libvirt.org/schemas/domain/lxc/1.0'>
|
||||
...
|
||||
<lxc:namespace>
|
||||
<lxc:sharenet type='netns' value='red'/>
|
||||
<lxc:shareuts type='name' value='container1'/>
|
||||
<lxc:shareipc type='pid' value='12345'/>
|
||||
</lxc:namespace>
|
||||
</domain>
|
||||
|
||||
The use of namespace passthrough requires libvirt >= 1.2.19
|
||||
|
||||
Container usage / management
|
||||
----------------------------
|
||||
|
||||
As with any libvirt virtualization driver, LXC containers can be managed via a
|
||||
wide variety of libvirt based tools. At the lowest level the ``virsh`` command
|
||||
can be used to perform many tasks, by passing the ``-c lxc:///system`` argument.
|
||||
As an alternative to repeating the URI with every command, the
|
||||
``LIBVIRT_DEFAULT_URI`` environment variable can be set to ``lxc:///system``.
|
||||
The examples that follow outline some common operations with virsh and LXC. For
|
||||
further details about usage of virsh consult its manual page.
|
||||
|
||||
Defining (saving) container configuration
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The ``virsh define`` command takes an XML configuration document and loads it
|
||||
into libvirt, saving the configuration on disk
|
||||
|
||||
::
|
||||
|
||||
# virsh -c lxc:///system define myguest.xml
|
||||
|
||||
Viewing container configuration
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The ``virsh dumpxml`` command can be used to view the current XML configuration
|
||||
of a container. By default the XML output reflects the current state of the
|
||||
container. If the container is running, it is possible to explicitly request the
|
||||
persistent configuration, instead of the current live configuration using the
|
||||
``--inactive`` flag
|
||||
|
||||
::
|
||||
|
||||
# virsh -c lxc:///system dumpxml myguest
|
||||
|
||||
Starting containers
|
||||
~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The ``virsh start`` command can be used to start a container from a previously
|
||||
defined persistent configuration
|
||||
|
||||
::
|
||||
|
||||
# virsh -c lxc:///system start myguest
|
||||
|
||||
It is also possible to start so called "transient" containers, which do not
|
||||
require a persistent configuration to be saved by libvirt, using the
|
||||
``virsh create`` command.
|
||||
|
||||
::
|
||||
|
||||
# virsh -c lxc:///system create myguest.xml
|
||||
|
||||
Stopping containers
|
||||
~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The ``virsh shutdown`` command can be used to request a graceful shutdown of the
|
||||
container. By default this command will first attempt to send a message to the
|
||||
init process via the ``/dev/initctl`` device node. If no such device node
|
||||
exists, then it will send SIGTERM to PID 1 inside the container.
|
||||
|
||||
::
|
||||
|
||||
# virsh -c lxc:///system shutdown myguest
|
||||
|
||||
If the container does not respond to the graceful shutdown request, it can be
|
||||
forcibly stopped using the ``virsh destroy``
|
||||
|
||||
::
|
||||
|
||||
# virsh -c lxc:///system destroy myguest
|
||||
|
||||
Rebooting a container
|
||||
~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The ``virsh reboot`` command can be used to request a graceful shutdown of the
|
||||
container. By default this command will first attempt to send a message to the
|
||||
init process via the ``/dev/initctl`` device node. If no such device node
|
||||
exists, then it will send SIGHUP to PID 1 inside the container.
|
||||
|
||||
::
|
||||
|
||||
# virsh -c lxc:///system reboot myguest
|
||||
|
||||
Undefining (deleting) a container configuration
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The ``virsh undefine`` command can be used to delete the persistent
|
||||
configuration of a container. If the guest is currently running, this will turn
|
||||
it into a "transient" guest.
|
||||
|
||||
::
|
||||
|
||||
# virsh -c lxc:///system undefine myguest
|
||||
|
||||
Connecting to a container console
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The ``virsh console`` command can be used to connect to the text console
|
||||
associated with a container.
|
||||
|
||||
::
|
||||
|
||||
# virsh -c lxc:///system console myguest
|
||||
|
||||
If the container has been configured with multiple console devices, then the
|
||||
``--devname`` argument can be used to choose the console to connect to. In LXC,
|
||||
multiple consoles will be named as 'console0', 'console1', 'console2', etc.
|
||||
|
||||
::
|
||||
|
||||
# virsh -c lxc:///system console myguest --devname console1
|
||||
|
||||
Running commands in a container
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The ``virsh lxc-enter-namespace`` command can be used to enter the namespaces
|
||||
and security context of a container and then execute an arbitrary command.
|
||||
|
||||
::
|
||||
|
||||
# virsh -c lxc:///system lxc-enter-namespace myguest -- /bin/ls -al /dev
|
||||
|
||||
Monitoring container utilization
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The ``virt-top`` command can be used to monitor the activity and resource
|
||||
utilization of all containers on a host
|
||||
|
||||
::
|
||||
|
||||
# virt-top -c lxc:///system
|
||||
|
||||
Converting LXC container configuration
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
The ``virsh domxml-from-native`` command can be used to convert most of the LXC
|
||||
container configuration into a domain XML fragment
|
||||
|
||||
::
|
||||
|
||||
# virsh -c lxc:///system domxml-from-native lxc-tools /var/lib/lxc/myguest/config
|
||||
|
||||
This conversion has some limitations due to the fact that the domxml-from-native
|
||||
command output has to be independent of the host. Here are a few things to take
|
||||
care of before converting:
|
||||
|
||||
- Replace the fstab file referenced by lxc.mount by the corresponding
|
||||
lxc.mount.entry lines.
|
||||
- Replace all relative sizes of tmpfs mount entries to absolute sizes. Also
|
||||
make sure that tmpfs entries all have a size option (default is 50%).
|
||||
- Define lxc.cgroup.memory.limit_in_bytes to properly limit the memory
|
||||
available to the container. The conversion will use 64MiB as the default.
|
@ -22,7 +22,6 @@ docs_html_in_files = [
|
||||
'csharp',
|
||||
'dbus',
|
||||
'docs',
|
||||
'drvlxc',
|
||||
'drvnodedev',
|
||||
'drvopenvz',
|
||||
'drvsecret',
|
||||
@ -80,6 +79,7 @@ docs_rst_files = [
|
||||
'drvch',
|
||||
'drvesx',
|
||||
'drvhyperv',
|
||||
'drvlxc',
|
||||
'drvqemu',
|
||||
'errors',
|
||||
'formatbackup',
|
||||
|
Loading…
Reference in New Issue
Block a user