Currently you can configure LXC to bind a host directory to
a guest directory, but not to bind a guest directory to a
guest directory. While the guest container init could do
this itself, allowing it in the libvirt XML means a stricter
SELinux policy can be written
Introduce a new syntax for filesystems to allow use of a RAM
filesystem
<filesystem type='ram'>
<source usage='10' units='MiB'/>
<target dir='/mnt'/>
</filesystem>
The usage units default to KiB to limit consumption of host memory.
* docs/formatdomain.html.in: Document new syntax
* docs/schemas/domaincommon.rng: Add new attributes
* src/conf/domain_conf.c: Parsing/formatting of RAM filesystems
* src/lxc/lxc_container.c: Mounting of RAM filesystems
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
when lxcContainerIdentifyCGroups failed, the memory it allocated
has been freed, so we should not free this memory again in
lxcContainerSetupPivortRoot and lxcContainerSetupExtraMounts.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
when libvirt_lxc trigger oom error in lxcContainerGetSubtree
we should free the alloced memory for mounts.
so when lxcContainerGetSubtree failed,we should do some
memory cleanup in lxcContainerUnmountSubtree.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
we alloc the memory for format in lxcContainerMountDetectFilesystem
but without free it in lxcContainerMountFSBlockHelper.
this patch just call VIR_FREE to free it.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
This reverts
commit c16b4c43fc
Author: Daniel P. Berrange <berrange@redhat.com>
Date: Fri May 11 15:09:27 2012 +0100
Avoid LXC pivot root in the root source is still /
This commit broke setup of /dev, because the code which
deals with setting up a private /dev and /dev/pts only
works if you do a pivotroot.
The original intent of avoiding the pivot root was to
try and ensure the new root has a minimumal mount
tree. The better way todo this is to just unmount the
bits we don't want (ie old /proc & /sys subtrees.
So apply the logic from
commit c529b47a75
Author: Daniel P. Berrange <berrange@redhat.com>
Date: Fri May 11 11:35:28 2012 +0100
Trim /proc & /sys subtrees before mounting new instances
to the pivot_root codepath as well
when do remount,the source and target should be the same
values specified in the initial mount() call.
So change fs->dst to src.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Normal practice is for cgroups controllers to be mounted at
/sys/fs/cgroup. When setting up a container, /sys is mounted
with a new sysfs instance, thus we must re-mount all the
cgroups controllers. The complexity is that we must mount
them in the same layout as the host OS. ie if 'cpu' and 'cpuacct'
were mounted at the same location in the host we must preserve
this in the container. Also if any controllers are co-located
we must setup symlinks from the individual controller name to
the co-located mount-point
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Both /proc and /sys may have sub-mounts in them from the host
OS. We must explicitly unmount them all before mounting the
new instance over that location. If we don't then /proc/mounts
will show the sub-mounts as existing, even though nothing will
be able to access them, due to the over-mount.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
If the LXC config has a filesystem
<filesystem>
<source dir='/'/>
<target dir='/'/>
</filesystem>
then there is no need to go down the pivot root codepath.
We can simply use the existing root as needed.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Currently to make sysfs readonly, we remount the existing
instance and then bind it readonly. Unfortunately this means
sysfs is still showing device objects wrt the host OS namespace.
We need it to reflect the container namespace, so we must mount
a completely new instance of it. Do the same for selinuxfs since
there is no benefit to bind mounting & this lets us simplify
the code.
* src/lxc/lxc_container.c: Mount fresh sysfs instance
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Instead of hardcoding use of SELinux contexts in the LXC driver,
switch over to using the official security driver API.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Once lxcContainerSetStdio is invoked, logging will not work as
expected in libvirt_lxc. So make sure this is the last thing to
be called, in particular after setting the security process label
The code is splattered with a mix of
sizeof foo
sizeof (foo)
sizeof(foo)
Standardize on sizeof(foo) and add a syntax check rule to
enforce it
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Pass argv to the init binary of LXC, using a new <initarg> element.
* docs/formatdomain.html.in: Document <os> usage for containers
* docs/schemas/domaincommon.rng: Add <initarg> element
* src/conf/domain_conf.c, src/conf/domain_conf.h: parsing and
formatting of <initarg>
* src/lxc/lxc_container.c: Setup LXC argv
* tests/Makefile.am, tests/lxcxml2xmldata/lxc-systemd.xml,
tests/lxcxml2xmltest.c, tests/testutilslxc.c,
tests/testutilslxc.h: Test parsing/formatting of LXC related
XML parts
The SELinux mount point moved from /selinux to /sys/fs/selinux
when systemd came along.
* configure.ac: Probe for SELinux mount point
* src/lxc/lxc_container.c: Use SELinux mount point determined
by configure.ac
If no <interface> elements are included in an LXC guest XML
description, then the LXC guest will just see the host's
network interfaces. It is desirable to be able to hide the
host interfaces, without having to define any guest interfaces.
This patch introduces a new feature flag <privnet/> to allow
forcing of a private network namespace for LXC. In the future
I also anticipate that we will add <privuser/> to force a
private user ID namespace.
* src/conf/domain_conf.c, src/conf/domain_conf.h: Add support
for <privnet/> feature. Auto-set <privnet> if any <interface>
devices are defined
* src/lxc/lxc_container.c: Honour request for private network
namespace
This patch fixes the access of variable "con" in two files where the
variable was declared only on SELinux builds and thus the build failed
without SELinux. It's a rather nasty fix but helps fix the build
quickly and without any major changes to the code.
To allow the container to access /dev and /dev/pts when under
sVirt, set an explicit mount option. Also set a max size on
the /dev mount to prevent DOS on memory usage
* src/lxc/lxc_container.c: Set /dev mount context
* src/lxc/lxc_controller.c: Set /dev/pts mount context
For the sake of backwards compat, LXC guests are *not*
confined by default. This is because it is not practical
to dynamically relabel containers using large filesystem
trees. Applications can create confined containers though,
by giving suitable XML configs
* src/Makefile.am: Link libvirt_lxc to security drivers
* src/lxc/libvirtd_lxc.aug, src/lxc/lxc_conf.h,
src/lxc/lxc_conf.c, src/lxc/lxc.conf,
src/lxc/test_libvirtd_lxc.aug: Config file handling for
security driver
* src/lxc/lxc_driver.c: Wire up security driver functions
* src/lxc/lxc_controller.c: Add a '--security' flag to
specify which security driver to activate
* src/lxc/lxc_container.c, src/lxc/lxc_container.h: Set
the process label just before exec'ing init.
Systemd detects containers based on whether they have
an environment variable starting with 'container=lxc';
using a longer name fits the expectations, while also
allowing detection of who created the container.
Requested by Lennart Poettering, in response to
https://bugs.freedesktop.org/show_bug.cgi?id=45175
* src/lxc/lxc_container.c (lxcContainerBuildInitCmd): Add another
env-var.
The current setup code for LXC is bind mounting /dev/pts/ptmx
on top of a character device /dev/ptmx. This is denied by SELinux
policy and is just wrong. The target of a bind mount should just
be a plain file
* src/lxc/lxc_container.c: Don't bind /dev/pts/ptmx onto
a char device
Given an LXC guest with a root filesystem path of
/export/lxc/roots/helloworld/root
During startup, we will pivot the root filesystem to end up
at
/.oldroot/export/lxc/roots/helloworld/root
We then try to open
/.oldroot/export/lxc/roots/helloworld/root/dev/pts
Now consider if '/export/lxc' is an absolute symlink pointing
to '/media/lxc'. The kernel will try to open
/media/lxc/roots/helloworld/root/dev/pts
whereas it should be trying to open
/.oldroot//media/lxc/roots/helloworld/root/dev/pts
To deal with the fact that the root filesystem can be moved,
we need to resolve symlinks in *any* part of the filesystem
source path.
* src/libvirt_private.syms, src/util/util.c,
src/util/util.h: Add virFileResolveAllLinks to resolve
all symlinks in a path
* src/lxc/lxc_container.c: Resolve all symlinks in filesystem
paths during startup
Move the virNetDevSetName and virNetDevSetNamespace APIs out
of LXC's veth.c and into virnetdev.c.
Move the remaining content of the file to src/util/virnetdevveth.c
* src/lxc/veth.c: Rename to src/util/virnetdevveth.c
* src/lxc/veth.h: Rename to src/util/virnetdevveth.h
* src/util/virnetdev.c, src/util/virnetdev.h: Add
virNetDevSetName and virNetDevSetNamespace
* src/lxc/lxc_container.c, src/lxc/lxc_controller.c,
src/lxc/lxc_driver.c: Update include paths
The src/lxc/veth.c file contains APIs for managing veth devices,
but some of the APIs duplicate stuff from src/util/virnetdev.h.
Delete thed duplicate APIs and rename the remaining ones to
follow virNetDevVethXXXX
* src/lxc/veth.c, src/lxc/veth.h: Rename APIs & delete duplicates
* src/lxc/lxc_container.c, src/lxc/lxc_controller.c,
src/lxc/lxc_driver.c: Update for API renaming
Currently the LXC controller only supports setup of a single
text console. This is wired up to the container init's stdio,
as well as /dev/console and /dev/tty1. Extending support for
multiple consoles, means wiring up additional PTYs to /dev/tty2,
/dev/tty3, etc, etc. The LXC controller is passed multiple open
file handles, one for each console requested.
* src/lxc/lxc_container.c, src/lxc/lxc_container.h: Wire up
all the /dev/ttyN links required to symlink to /dev/pts/NN
* src/lxc/lxc_container.h: Open more container side /dev/pts/NN
devices, and adapt event loop to handle I/O from all consoles
* src/lxc/lxc_driver.c: Setup multiple host side PTYs
The LXC code for mounting container filesystems from block devices
tries all filesystems in /etc/filesystems and possibly those in
/proc/filesystems. The regular mount binary, however, first tries
using libblkid to detect the format. Add support for doing the same
in libvirt, since Fedora's /etc/filesystems is missing many formats,
most notably ext4 which is the default filesystem Fedora uses!
* src/Makefile.am: Link libvirt_lxc to libblkid
* src/lxc/lxc_container.c: Probe filesystem format with libblkid
If we looped through /etc/filesystems trying to mount with each
type and failed all options, we forget to actually raise an
error message.
* src/lxc/lxc_container.c: Raise error if unable to detect
the filesystems. Also fix existing error message
The kernel automounter is mostly broken wrt to containers. Most
notably if you start a new filesystem namespace and then attempt
to unmount any autofs filesystem, it will typically fail with a
weird error message like
Failed to unmount '/.oldroot/sys/kernel/security':Too many levels of symbolic links
Attempting to detach the autofs mount using umount2(MNT_DETACH)
will also fail with the same error. Therefore if we get any error on
unmount()ing a filesystem from the old root FS when starting a
container, we must immediately break out and detach the entire
old root filesystem (ignoring any mounts below it).
This has the effect of making the old root filesystem inaccessible
to anything inside the container, but at the cost that the mounts
live on in the kernel until the container exits. Given that SystemD
uses autofs by default, we need LXC to be robust this scenario and
thus this tradeoff is worthwhile.
* src/lxc/lxc_container.c: Detach root filesystem if any umount
operation fails.
The /etc/filesystems file can contain a '*' on the last line to
indicate that /proc/filessystems should be tried next. We have
a check that this '*' only occurs on the last line. Unfortunately
when we then start reading /proc/filesystems, we mistakenly think
we've seen '*' in /proc/filesystems and fail
* src/lxc/lxc_container.c: Skip '*' validation when we're reading
/proc/filesystems
Only some of the return paths of lxcContainerWaitForContinue will
have set errno. In other paths we need to set it manually to avoid
the caller getting a random stale errno value
* src/lxc/lxc_container.c: Set errno in lxcContainerWaitForContinue
Based on a report by Coverity. waitpid() can leak resources if it
fails with EINTR, so it should never be used without checking return
status. But we already have a helper function that does that, so
use it in more places.
* src/lxc/lxc_container.c (lxcContainerAvailable): Use safer
virWaitPid.
* daemon/libvirtd.c (daemonForkIntoBackground): Likewise.
* tests/testutils.c (virtTestCaptureProgramOutput, virtTestMain):
Likewise.
* src/libvirt.c (virConnectAuthGainPolkit): Simplify with virCommand.
When booting a virtual machine with a kernel/initrd it is possible
to pass command line arguments using the <cmdline>...args...</cmdline>
element in the guest XML. These appear to the kernel / init process
in /proc/cmdline.
When booting a container we do not have a custom /proc/cmdline,
but we can easily set an environment variable for it. Ideally
we could pass individual arguments to the init process as a
regular set of 'char *argv[]' parameters, but that would involve
libvirt parsing the <cmdline> XML text. This can easily be added
later, even if we add the env variable now
* docs/drvlxc.html.in: Document env variables passed to LXC
* src/conf/domain_conf.c: Add <cmdline> to be parsed for
guests of type='exe'
* src/lxc/lxc_container.c: Set LIBVIRT_LXC_CMDLINE env var
Hi,
I'm seeing an issue with udev and libvirt-lxc. Libvirt-lxc creates
/dev/ptmx as a symlink to /dev/pts/ptmx. When udev starts up, it
checks the device type, sees ptmx is 'not right', and replaces it
with a 'proper' ptmx.
In lxc, /dev/ptmx is bind-mounted from /dev/pts/ptmx instead of being
symlinked, so udev sees the right device type and leaves it alone.
A patch like the following seems to work for me. Would there be
any objections to this?
>From 4c5035de52de7e06a0de9c5d0bab8c87a806cba7 Mon Sep 17 00:00:00 2001
From: Ubuntu <ubuntu@domU-12-31-39-14-F0-B3.compute-1.internal>
Date: Wed, 31 Aug 2011 18:15:54 +0000
Subject: [PATCH 1/1] make ptmx a bind mount rather than symlink
udev on some systems checks the device type of /dev/ptmx, and replaces it if
not as expected. The symlink created by libvirt-lxc therefore gets replaced.
By creating it as a bind mount, the device type is correct and udev leaves it
alone.
Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
A previous commit gave the LXC driver the ability to mount
block devices for the container filesystem. Through use of
the loopback device functionality, we can build on this to
support use of plain file images for LXC filesytems.
By setting the LO_FLAGS_AUTOCLEAR flag we can ensure that
the loop device automatically disappears when the container
dies / shuts down
* src/lxc/lxc_container.c: Raise error if we see a file
based filesystem, since it should have been turned into
a loopback device already
* src/lxc/lxc_controller.c: Rewrite any filesystems of
type=file, into type=block, by binding the file image
to a free loop device
Currently the LXC driver can only populate filesystems from
host filesystems, using bind mounts. This patch allows host
block devices to be mounted. It autodetects the filesystem
format at mount time, and adds the block device to the cgroups
ACL. Example usage is
<filesystem type='block' accessmode='passthrough'>
<source dev='/dev/sda1'/>
<target dir='/home'/>
</filesystem>
* src/lxc/lxc_container.c: Mount block device filesystems
* src/lxc/lxc_controller.c: Add block device filesystems
to cgroups ACL
An application container shouldn't get a private /dev. Fix
the regression from 6d37888e6a
* src/lxc/lxc_container.c: Don't mount /dev for app containers