Right now we mount selinuxfs even user namespace is enabled and
ignore the error. But we shouldn't ignore these errors when user
namespace is not enabled.
This patch skips mounting selinuxfs when user namespace enabled.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
If the guest is configured with
<filesystem type='mount'>
<source dir='/'/>
<target dir='/'/>
<readonly/>
</filesystem>
Then any submounts under / should also end up readonly, except
for those setup as basic mounts. eg if the user has /home on a
separate volume, they'd expect /home to be readonly, but we
should not touch the /sys, /proc, etc dirs we setup ourselves.
Users can selectively make sub-mounts read-write again by
simply listing them as new mounts without the <readonly>
flag set
<filesystem type='mount'>
<source dir='/home'/>
<target dir='/home'/>
</filesystem>
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Move the array of basic mounts out of the lxcContainerMountBasicFS
function, to a global variable. This is to allow it to be referenced
by other methods wanting to know what the basic mount paths are.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
The devpts, dev and fuse filesystems are mounted temporarily.
there is no need to export them to container if container shares
the root directory with host.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Right now, securityfs is disallowed to be mounted in non-initial
user namespace, so we must avoid trying to mount securityfs in
a container which has user namespace enabled.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
If booting a container with a root FS that isn't the host's
root, we must ensure that the /dev mount point exists.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
The lxcContainerMountFSBlockAuto method can be used to mount the
initial root filesystem, so it cannot assume a prefix of /.oldroot.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
If securityfs is available on the host, we should ensure to
mount it read-only in the container. This will avoid systemd
trying to mount it during startup causing SELinux AVCs.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
A couple of places in LXC setup for filesystems did not do
a "goto cleanup" after reporting errors. While fixing this,
also add in many more debug statements to aid troubleshooting
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Wire up the new virDomainCreate{XML}WithFiles methods in the
LXC driver, so that FDs get passed down to the init process.
The lxc_container code needs to do a little dance in order
to renumber the file descriptors it receives into linear
order, starting from STDERR_FILENO + 1.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Commit 75c1256 states that virGetGroupList must not be called
between fork and exec, then commit ee777e99 promptly violated
that for lxc.
Patch originally posted by Eric Blake <eblake@redhat.com>.
Otherwise the container will fail to start if we
enable user namespace, since there is no rights to
do mknod in uninit user namespace.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
lxc driver will use this function to change the owner
of hot added devices.
Move virLXCControllerChown to lxc_container.c and Rename
it to lxcContainerChown.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
When failing to start a container due to inaccessible root
filesystem path, we did not log any meaningful error. Add a
few debug statements to assist diagnosis
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
https://bugzilla.redhat.com/show_bug.cgi?id=964358
POSIX states that multi-threaded apps should not use functions
that are not async-signal-safe between fork and exec, yet we
were using getpwuid_r and initgroups. Although rare, it is
possible to hit deadlock in the child, when it tries to grab
a mutex that was already held by another thread in the parent.
I actually hit this deadlock when testing multiple domains
being started in parallel with a command hook, with the following
backtrace in the child:
Thread 1 (Thread 0x7fd56bbf2700 (LWP 3212)):
#0 __lll_lock_wait ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1 0x00007fd5761e7388 in _L_lock_854 () from /lib64/libpthread.so.0
#2 0x00007fd5761e7257 in __pthread_mutex_lock (mutex=0x7fd56be00360)
at pthread_mutex_lock.c:61
#3 0x00007fd56bbf9fc5 in _nss_files_getpwuid_r (uid=0, result=0x7fd56bbf0c70,
buffer=0x7fd55c2a65f0 "", buflen=1024, errnop=0x7fd56bbf25b8)
at nss_files/files-pwd.c:40
#4 0x00007fd575aeff1d in __getpwuid_r (uid=0, resbuf=0x7fd56bbf0c70,
buffer=0x7fd55c2a65f0 "", buflen=1024, result=0x7fd56bbf0cb0)
at ../nss/getXXbyYY_r.c:253
#5 0x00007fd578aebafc in virSetUIDGID (uid=0, gid=0) at util/virutil.c:1031
#6 0x00007fd578aebf43 in virSetUIDGIDWithCaps (uid=0, gid=0, capBits=0,
clearExistingCaps=true) at util/virutil.c:1388
#7 0x00007fd578a9a20b in virExec (cmd=0x7fd55c231f10) at util/vircommand.c:654
#8 0x00007fd578a9dfa2 in virCommandRunAsync (cmd=0x7fd55c231f10, pid=0x0)
at util/vircommand.c:2247
#9 0x00007fd578a9d74e in virCommandRun (cmd=0x7fd55c231f10, exitstatus=0x0)
at util/vircommand.c:2100
#10 0x00007fd56326fde5 in qemuProcessStart (conn=0x7fd53c000df0,
driver=0x7fd55c0dc4f0, vm=0x7fd54800b100, migrateFrom=0x0, stdin_fd=-1,
stdin_path=0x0, snapshot=0x0, vmop=VIR_NETDEV_VPORT_PROFILE_OP_CREATE,
flags=1) at qemu/qemu_process.c:3694
...
The solution is to split the work of getpwuid_r/initgroups into the
unsafe portions (getgrouplist, called pre-fork) and safe portions
(setgroups, called post-fork).
* src/util/virutil.h (virSetUIDGID, virSetUIDGIDWithCaps): Adjust
signature.
* src/util/virutil.c (virSetUIDGID): Add parameters.
(virSetUIDGIDWithCaps): Adjust clients.
* src/util/vircommand.c (virExec): Likewise.
* src/util/virfile.c (virFileAccessibleAs, virFileOpenForked)
(virDirCreate): Likewise.
* src/security/security_dac.c (virSecurityDACSetProcessLabel):
Likewise.
* src/lxc/lxc_container.c (lxcContainerSetID): Likewise.
* configure.ac (AC_CHECK_FUNCS_ONCE): Check for setgroups, not
initgroups.
Signed-off-by: Eric Blake <eblake@redhat.com>
Recent changes uncovered a NEGATIVE_RETURNS in the return from sysconf()
when processing a for loop in virtTestCaptureProgramExecChild() in
testutils.c
Code review uncovered 3 other code paths with the same condition that
weren't found by Covirity, so fixed those as well.
Convert the type of loop iterators named 'i', 'j', k',
'ii', 'jj', 'kk', to be 'size_t' instead of 'int' or
'unsigned int', also santizing 'ii', 'jj', 'kk' to use
the normal 'i', 'j', 'k' naming
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Create parent directroy for hostdev automatically when we
start a lxc domain or attach a hostdev to a lxc domain.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
This helper function is used to create parent directory for
the hostdev which will be added to the container. If the
parent directory of this hostdev doesn't exist, the mknod of
the hostdev will fail. eg with /dev/net/tun
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
User namespaces will deny the ability to mount the SELinux
filesystem. This is harmless for libvirt's LXC needs, so the
error can be ignored.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
user namespace doesn't allow to create devices in
uninit userns. We should create devices on host side.
We first mount tmpfs on dev directroy under state dir
of container. then create devices under this dev dir.
Finally in container, mount the dev directroy created
on host to the /dev/ directroy of container.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
This patch introduces new helper function
virLXCControllerSetupUserns, in this function,
we set the files uid_map and gid_map of the init
task of container.
lxcContainerSetID is used for creating cred for
tasks running in container. Since after setuid/setgid,
we may be a new user. This patch calls lxcContainerSetUserns
at first to make sure the new created files belong to
right user.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
User namespace will be enabled only when the idmap exist
in configuration.
If you want disable user namespace,just remove these
elements from XML.
If kernel doesn't support user namespace and idmap exist
in configuration file, libvirt lxc will start failed and
return "Kernel doesn't support user namespace" message.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Earlier commit f7e8653f dropped support for using LXC with
kernels having single-instance devpts filesystem from the
LXC controller. It forgot to remove the same code from the
LXC container setup.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
After commit c131525bec
"Auto-add a root <filesystem> element to LXC containers on startup"
for libvirt lxc, root must be existent.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Re-add the selinux header to lxc_container.c since other
functions now use it, beyond the patch that was just
reverted.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Before trying to mount the selinux filesystem in a container
use is_selinux_enabled() to check if the machine actually
has selinux support (eg not booted with selinux=0)
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
During startup, the LXC driver uses paths such as
/.oldroot/var/run/libvirt/lxc/...
to access directories from the previous root filesystem
after doing a pivot_root(). Unfortunately if /var/run
is an absolute symlink to /run, instead of a relative
symlink to ../run, these paths break.
At least one Linux distro is known to use an absolute
symlink for /var/run, so workaround this, by resolving
all symlinks before doing the pivot_root().
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
We do not want to allow contained applications to be able to read fusefs_t.
So we want /proc/meminfo label to match the system default proc_t.
Fix checking of error codes
The lxcContainerMountAllFS method had a 'bool skipRoot'
flag to control whether it mounts the / filesystem. Since
removal of the non-pivot root container setup codepaths,
this flag is obsolete as the only caller always passes
'true'.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Many methods accept a string parameter specifying the
old root directory prefix. Since removal of the non-pivot
root container setup codepaths, this parameter is obsolete
in many methods where the callers always pass "/.oldroot".
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
The lxcContainerMountBasicFS method had a 'bool pivotRoot'
flag to control whether it mounted a private /dev. Since
removal of the non-pivot root container setup codepaths,
this flag is obsolete as the only caller always passes
'true'.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
The source code base needs to be adapted as well. Some files
include virutil.h just for the string related functions (here,
the include is substituted to match the new file), some include
virutil.h without any need (here, the include is removed), and
some require both.
The LXC driver currently has code to detect cgroups mounts
and then re-mount them inside the new root filesystem. Replace
this fragile code with a call to virCgroupIsolateMount.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
If the user requests a mount for /run, this may hide any existing
mounts that are lower down in /run. The result is that the
container still sees the mounts in /proc/mounts, but cannot
access them
sh-4.2# df
df: '/run/user/501/gvfs': No such file or directory
df: '/run/media/berrange/LIVE': No such file or directory
df: '/run/media/berrange/SecureDiskA1': No such file or directory
df: '/run/libvirt/lxc/sandbox': No such file or directory
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/vg_t500wlan-lv_root 151476396 135390200 8384900 95% /
tmpfs 1970888 3204 1967684 1% /run
/dev/sda1 194241 155940 28061 85% /boot
devfs 64 0 64 0% /dev
tmpfs 64 0 64 0% /sys/fs/cgroup
tmpfs 1970888 1200 1969688 1% /etc/libvirt-sandbox/scratch
Before mounting any filesystem at a particular location, we
must recursively unmount anything at or below the target mount
point
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Ensure lxcContainerUnmountSubtree is at the top of the
lxc_container.c file so it is easily referenced from
any other method. No functional change
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
This allows a container-type domain to have exclusive access to one of
the host's NICs.
Wire <hostdev caps=net> with the lxc_controller - when moving the newly
created veth devices into a new namespace, also look for any hostdev
devices that should be moved. Note: once the container domain has been
destroyed, there is no code that moves the interfaces back to the
original namespace. This does happen, though, probably due to default
cleanup on namespace destruction.
Signed-off-by: Bogdan Purcareata <bogdan.purcareata@freescale.com>
Currently the LXC container code has two codepaths, depending on
whether there is a <filesystem> element with a target path of '/'.
If we automatically add a <filesystem> device with src=/ and dst=/,
for any container which has not specified a root filesystem, then
we only need one codepath for setting up the filesystem.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
For a root filesystem with type=file or type=block, the LXC
container was forgetting to actually mount it, before doing
the pivot root step.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Currently the lxc controller sets up the devpts instance on
$rootfsdef->src, but this only works if $rootfsdef is using
type=mount. To support type=block or type=file for the root
filesystem, we must use /var/lib/libvirt/lxc/$NAME.devpts
for the temporary devpts mount in the controller
Instead of using /var/lib/libvirt/lxc/$NAME for the FUSE
filesystem, use /var/lib/libvirt/lxc/$NAME.fuse. This allows
room for other temporary mounts in the same directory
In the LXC container startup code when switching stdio
streams, we call VIR_FORCE_CLOSE on all FDs. This triggers
a huge number of warnings, but we don't see them because
stdio is closed at this point. strace() however shows them
which can confuse people debugging the code. Switch to
VIR_MASS_CLOSE to avoid this
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
libvirt lxc will fail to start when selinux is disabled.
error: Failed to start domain noroot
error: internal error guest failed to start: PATH=/bin:/sbin TERM=linux container=lxc-libvirt container_uuid=b9873916-3516-c199-8112-1592ff694a9e LIBVIRT_LXC_UUID=b9873916-3516-c199-8112-1592ff694a9e LIBVIRT_LXC_NAME=noroot /bin/sh
2013-01-09 11:04:05.384+0000: 1: info : libvirt version: 1.0.1
2013-01-09 11:04:05.384+0000: 1: error : lxcContainerMountBasicFS:546 : Failed to mkdir /sys/fs/selinux: No such file or directory
2013-01-09 11:04:05.384+0000: 7536: info : libvirt version: 1.0.1
2013-01-09 11:04:05.384+0000: 7536: error : virLXCControllerRun:1466 : error receiving signal from container: Input/output error
2013-01-09 11:04:05.404+0000: 7536: error : virCommandWait:2287 : internal error Child process (ip link del veth1) unexpected exit status 1: Cannot find device "veth1"
fix this problem by checking if selinuxfs is mounted
in host before we try to create dir /sys/fs/selinux.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
when we has no host's src mapped to container.
there is no .oldroot dir,so libvirt lxc will fail
to start when mouting meminfo.
in this case,the parameter srcprefix of function
lxcContainerMountProcFuse should be NULL.and make
this method handle NULL correctly.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Convert the host capabilities and domain config structs to
use the virArch datatype. Update the parsers and all drivers
to take account of datatype change
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
This extends support for host device passthrough with LXC to
cover misc devices. In this case all we need todo is a
mknod in the container's /dev and whitelist the device in
cgroups
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
This extends support for host device passthrough with LXC to
cover storage devices. In this case all we need todo is a
mknod in the container's /dev and whitelist the device in
cgroups
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
This adds support for host device passthrough with the
LXC driver. Since there is only a single kernel image,
it doesn't make sense to pass through PCI devices, but
USB devices are fine. For the latter we merely need to
make the /dev/bus/usb/NNN/MMM character device exist
in the container's /dev
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Currently LXC guests can be given arbitrary pre-mounted
filesystems, however, for some usecases it is more appropriate
to provide block devices which the container can mount itself.
This first impl only allows for <disk type='block'>, in other
words exposing a host disk device to a container. Since LXC
does not have device namespace virtualization, we are cheating
a little bit. If the XML specifies /dev/sdc4 to be given to
the container as /dev/sda1, when we do the mknod /dev/sda1
in the container's /dev, we actually use the major:minor
number of /dev/sdc4, not /dev/sda1.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
we already have virtualize meminfo for container through fuse filesystem,
add function lxcContainerMountProcFuse to mount this meminfo file to
the container's /proc/meminfo.
So we can isolate container's /proc/meminfo from host now.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Currently the lxcContainerSetupMounts method uses the
virSecurityManagerPtr instance to obtain the mount options
string and then only passes the string down into methods
it calls. As functionality in LXC grows though, those
methods need to have direct access to the virSecurityManagerPtr
instance. So push the code down a level.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
The impls of virSecurityManagerGetMountOptions had no way to
return errors, since the code was treating 'NULL' as a success
value. This is somewhat pointless, since the calling code did
not want NULL in the first place and has to translate it into
the empty string "". So change the code so that the impls can
return "" directly, allowing use of NULL for error reporting
once again
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
The libvirt coding standard is to use 'function(...args...)'
instead of 'function (...args...)'. A non-trivial number of
places did not follow this rule and are fixed in this patch.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
This needs to be done before the container starts. Turning
off the mknod capability is noticed by systemd, which will
no longer attempt to create device nodes.
This eliminates SELinux AVC messages and ugly failure messages in the journal.
Continue consolidation of process functions by moving some
helpers out of command.{c,h} into virprocess.{c,h}
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Some kernel versions (at least RHEL-6 2.6.32) do not let you over-mount
an existing selinuxfs instance with a new one. Thus we must unmount the
existing instance inside our namespace.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
https://www.gnu.org/licenses/gpl-howto.html recommends that
the 'If not, see <url>.' phrase be a separate sentence.
* tests/securityselinuxhelper.c: Remove doubled line.
* tests/securityselinuxtest.c: Likewise.
* globally: s/; If/. If/
The introduction of /sys/fs/cgroup came in fairly recent kernels.
Prior to that time distros would pick a custom directory like
/cgroup or /dev/cgroup. We need to auto-detect where this is,
rather than hardcoding it
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Otherwise, a build may fail with:
lxc/lxc_conatiner.c: In function 'lxcContainerDropCapabilities':
lxc/lxc_container.c:1662:46: error: unused parameter 'keepReboot' [-Werror=unused-parameter]
* src/lxc/lxc_container.c (lxcContainerDropCapabilities): Mark
parameter unused.
Check whether the reboot() system call is virtualized, and if
it is, then allow the container to keep CAP_SYS_REBOOT.
Based on an original patch by Serge Hallyn
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Per the FSF address could be changed from time to time, and GNU
recommends the following now: (http://www.gnu.org/licenses/gpl-howto.html)
You should have received a copy of the GNU General Public License
along with Foobar. If not, see <http://www.gnu.org/licenses/>.
This patch removes the explicit FSF address, and uses above instead
(of course, with inserting 'Lesser' before 'General').
Except a bunch of files for security driver, all others are changed
automatically, the copyright for securify files are not complete,
that's why to do it manually:
src/security/security_selinux.h
src/security/security_driver.h
src/security/security_selinux.c
src/security/security_apparmor.h
src/security/security_apparmor.c
src/security/security_driver.c
Basically within a Secure Linux Container (virt-sandbox) we want all content
that the process within the container can write to be labeled the same. We
are labeling the physical disk correctly but when we create "RAM" based file
systems
libvirt is not labeling them, and they are defaulting to tmpfs_t, which will
will not allow the processes to write. This patch labels the RAM based file
systems correctly.
Previous commits added code to unmount the existing /proc,
/sys and /dev hierarchies on the root filesystem of the
container. This should only have been done if the container's
root filesystem was the same as the host's root. ie if
the root source is '/'. As it is, this causes LXC containersr
to fail to start if their root source is not '/'
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Move the veth device name state into the virLXCControllerPtr
object and stop passing it around. Also use size_t instead
of unsigned int for the array length parameters.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Since we are mounting a new /dev in the container, we must
remove any sub-mounts like /dev/shm, /dev/mqueue, etc,
otherwise they'll be recorded in /proc/mounts, but not be
accessible to applications.
Currently libvirt-lxc checks to see if the destination exists and is a
directory. If it is not a directory then the mount fails. Since
libvirt-lxc can bind mount files on an inode, this patch is needed to
allow us to bind mount files on files. Currently we want to bind mount
on top of /etc/machine-id, and /etc/adjtime
If the destination of the mount point does not exists, it checks if the
src is a directory and then attempts to create a directory, otherwise it
creates an empty file for the destination. The code will then bind mount
over the destination.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Currently you can configure LXC to bind a host directory to
a guest directory, but not to bind a guest directory to a
guest directory. While the guest container init could do
this itself, allowing it in the libvirt XML means a stricter
SELinux policy can be written
Introduce a new syntax for filesystems to allow use of a RAM
filesystem
<filesystem type='ram'>
<source usage='10' units='MiB'/>
<target dir='/mnt'/>
</filesystem>
The usage units default to KiB to limit consumption of host memory.
* docs/formatdomain.html.in: Document new syntax
* docs/schemas/domaincommon.rng: Add new attributes
* src/conf/domain_conf.c: Parsing/formatting of RAM filesystems
* src/lxc/lxc_container.c: Mounting of RAM filesystems
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
when lxcContainerIdentifyCGroups failed, the memory it allocated
has been freed, so we should not free this memory again in
lxcContainerSetupPivortRoot and lxcContainerSetupExtraMounts.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
when libvirt_lxc trigger oom error in lxcContainerGetSubtree
we should free the alloced memory for mounts.
so when lxcContainerGetSubtree failed,we should do some
memory cleanup in lxcContainerUnmountSubtree.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
we alloc the memory for format in lxcContainerMountDetectFilesystem
but without free it in lxcContainerMountFSBlockHelper.
this patch just call VIR_FREE to free it.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
This reverts
commit c16b4c43fc
Author: Daniel P. Berrange <berrange@redhat.com>
Date: Fri May 11 15:09:27 2012 +0100
Avoid LXC pivot root in the root source is still /
This commit broke setup of /dev, because the code which
deals with setting up a private /dev and /dev/pts only
works if you do a pivotroot.
The original intent of avoiding the pivot root was to
try and ensure the new root has a minimumal mount
tree. The better way todo this is to just unmount the
bits we don't want (ie old /proc & /sys subtrees.
So apply the logic from
commit c529b47a75
Author: Daniel P. Berrange <berrange@redhat.com>
Date: Fri May 11 11:35:28 2012 +0100
Trim /proc & /sys subtrees before mounting new instances
to the pivot_root codepath as well
when do remount,the source and target should be the same
values specified in the initial mount() call.
So change fs->dst to src.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Normal practice is for cgroups controllers to be mounted at
/sys/fs/cgroup. When setting up a container, /sys is mounted
with a new sysfs instance, thus we must re-mount all the
cgroups controllers. The complexity is that we must mount
them in the same layout as the host OS. ie if 'cpu' and 'cpuacct'
were mounted at the same location in the host we must preserve
this in the container. Also if any controllers are co-located
we must setup symlinks from the individual controller name to
the co-located mount-point
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Both /proc and /sys may have sub-mounts in them from the host
OS. We must explicitly unmount them all before mounting the
new instance over that location. If we don't then /proc/mounts
will show the sub-mounts as existing, even though nothing will
be able to access them, due to the over-mount.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
If the LXC config has a filesystem
<filesystem>
<source dir='/'/>
<target dir='/'/>
</filesystem>
then there is no need to go down the pivot root codepath.
We can simply use the existing root as needed.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Currently to make sysfs readonly, we remount the existing
instance and then bind it readonly. Unfortunately this means
sysfs is still showing device objects wrt the host OS namespace.
We need it to reflect the container namespace, so we must mount
a completely new instance of it. Do the same for selinuxfs since
there is no benefit to bind mounting & this lets us simplify
the code.
* src/lxc/lxc_container.c: Mount fresh sysfs instance
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Instead of hardcoding use of SELinux contexts in the LXC driver,
switch over to using the official security driver API.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Once lxcContainerSetStdio is invoked, logging will not work as
expected in libvirt_lxc. So make sure this is the last thing to
be called, in particular after setting the security process label
The code is splattered with a mix of
sizeof foo
sizeof (foo)
sizeof(foo)
Standardize on sizeof(foo) and add a syntax check rule to
enforce it
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Pass argv to the init binary of LXC, using a new <initarg> element.
* docs/formatdomain.html.in: Document <os> usage for containers
* docs/schemas/domaincommon.rng: Add <initarg> element
* src/conf/domain_conf.c, src/conf/domain_conf.h: parsing and
formatting of <initarg>
* src/lxc/lxc_container.c: Setup LXC argv
* tests/Makefile.am, tests/lxcxml2xmldata/lxc-systemd.xml,
tests/lxcxml2xmltest.c, tests/testutilslxc.c,
tests/testutilslxc.h: Test parsing/formatting of LXC related
XML parts
The SELinux mount point moved from /selinux to /sys/fs/selinux
when systemd came along.
* configure.ac: Probe for SELinux mount point
* src/lxc/lxc_container.c: Use SELinux mount point determined
by configure.ac
If no <interface> elements are included in an LXC guest XML
description, then the LXC guest will just see the host's
network interfaces. It is desirable to be able to hide the
host interfaces, without having to define any guest interfaces.
This patch introduces a new feature flag <privnet/> to allow
forcing of a private network namespace for LXC. In the future
I also anticipate that we will add <privuser/> to force a
private user ID namespace.
* src/conf/domain_conf.c, src/conf/domain_conf.h: Add support
for <privnet/> feature. Auto-set <privnet> if any <interface>
devices are defined
* src/lxc/lxc_container.c: Honour request for private network
namespace
This patch fixes the access of variable "con" in two files where the
variable was declared only on SELinux builds and thus the build failed
without SELinux. It's a rather nasty fix but helps fix the build
quickly and without any major changes to the code.
To allow the container to access /dev and /dev/pts when under
sVirt, set an explicit mount option. Also set a max size on
the /dev mount to prevent DOS on memory usage
* src/lxc/lxc_container.c: Set /dev mount context
* src/lxc/lxc_controller.c: Set /dev/pts mount context
For the sake of backwards compat, LXC guests are *not*
confined by default. This is because it is not practical
to dynamically relabel containers using large filesystem
trees. Applications can create confined containers though,
by giving suitable XML configs
* src/Makefile.am: Link libvirt_lxc to security drivers
* src/lxc/libvirtd_lxc.aug, src/lxc/lxc_conf.h,
src/lxc/lxc_conf.c, src/lxc/lxc.conf,
src/lxc/test_libvirtd_lxc.aug: Config file handling for
security driver
* src/lxc/lxc_driver.c: Wire up security driver functions
* src/lxc/lxc_controller.c: Add a '--security' flag to
specify which security driver to activate
* src/lxc/lxc_container.c, src/lxc/lxc_container.h: Set
the process label just before exec'ing init.
Systemd detects containers based on whether they have
an environment variable starting with 'container=lxc';
using a longer name fits the expectations, while also
allowing detection of who created the container.
Requested by Lennart Poettering, in response to
https://bugs.freedesktop.org/show_bug.cgi?id=45175
* src/lxc/lxc_container.c (lxcContainerBuildInitCmd): Add another
env-var.
The current setup code for LXC is bind mounting /dev/pts/ptmx
on top of a character device /dev/ptmx. This is denied by SELinux
policy and is just wrong. The target of a bind mount should just
be a plain file
* src/lxc/lxc_container.c: Don't bind /dev/pts/ptmx onto
a char device
Given an LXC guest with a root filesystem path of
/export/lxc/roots/helloworld/root
During startup, we will pivot the root filesystem to end up
at
/.oldroot/export/lxc/roots/helloworld/root
We then try to open
/.oldroot/export/lxc/roots/helloworld/root/dev/pts
Now consider if '/export/lxc' is an absolute symlink pointing
to '/media/lxc'. The kernel will try to open
/media/lxc/roots/helloworld/root/dev/pts
whereas it should be trying to open
/.oldroot//media/lxc/roots/helloworld/root/dev/pts
To deal with the fact that the root filesystem can be moved,
we need to resolve symlinks in *any* part of the filesystem
source path.
* src/libvirt_private.syms, src/util/util.c,
src/util/util.h: Add virFileResolveAllLinks to resolve
all symlinks in a path
* src/lxc/lxc_container.c: Resolve all symlinks in filesystem
paths during startup
Move the virNetDevSetName and virNetDevSetNamespace APIs out
of LXC's veth.c and into virnetdev.c.
Move the remaining content of the file to src/util/virnetdevveth.c
* src/lxc/veth.c: Rename to src/util/virnetdevveth.c
* src/lxc/veth.h: Rename to src/util/virnetdevveth.h
* src/util/virnetdev.c, src/util/virnetdev.h: Add
virNetDevSetName and virNetDevSetNamespace
* src/lxc/lxc_container.c, src/lxc/lxc_controller.c,
src/lxc/lxc_driver.c: Update include paths
The src/lxc/veth.c file contains APIs for managing veth devices,
but some of the APIs duplicate stuff from src/util/virnetdev.h.
Delete thed duplicate APIs and rename the remaining ones to
follow virNetDevVethXXXX
* src/lxc/veth.c, src/lxc/veth.h: Rename APIs & delete duplicates
* src/lxc/lxc_container.c, src/lxc/lxc_controller.c,
src/lxc/lxc_driver.c: Update for API renaming
Currently the LXC controller only supports setup of a single
text console. This is wired up to the container init's stdio,
as well as /dev/console and /dev/tty1. Extending support for
multiple consoles, means wiring up additional PTYs to /dev/tty2,
/dev/tty3, etc, etc. The LXC controller is passed multiple open
file handles, one for each console requested.
* src/lxc/lxc_container.c, src/lxc/lxc_container.h: Wire up
all the /dev/ttyN links required to symlink to /dev/pts/NN
* src/lxc/lxc_container.h: Open more container side /dev/pts/NN
devices, and adapt event loop to handle I/O from all consoles
* src/lxc/lxc_driver.c: Setup multiple host side PTYs
The LXC code for mounting container filesystems from block devices
tries all filesystems in /etc/filesystems and possibly those in
/proc/filesystems. The regular mount binary, however, first tries
using libblkid to detect the format. Add support for doing the same
in libvirt, since Fedora's /etc/filesystems is missing many formats,
most notably ext4 which is the default filesystem Fedora uses!
* src/Makefile.am: Link libvirt_lxc to libblkid
* src/lxc/lxc_container.c: Probe filesystem format with libblkid
If we looped through /etc/filesystems trying to mount with each
type and failed all options, we forget to actually raise an
error message.
* src/lxc/lxc_container.c: Raise error if unable to detect
the filesystems. Also fix existing error message
The kernel automounter is mostly broken wrt to containers. Most
notably if you start a new filesystem namespace and then attempt
to unmount any autofs filesystem, it will typically fail with a
weird error message like
Failed to unmount '/.oldroot/sys/kernel/security':Too many levels of symbolic links
Attempting to detach the autofs mount using umount2(MNT_DETACH)
will also fail with the same error. Therefore if we get any error on
unmount()ing a filesystem from the old root FS when starting a
container, we must immediately break out and detach the entire
old root filesystem (ignoring any mounts below it).
This has the effect of making the old root filesystem inaccessible
to anything inside the container, but at the cost that the mounts
live on in the kernel until the container exits. Given that SystemD
uses autofs by default, we need LXC to be robust this scenario and
thus this tradeoff is worthwhile.
* src/lxc/lxc_container.c: Detach root filesystem if any umount
operation fails.
The /etc/filesystems file can contain a '*' on the last line to
indicate that /proc/filessystems should be tried next. We have
a check that this '*' only occurs on the last line. Unfortunately
when we then start reading /proc/filesystems, we mistakenly think
we've seen '*' in /proc/filesystems and fail
* src/lxc/lxc_container.c: Skip '*' validation when we're reading
/proc/filesystems
Only some of the return paths of lxcContainerWaitForContinue will
have set errno. In other paths we need to set it manually to avoid
the caller getting a random stale errno value
* src/lxc/lxc_container.c: Set errno in lxcContainerWaitForContinue
Based on a report by Coverity. waitpid() can leak resources if it
fails with EINTR, so it should never be used without checking return
status. But we already have a helper function that does that, so
use it in more places.
* src/lxc/lxc_container.c (lxcContainerAvailable): Use safer
virWaitPid.
* daemon/libvirtd.c (daemonForkIntoBackground): Likewise.
* tests/testutils.c (virtTestCaptureProgramOutput, virtTestMain):
Likewise.
* src/libvirt.c (virConnectAuthGainPolkit): Simplify with virCommand.
When booting a virtual machine with a kernel/initrd it is possible
to pass command line arguments using the <cmdline>...args...</cmdline>
element in the guest XML. These appear to the kernel / init process
in /proc/cmdline.
When booting a container we do not have a custom /proc/cmdline,
but we can easily set an environment variable for it. Ideally
we could pass individual arguments to the init process as a
regular set of 'char *argv[]' parameters, but that would involve
libvirt parsing the <cmdline> XML text. This can easily be added
later, even if we add the env variable now
* docs/drvlxc.html.in: Document env variables passed to LXC
* src/conf/domain_conf.c: Add <cmdline> to be parsed for
guests of type='exe'
* src/lxc/lxc_container.c: Set LIBVIRT_LXC_CMDLINE env var
Hi,
I'm seeing an issue with udev and libvirt-lxc. Libvirt-lxc creates
/dev/ptmx as a symlink to /dev/pts/ptmx. When udev starts up, it
checks the device type, sees ptmx is 'not right', and replaces it
with a 'proper' ptmx.
In lxc, /dev/ptmx is bind-mounted from /dev/pts/ptmx instead of being
symlinked, so udev sees the right device type and leaves it alone.
A patch like the following seems to work for me. Would there be
any objections to this?
>From 4c5035de52de7e06a0de9c5d0bab8c87a806cba7 Mon Sep 17 00:00:00 2001
From: Ubuntu <ubuntu@domU-12-31-39-14-F0-B3.compute-1.internal>
Date: Wed, 31 Aug 2011 18:15:54 +0000
Subject: [PATCH 1/1] make ptmx a bind mount rather than symlink
udev on some systems checks the device type of /dev/ptmx, and replaces it if
not as expected. The symlink created by libvirt-lxc therefore gets replaced.
By creating it as a bind mount, the device type is correct and udev leaves it
alone.
Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
A previous commit gave the LXC driver the ability to mount
block devices for the container filesystem. Through use of
the loopback device functionality, we can build on this to
support use of plain file images for LXC filesytems.
By setting the LO_FLAGS_AUTOCLEAR flag we can ensure that
the loop device automatically disappears when the container
dies / shuts down
* src/lxc/lxc_container.c: Raise error if we see a file
based filesystem, since it should have been turned into
a loopback device already
* src/lxc/lxc_controller.c: Rewrite any filesystems of
type=file, into type=block, by binding the file image
to a free loop device
Currently the LXC driver can only populate filesystems from
host filesystems, using bind mounts. This patch allows host
block devices to be mounted. It autodetects the filesystem
format at mount time, and adds the block device to the cgroups
ACL. Example usage is
<filesystem type='block' accessmode='passthrough'>
<source dev='/dev/sda1'/>
<target dir='/home'/>
</filesystem>
* src/lxc/lxc_container.c: Mount block device filesystems
* src/lxc/lxc_controller.c: Add block device filesystems
to cgroups ACL
An application container shouldn't get a private /dev. Fix
the regression from 6d37888e6a
* src/lxc/lxc_container.c: Don't mount /dev for app containers
A container should not be allowed to modify stuff in /sys
or /proc/sys so make them readonly. Make /selinux readonly
so that containers think that selinux is disabled.
Honour the readonly flag when mounting container filesystems
from the guest XML config
* src/lxc/lxc_container.c: Support readonly mounts
Even in non-virtual root filesystem mode we should be mounting
more than just a new /proc. Refactor lxcContainerMountBasicFS
so that it does everything except for /dev and /dev/pts moving
that into lxcContainerMountDevFS. Pass in a source prefix
to lxcContainerMountBasicFS() so it can be used in both shared
root and private root modes.
* src/lxc/lxc_container.c: Unify mounting code for special
filesystems
The bind mount setup is about to get more complicated.
To avoid having to deal with several copies, pull it
out into a separate lxcContainerMountFSBind method.
Also pull out the iteration over container filesystems,
so that it will be easier to drop in support for non-bind
mount filesystems
* src/lxc/lxc_container.c: Pull bind mount code out into
lxcContainerMountFSBind