Since commit 47e5b5ae virCgroupAllowDevice allows to pass -1 as either
the minor or major device number and it automatically uses '*' in place
of that. Reuse the new approach through the code and drop the duplicated
functions.
In the commit 7938b533 we've changed the function signature,
however forgot to update stump that's used on systems without
CGroups causing a build failure.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
So, systemd-machined has this philosophy that machine names are like
hostnames and hence should follow the same rules. But we always allowed
international characters in domain names. Thus we need to modify the
machine name we are passing to systemd.
In order to change some machine names that we will be passing to systemd,
we also need to call TerminateMachine at the end of a lifetime of a
domain. Even for domains that were started with older libvirt. That
can be achieved thanks to virSystemdGetMachineNameByPID(). And because
we can change machine names, we can get rid of the inconsistent and
pointless escaping of domain names when creating machine names.
So this patch modifies the naming in the following way. It creates the
name as <drivername>-<id>-<name> where invalid hostname characters are
stripped out of the name and if the resulting name is longer, it
truncates it to 64 characters. That way we can start domains we
couldn't start before. Well, at least on systemd.
To make it work all together, the machineName (which is needed only with
systemd) is saved in domain's private data. That way the generation is
moved to the driver and we don't need to pass various unnecessary
arguments to cgroup functions.
The only thing this complicates a bit is the scope generation when
validating a cgroup where we must check both old and new naming, so a
slight modification was needed there.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1282846
Signed-off-by: Martin Kletzander <mkletzan@redhat.com>
Use 'ret' for return variable name, clarify use of 'param_idx' and avoid
unnecessary 'success' label. No functional changes. Also document the
function.
In dc576025c3 we renamed virCgroupIsolateMount function to
virCgroupBindMount. However, we forgot about one occurrence in
section of the code which provides stubs for platforms without
support for CGroups like *BSD for instance.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
On the host when we start a container, it will be
placed in a cgroup path of
/machine.slice/machine-lxc\x2ddemo.scope
under /sys/fs/cgroup/*
Inside the containers' namespace we need to setup
/sys/fs/cgroup mounts, and currently will bind
mount /machine.slice/machine-lxc\x2ddemo.scope on
the host to appear as / in the container.
While this may sound nice, it confuses applications
dealing with cgroups, because /proc/$PID/cgroup
now does not match the directory in /sys/fs/cgroup
This particularly causes problems for systems and
will make it create repeated path components in
the cgroup for apps run in the container eg
/machine.slice/machine-lxc\x2ddemo.scope/machine.slice/machine-lxc\x2ddemo.scope/user.slice/user-0.slice/session-61.scope
This also causes any systemd service that uses
sd-notify to fail to start, because when systemd
receives the notification it won't be able to
identify the corresponding unit it came from.
In particular this break rabbitmq-server startup
Future kernels will provide proper cgroup namespacing
which will handle this problem, but until that time
we should not try to play games with hiding parent
cgroups.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
As cgroup implementation only works on Linux, it does not
make much sense to include sys/mount.h if other requirements are
not met, such as HAVE_MNTENT_H and HAVE_GETMNTENT_R.
Also, it fixes build on OpenBSD that requires to include sys/param.h
along with sys/mount.h.
Signed-off-by: Roman Bogorodskiy <bogorodskiy@gmail.com>
The manpage for sysconf() suggest including unistd.h as the
function is declared there. Even though we are not hitting any
compile issues currently, let's include the correct header file
instead of relying on some hidden include chain.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
virCgroupNewMachine used to add the pidleader to the newly created
machine cgroup. Do not do this implicit anymore.
Signed-off-by: Henning Schild <henning.schild@siemens.com>
Commit 89c509a0 added getters for cgroup block device I/O throttling,
however stub versions of these functions have not matching function
prototypes that result in compilation fail on platforms not supporting
cgroup.
Fix build by correcting prototypes of the stubbed functions.
Pushing under build-breaker rule.
This function translates device paths to "major:minor " string, and all
virCgroupSetBlkioDevice* functions are modified to use it. It's a
cleanup with no functional change.
Signed-off-by: Martin Kletzander <mkletzan@redhat.com>
The scope name, even according to our docs is
"machine-$DRIVER\x2d$VMNAME.scope" virSystemdMakeScopeName would use the
resource partition name instead of "machine-" if it was specified thus
creating invalid scope paths.
This makes libvirt drop cgroups for a VM that uses custom resource
partition upon reconnecting since the detected scope name would not
match the expected name generated by virSystemdMakeScopeName.
The error is exposed by the following log entry:
debug : virCgroupValidateMachineGroup:302 : Name 'machine-qemu\x2dtestvm.scope' for controller 'cpu' does not match 'testvm', 'testvm.libvirt-qemu' or 'machine-test-qemu\x2dtestvm.scope'
for a "/machine/test" resource and "testvm" vm.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1238570
Create a new common API to replace the virCgroupNew{Vcpu|Emulator|IOThread}
API's using an emum to generate the cgroup name
Signed-off-by: John Ferlan <jferlan@redhat.com>
This new internal API checks if given CGroup controller is
available. It is going to be needed later when we need to make a
decision whether pin domain memory onto NUMA nodes using cpuset
CGroup controller or using numa_set_membind().
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
When creating new internal representation of cgroups, all passed
arguments are logged. Well, except for two: pid and pointer for
return value. Lets log them too.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
The function has no argument named @name rather than @path
instead. The comment is, however, referring to @name while it
should have been referring to @path really.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Commit id '2dbfa716' exposed virCgroupDetectMountsFromFile, but did not
add the corresponding entry in the "#else /* !VIR_CGROUP_SUPPORTED */"
section of the module.
Commit id 'ba1dfc5' added virCgroupSetCpusetMemoryMigrate and
virCgroupGetCpusetMemoryMigrate, but did not add the corresponding
entry points into the "#else /* !VIR_CGROUP_SUPPORTED */" section
A helper that never returns an error and treats bits out of bitmap range
as false.
Use it everywhere we use ignore_value on virBitmapGetBit, or loop over
the bitmap size.
Coverity reports that my commit af1c98e introduced
two memory leaks:
the cpumap if ncpus == 0 in virCgroupGetPercpuStats
and the params array in the test of the function.
Per-cpu stats are only shown for present CPUs in the cgroups,
but we were only parsing the largest CPU number from
/sys/devices/system/cpu/present and looking for stats even for
non-present CPUs.
This resulted in:
internal error: cpuacct parse error
systemd-machined introduced a new method CreateMachineWithNetwork
that obsoletes CreateMachine. It expects to be given a list of
VETH/TAP device indexes for the host side device(s) associated
with a container/machine.
This falls back to the old CreateMachine method when the new
one is not supported.
Commit 1a80b97d, which added the virCgroupHasEmptyTasks() function
forgot that the parameter @cgroup may be NULL and did not check that.
Signed-off-by: Martin Kletzander <mkletzan@redhat.com>
When calling virCgroupAllowAllDevices we get these invalid entries
in the device cgroup config.
b -1:-1 rw
c -1:-1 rw
Check for positive values before outputting the major and minor to
avoid that.
I noticed this while working on qemuDomainGetBlockInfo. Assigning
a bool value to an int variable compiles fine, but raises red flags
on the maintenance front as it becomes too easy to assign -1 or 2
or any other non-bool value to the same variable.
* cfg.mk (sc_prohibit_int_assign_bool): New rule.
* src/conf/snapshot_conf.c (virDomainSnapshotRedefinePrep): Fix
offenders.
* src/qemu/qemu_driver.c (qemuDomainGetBlockInfo)
(qemuDomainSnapshotCreateXML): Likewise.
* src/test/test_driver.c (testDomainSnapshotAlignDisks):
Likewise.
* src/util/vircgroup.c (virCgroupSupportsCpuBW): Likewise.
* src/util/virpci.c (virPCIDeviceBindToStub): Likewise.
* src/util/virutil.c (virIsCapableVport): Likewise.
* tools/virsh-domain-monitor.c (cmdDomMemStat): Likewise.
* tools/virsh-domain.c (cmdBlockResize, cmdScreenshot)
(cmdInjectNMI, cmdSendKey, cmdSendProcessSignal)
(cmdDetachInterface): Likewise.
Signed-off-by: Eric Blake <eblake@redhat.com>
If we don't properly clean up all processes in the
machine-<vmname>.scope systemd won't remove the cgroup and subsequent vm
starts fail with
'CreateMachine: File exists'
Additional processes can e.g. be added via
echo $PID > /sys/fs/cgroup/systemd/machine.slice/machine-${VMNAME}.scope/tasks
but there are other cases like
http://bugs.debian.org/761521
Invoke TerminateMachine to be on the safe side since systemd tracks the
cgroup anyway. This is a noop if all processes have terminated already.
Added <capabilities> in the <features> section of LXC domains
configuration. This section can contain elements named after the
capabilities like:
<mknod state="on"/>, keep CAP_MKNOD capability
<sys_chroot state="off"/> drop CAP_SYS_CHROOT capability
Users can restrict or give more capabilities than the default using
this mechanism.
Commit a48f445100 introduced a helper
function to convert cgroup device mode to string. The function was only
conditionally compiled on platforms that support cgroup. This broke the
build when attempting to export the symbol:
CCLD libvirt.la
Cannot export virCgroupGetDevicePermsString: symbol not defined
Move the function out of the ifdef, as it doesn't really depend on the
cgroup code being present.
Cgroups code uses VIR_CGROUP_DEVICE_* flags to specify the mode but in
the end it needs to be converted to a string. Add a helper to do it and
use it in the cgroup code before introducing it into the rest of the
code.
In making the conversion to the new API, I fixed a couple bugs:
virSCSIDeviceGetSgName would leak memory if a directory
unexpectedly contained multiple entries;
virNetDevTapGetRealDeviceName could report a spurious error
from a stale errno inherited before starting the readdir search.
The decision on whether to store the result of virDirRead into
a variable is based on whether the end of the loop falls through
to cleanup code automatically. In some cases, we have loops that
are documented to return NULL on failure, and which raise an
error on most failure paths but not in the case where the directory
was unexpectedly empty; it may be worth a followup patch to
explicitly report an error if readdir was successful but the
directory was empty, so that a NULL return always has an error set.
* src/util/vircgroup.c (virCgroupRemoveRecursively): Use new
interface.
(virCgroupKillRecursiveInternal, virCgroupSetOwner): Report
readdir failures.
* src/util/virfile.c (virFileLoopDeviceOpenSearch)
(virFileNBDDeviceFindUnused, virFileDeleteTree): Use new
interface.
* src/util/virnetdevtap.c (virNetDevTapGetRealDeviceName):
Properly check readdir errors.
* src/util/virpci.c (virPCIDeviceIterDevices)
(virPCIDeviceFileIterate, virPCIGetNetName): Report readdir
failures.
(virPCIDeviceAddressIOMMUGroupIterate): Use new interface.
* src/util/virscsi.c (virSCSIDeviceGetSgName): Report readdir
failures, and avoid memory leak.
(virSCSIDeviceGetDevName): Report readdir failures.
* src/util/virusb.c (virUSBDeviceSearch): Report readdir
failures.
* src/util/virutil.c (virGetFCHostNameByWWN)
(virFindFCHostCapableVport): Report readdir failures.
Signed-off-by: Eric Blake <eblake@redhat.com>
The iterator is checked for being less than or equal to need_cpus.
The 'n' variable is incremented need_cpus + 1 times.
Simplify the computation of need_cpus and make its value one larger,
to let it be used instead of 'n' and compared without the equal sign
in loop conditions.
Just index the sum_cpu_time array instead of using a helper variable.
Start the loop at start_cpu instead of continuing for all lower values.
total_cpus is the total number of CPUs on the host
need_cpus is the number of CPUs we need to look at
(need_cpus can be larger than ncpus, because we need to look
at CPUs before the startcpu too, even if we aren't reporting
their stats)
Currently, virCgroupGetPercpuStats is only used by the LXC driver,
filling out the CPUTIME stats. qemuDomainGetPercpuStats does this
and also filles out VCPUTIME stats.
Extend virCgroupGetPercpuStats to also report VCPUTIME stats if
nvcpupids is non-zero. In the LXC driver, we don't have cpupids.
In the QEMU driver, there is at least one cpupid for a running domain,
so the behavior shouldn't change for QEMU either.
Also rename getSumVcpuPercpuStats to virCgroupGetPercpuVcpuSum.
When I start multi VMs coincidently and any of the cgroup directories
named machine doesn't exist. There's a chance that VM start failed because
of creating directory failed:
Unable to initialize /machine cgroup: File exists
When the errno returned by mkdir in virCgroupMakeGroup is EEXIST,
we should pass it through and continue to start the VM.
Signed-off-by: Wang Yufei <james.wangyufei@huawei.com>
Any source file which calls the logging APIs now needs
to have a VIR_LOG_INIT("source.name") declaration at
the start of the file. This provides a static variable
of the virLogSource type.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Commit a1cbe4b5 added a check for spaces around assignments and this
patch extends it to checks for spaces around '=='. One exception is
virAssertCmpInt where comma after '==' is acceptable (since it is a
macro and '==' is its argument).
Signed-off-by: Martin Kletzander <mkletzan@redhat.com>
Running ./autobuild.sh detected a mingw failure:
CCLD libvirt.la
Cannot export virCgroupGetPercpuStats: symbol not defined
Cannot export virCgroupSetOwner: symbol not defined
* src/util/vircgroup.c (virCgroupGetPercpuStats)
(virCgroupSetOwner): Implement stubs.
Signed-off-by: Eric Blake <eblake@redhat.com>
This function is needed for user namespaces, where we need to chmod()
the cgroup to the initial uid/gid such that systemd is allowed to
use the cgroup.
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Commit 6515889 broke the build on FreeBSD:
In function `qemuDomainGetCPUStats':
/../../src/qemu/qemu_driver.c:16102:
undefined reference to `virCgroupGetDomainTotalCpuStats'
This patch introduces virCgroupSetBlkioDeviceReadIops,
virCgroupSetBlkioDeviceWriteIops,
virCgroupSetBlkioDeviceReadBps and
virCgroupSetBlkioDeviceWriteBps,
we can use these interfaces to set up throttle
blkio cgroup for domain.
This patch also adds the new throttle blkio cgroup
elements to the test xml.
Signed-off-by: Guan Qiang <hzguanqiang@corp.netease.com>
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Since kernel 3.12 (commit 34ff8dc08956098563989d8599840b130be81252 in
linux-stable.git in particular) the value for 'unlimited' in cgroup
memory limits changed from LLONG_MAX to ULLONG_MAX. Due to rather
unfortunate choice of our VIR_DOMAIN_MEMORY_PARAM_UNLIMITED constant
(which we transfer as an unsigned long long in Kibibytes), we ended up
with the situation described below (applies to x86_64):
- 2^64-1 (ULLONG_MAX) -- "unlimited" in kernel = 3.12
- 2^63-1 (LLONG_MAX) -- "unlimited" in kernel < 3.12
- 2^63-1024 -- our PARAM_UNLIMITED scaled to Bytes
- 2^53-1 -- our PARAM_UNLIMITED unscaled (in Kibibytes)
This means that when any number within (2^63-1, 2^64-1] is read from
memory cgroup, we are transferring that number instead of "unlimited".
Unfortunately, changing VIR_DOMAIN_MEMORY_PARAM_UNLIMITED would break
ABI compatibility and thus we have to resort to a different solution.
With this patch every value greater than PARAM_UNLIMITED means
"unlimited". Even though this may seem misleading, we are already in
such unclear situation when running 3.12 kernel with memory limits set
to 2^63.
One example showing most of the problems at once (with kernel 3.12.2):
# virsh memtune asdf --hard-limit 9007199254740991 --swap-hard-limit -1
# echo 12345678901234567890 >\
/sys/fs/cgroup/memory/machine/asdf.libvirt-qemu/memory.soft_limit_in_bytes
# virsh memtune asdf
hard_limit : 18014398509481983
soft_limit : 12056327051986884
swap_hard_limit: 18014398509481983
Signed-off-by: Martin Kletzander <mkletzan@redhat.com>
The range of valid values for cgroup tunables has
changed in the past and may change again in future
kernels. Avoid hardcoding range checks in libvirt
code, delegating range checking to the kernel itself.
Signed-off-by: Chen Hanxiao <chenhanxiao@cn.fujitsu.com>
When EINVAL is returned while changing a cgroups value, tell
user that what values are invalid for the field.
Signed-off-by: Chen Hanxiao <chenhanxiao@cn.fujitsu.com>
Debian systems may run the 'systemd-logind' daemon, which causes the
/sys/fs/cgroup/systemd mount to be setup, but no other cgroup
controllers are created. While the LXC driver considers cgroups to
be mandatory, the QEMU driver is supposed to accept them as optional.
We detect whether they are present by looking in /proc/mounts for
any mounts of type 'cgroups', but this is not sufficient. We need to
skip any named mounts (as seen by a name=XXX string in the mount
options), so that we only detect actual resource controllers.
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=721979
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Some users in Ubuntu/Debian seem to have a setup where all the
cgroup controllers are mounted on /sys/fs/cgroup rather than
any /sys/fs/cgroup/<controller> name. In the loop which detects
which controllers are present for a mount point we were modifying
'mnt_dir' field in the 'struct mntent' var, but not always restoring
the original value. This caused detection to break in the all-in-one
mount setup.
Fix that logic bug and add test case coverage for this mount
setup.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
- Convert virCgroupGet* to VIR_CGROUP_SUPPORTED
- Convert virCgroup(Get|Set)FreezerState to VIR_CGROUP_SUPPORTED
Signed-off-by: Eric Blake <eblake@redhat.com>
- Introduce VIR_CGROUP_SUPPORTED conditional
- Convert virCgroupKill* to use it
- Convert virCgroupIsolateMount() to use it
- Convert virCgroupRemoveRecursively to VIR_CGROUP_SUPPORTED
Signed-off-by: Eric Blake <eblake@redhat.com>
Make future patches smaller by matching a sane header listing in
the first place. No semantic change.
* src/util/vircgroup.h: Move free next to new, and controller
functions next to each other.
* src/util/vircgroup.c (virCgroupFree, virCgroupHasController)
(virCgroupPathOfController, virCgroupRemoveRecursively)
(virCgroupRemove): Sort implementation to be closer to header.
Signed-off-by: Eric Blake <eblake@redhat.com>
Avoid a forward declaration of a static function.
* src/util/vircgroup.c (virCgroupPartitionNeedsEscaping)
(virCgroupParticionEscape): Move up.
Signed-off-by: Eric Blake <eblake@redhat.com>
Format all functions with two blank lines between, and return type
on separate line from function name. Also break some lines longer
than 80 columns. This makes the subsequent macro refactoring
less noisy.
* src/util/vircgroup.c: Match prevailing style.
Signed-off-by: Eric Blake <eblake@redhat.com>
Make the virCgroupNewMachine method try to use systemd-machined
first. If that fails, then fallback to using the traditional
cgroup setup code path.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
When systemd is involved in managing processes, it may start
killing off & tearing down croups associated with the process
while we're still doing virCgroupKillPainfully. We must
explicitly check for ENOENT and treat it as if we had finished
killing processes
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Systemd uses a named cgroup mount for tracking processes. Add
it as another type of controller, albeit one which we have to
special case in a number of places. In particular we must
never create/delete directories there, nor add tasks. Essentially
the systemd mount is to be considered read-only for libvirt.
With this change both the virCgroupDetectPlacement and
virCgroupCopyPlacement methods must be invoked. The copy
placement method will copy setup for resource controllers
only. The detect placement method will probe for any
named controllers, or resource controllers not already
setup.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
The previous patch was incomplete.
CC libvirt_util_la-vircgroup.lo
../../src/util/vircgroup.c:70:12: error: 'virCgroupPartitionEscape' declared 'static' but never defined [-Werror=unused-function]
static int virCgroupPartitionEscape(char **path);
^
* src/util/vircgroup.c (virCgroupPartitionEscape): Move forward
declaration inside conditional.
Signed-off-by: Eric Blake <eblake@redhat.com>
The virCgroupValidateMachineGroup method calls some functions
which are only conditionally compiled, thus it too must be
made conditional. This fixes the build on non-Linux hosts.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
If the app has provided a whitelist of controllers to be used,
we skip detecting its mount point. We still, however, fill in
the placement info which later confuses the machine name
validation code. Skip detecting placement if the controller
mount point is not set
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
When a VM has an 'emulator' child cgroup present, we must
strip off that suffix when detecting the cgroup for a
machine
Rename the virCgroupIsValidMachineGroup method to
virCgroupValidateMachineGroup to make a bit clearer
that this isn't simply a boolean check, it will make
changes to the object.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
The virCgroupIsValidMachine does not need to be called from
outside the cgroups file now, so make it static.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Instead of requiring drivers to use a combination of calls
to virCgroupNewDetect and virCgroupIsValidMachine, combine
the two into virCgroupNewDetectMachine
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Add protection such that the virCgroupRemove and
virCgroupKill* do not do anything to the root cgroup.
Killing all PIDs in the root cgroup does not end well.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Instead of requiring one API call to create a cgroup and
another to add a task to it, introduce a new API
virCgroupNewMachine which does both jobs at once. This
will facilitate the later code to talk to systemd to
achieve this job which is also atomic.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
virCgroupAvailable() implementation calls getmntent_r
without checking if HAVE_GETMNTENT_R is defined, so it fails
to build on platforms without getmntent_r support.
Make virCgroupAvailable() just return false without
HAVE_GETMNTENT_R.
The virCgroupNewDomainDriver and virCgroupNewDriver methods
are obsolete now that we can auto-detect existing cgroup
placement. Delete them to reduce code bloat.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Add virCgroupIsValidMachine API to check whether an auto
detected cgroup is valid for a machine. This lets us
check if a VM has just been placed into some generic
shared cgroup, or worse, the root cgroup
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Add a virCgroupNewDetect API which is used to initialize a
cgroup object with the placement of an arbitrary process.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Convert the remaining methods in vircgroup.c to report errors
instead of returning errno values.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Reuse the buffer for getline and track buffer allocation
separately from the string length to prevent unlikely
out-of-bounds memory access.
This fixes the following leak that happened when zero bytes were read:
==404== 120 bytes in 1 blocks are definitely lost in loss record 1,344 of 1,671
==404== at 0x4C2C71B: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==404== by 0x906F862: getdelim (iogetdelim.c:68)
==404== by 0x52A48FB: virCgroupPartitionNeedsEscaping (vircgroup.c:1136)
==404== by 0x52A0FB4: virCgroupPartitionEscape (vircgroup.c:1171)
==404== by 0x52A0EA4: virCgroupNewDomainPartition (vircgroup.c:1450)