As documented in linux.git/Documentation/cgroups/cpuacct.txt,
cpuacct.stat returns user and system time in ticks (the same
unit used in times(2)). It would be a bit nicer if it were like
getrusage(2) and reported timeval contents, or like cpuacct.usage
and in nanoseconds, but we can't be picky.
* src/util/cgroup.h (virCgroupGetCpuacctStat): New function.
* src/util/cgroup.c (virCgroupGetCpuacctStat): Implement it.
(virCgroupGetValueStr): Allow for multi-line files.
* src/libvirt_private.syms (cgroup.h): Export it.
* For now, only "cpu_time" is supported.
* cpuacct cgroup is used for providing percpu cputime information.
* src/qemu/qemu.conf - take care of cpuacct cgroup.
* src/qemu/qemu_conf.c - take care of cpuacct cgroup.
* src/qemu/qemu_driver.c - added an interface
* src/util/cgroup.c/h - added interface for getting percpu cputime
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
No thanks to 64-bit windows, with 64-bit pid_t, we have to avoid
constructs like 'int pid'. Our API in libvirt-qemu cannot be
changed without breaking ABI; but then again, libvirt-qemu can
only be used on systems that support UNIX sockets, which rules
out Windows (even if qemu could be compiled there) - so for all
points on the call chain that interact with this API decision,
we require a different variable name to make it clear that we
audited the use for safety.
Adding a syntax-check rule only solves half the battle; anywhere
that uses printf on a pid_t still needs to be converted, but that
will be a separate patch.
* cfg.mk (sc_correct_id_types): New syntax check.
* src/libvirt-qemu.c (virDomainQemuAttach): Document why we didn't
use pid_t for pid, and validate for overflow.
* include/libvirt/libvirt-qemu.h (virDomainQemuAttach): Tweak name
for syntax check.
* src/vmware/vmware_conf.c (vmwareExtractPid): Likewise.
* src/driver.h (virDrvDomainQemuAttach): Likewise.
* tools/virsh.c (cmdQemuAttach): Likewise.
* src/remote/qemu_protocol.x (qemu_domain_attach_args): Likewise.
* src/qemu_protocol-structs (qemu_domain_attach_args): Likewise.
* src/util/cgroup.c (virCgroupPidCode, virCgroupKillInternal):
Likewise.
* src/qemu/qemu_command.c(qemuParseProcFileStrings): Likewise.
(qemuParseCommandLinePid): Use pid_t for pid.
* daemon/libvirtd.c (daemonForkIntoBackground): Likewise.
* src/conf/domain_conf.h (_virDomainObj): Likewise.
* src/probes.d (rpc_socket_new): Likewise.
* src/qemu/qemu_command.h (qemuParseCommandLinePid): Likewise.
* src/qemu/qemu_driver.c (qemudGetProcessInfo, qemuDomainAttach):
Likewise.
* src/qemu/qemu_process.c (qemuProcessAttach): Likewise.
* src/qemu/qemu_process.h (qemuProcessAttach): Likewise.
* src/uml/uml_driver.c (umlGetProcessInfo): Likewise.
* src/util/virnetdev.h (virNetDevSetNamespace): Likewise.
* src/util/virnetdev.c (virNetDevSetNamespace): Likewise.
* tests/testutils.c (virtTestCaptureProgramOutput): Likewise.
* src/conf/storage_conf.h (_virStoragePerms): Use mode_t, uid_t,
and gid_t rather than int.
* src/security/security_dac.c (virSecurityDACSetOwnership): Likewise.
* src/conf/storage_conf.c (virStorageDefParsePerms): Avoid
compiler warning.
On RHEL5, I got:
util/virrandom.c:66: warning: nested extern declaration of '_gl_verify_function66' [-Wnested-externs]
The fix is to hoist the verify earlier. Also some other hodge-podge
fixes I noticed while reviewing Dan's recent series.
* .gitignore: Ignore new test.
* src/util/cgroup.c: Bump copyright year.
* src/util/virhash.c: Fix typo in description.
* src/util/virrandom.c (virRandomBits): Mark doc comment, and
hoist assert to silence older gcc.
Recent discussions have illustrated the potential for DOS attacks
with the hash table implementations used by most languages and
libraries.
https://lwn.net/Articles/474912/
libvirt has an internal hash table impl, and uses hash tables for
a variety of purposes. The hash key generation code is pretty
simple and thus not strongly collision resistant.
This patch replaces the current libvirt hash key generator with
the (public domain) Murmurhash3 code. In addition every hash
table now gets a random seed value which is used to perturb the
hashing code. This should make it impossible to mount any
practical attack against libvirt hashing code.
* bootstrap.conf: Import bitrotate module
* src/Makefile.am: Add virhashcode.[ch]
* src/util/util.c: Make virRandom() return a fixed 32 bit
integer value.
* src/util/hash.c, src/util/hash.h, src/util/cgroup.c: Replace
hash code generation with a call to virHashCodeGen()
* src/util/virhashcode.h, src/util/virhashcode.c: Add a new
virHashCodeGen() API using the Murmurhash3 algorithm.
In preparation for the patch to include Murmurhash3, which
introduces a virhashcode.h and virhashcode.c files, rename
the existing hash.h and hash.c to virhash.h and virhash.c
respectively.
In preparation for conversion over to use the Murmurhash3
algorithm, convert various virHash APIs to use size_t or
uint32 for their return values/parameters, instead of the
variable size 'unsigned long' or 'int' types
It is possible (expected/likely in Fedora 15) for a cgroup controller
to be mounted in multiple locations at the same time, due to bind
mounts. Currently we leak memory if this happens, because we overwrite
the previous 'mountPoint' string. Instead just accept the first match
we find.
* src/util/cgroup.c: Only accept first match for a cgroup
controller mount
Coverity noted that most clients reacted to failure to hash; but in
a best-effort kill loop, we can ignore failure.
* src/util/cgroup.c (virCgroupKillInternal): Ignore hash failure.
These VIR_XXXX0 APIs make us confused, use the non-0-suffix APIs instead.
How do these coversions works? The magic is using the gcc extension of ##.
When __VA_ARGS__ is empty, "##" will swallow the "," in "fmt," to
avoid compile error.
example: origin after CPP
high_level_api("%d", a_int) low_level_api("%d", a_int)
high_level_api("a string") low_level_api("a string")
About 400 conversions.
8 special conversions:
VIR_XXXX0("") -> VIR_XXXX("msg") (avoid empty format) 2 conversions
VIR_XXXX0(string_literal_with_%) -> VIR_XXXX(%->%%) 0 conversions
VIR_XXXX0(non_string_literal) -> VIR_XXXX("%s", non_string_literal)
(for security) 6 conversions
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Clang detected a dead store to rc. It turns out that in fixing this,
I also found a FILE* leak.
This is a subtle change in behavior, although unlikely to hit. The
pidfile is a kernel file, so we've probably got more serious problems
under foot if we fail to parse one. However, the previous behavior
was that even if one pid file failed to parse, we tried others,
whereas now we give up on the first failure. Either way, though,
the function returns -1, so the caller will know that something is
going wrong, and that not all pids were necessarily reaped. Besides,
there were other instances already in the code where failure in the
inner loop aborted the outer loop.
* src/util/cgroup.c (virCgroupKillInternal): Abort rather than
resuming loop on fscanf failure, and cleanup file on error.
This patch enables cgroup controllers as much as possible by skipping
the creation of blkio controller when running with old kernels that
doesn't support multi-level directory for blkio controller.
Signed-off-by: Hu Tao <hutao@cn.fujitsu.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
* Correct the documentation for cgroup: the swap_hard_limit indicates
mem+swap_hard_limit.
* Change cgroup private apis to: virCgroupGet/SetMemSwapHardLimit
Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
Adding audit points showed that we were granting too much privilege
to qemu; it should not need any mknod rights to recreate any
devices. On the other hand, lxc should have all device privileges.
The solution is adding a flag parameter.
This also lets us restrict write access to read-only disks.
* src/util/cgroup.h (virCgroup*Device*): Adjust prototypes.
* src/util/cgroup.c (virCgroupAllowDevice)
(virCgroupAllowDeviceMajor, virCgroupAllowDevicePath)
(virCgroupDenyDevice, virCgroupDenyDeviceMajor)
(virCgroupDenyDevicePath): Add parameter.
* src/qemu/qemu_driver.c (qemudDomainSaveFlag): Update clients.
* src/lxc/lxc_controller.c (lxcSetContainerResources): Likewise.
* src/qemu/qemu_cgroup.c: Likewise.
(qemuSetupDiskPathAllow): Also, honor read-only disks.
Although the cgroup device ACL controller path can be worked out
by researching the code, it is more efficient to include that
information directly in the audit message.
* src/util/cgroup.h (virCgroupPathOfController): New prototype.
* src/util/cgroup.c (virCgroupPathOfController): Export.
* src/libvirt_private.syms: Likewise.
* src/qemu/qemu_audit.c (qemuAuditCgroup): Use it.
On cygwin:
CC libvirt_util_la-cgroup.lo
util/cgroup.c: In function 'virCgroupKillRecursiveInternal':
util/cgroup.c:1458: warning: implicit declaration of function 'virCgroupNew' [-Wimplicit-function-declaration]
* src/util/cgroup.c (virCgroupKill): Don't build on platforms
where virCgroupNew is unsupported.
The kill() function doesn't exist on Win32, so it needs to be
checked for at build time & code disabled in cgroups
* configure.ac: Check for kill()
* src/util/cgroup.c: Stub out virCGroupKill* functions
when kill() isn't available
The virCgroupKill method kills all PIDs found in a cgroup
The virCgroupKillRecursively method does this recursively
for child cgroups.
The virCgroupKillPainfully method does a recursive kill
several times in a row until everything has really died
* src/util/cgroup.c (virCgroupSetValueStr, virCgroupGetValueStr)
(virCgroupRemoveRecursively): VIR_DEBUG can clobber errno.
(virCgroupRemove): Use VIR_DEBUG rather than DEBUG.
clang complained that STREQ(group->controllers[i].mountPoint,...) was
a NULL dereference when i==VIR_CGROUP_CONTROLLER_CPUSET, because it
assumes the worst about virCgroupPathOfController. Marking the
argument const doesn't yet have an effect, per this clang bug:
http://llvm.org/bugs/show_bug.cgi?id=7758
So, we use sa_assert, which was designed to shut up false positives
from tools like clang.
* src/util/cgroup.c (virCgroupMakeGroup): Teach clang that there
is no NULL dereference.
Display or set unlimited values for memory parameters. Unlimited is
represented by INT64_MAX in memory cgroup.
Signed-off-by: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
Reported-by: Justin Clift <jclift@redhat.com>
This patch adds a mode_t parameter to virFileWriteStr().
If mode is different from 0, virFileWriteStr() will try
to create the file if it doesn't exist.
* src/util/util.h (virFileWriteStr): Alter signature.
* src/util/util.c (virFileWriteStr): Allow file creation.
* src/network/bridge_driver.c (networkEnableIpForwarding)
(networkDisableIPV6): Adjust clients.
* src/node_device/node_device_driver.c
(nodeDeviceVportCreateDelete): Likewise.
* src/util/cgroup.c (virCgroupSetValueStr): Likewise.
* src/util/pci.c (pciBindDeviceToStub, pciUnBindDeviceFromStub):
Likewise.
Similarly to deprecating close(), I am now deprecating fclose() and
introduce VIR_FORCE_FCLOSE() and VIR_FCLOSE(). Also, fdopen() is replaced with
VIR_FDOPEN().
Most of the files are opened in read-only mode, so usage of
VIR_FORCE_CLOSE() seemed appropriate. Others that are opened in write
mode already had the fclose()< 0 check and I converted those to
VIR_FCLOSE()< 0.
I did not find occurrences of possible double-closed files on the way.
When we mount any cgroup without "-o devices", we will fail to start vms:
error: Failed to start domain vm1
error: Unable to deny all devices for vm1: No such file or directory
When we mount any cgroup without "-o cpu", we will fail to get schedinfo:
Scheduler : posix
error: unable to get cpu shares tunable: No such file or directory
We should only use the cgroup controllers which are mounted on host.
So I add virCgroupMounted() for qemuCgroupControllerActive()
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
As pointed out by Eric Blake, using dirent->d_type breaks
compilation on MinGW. This patch addresses this by using
'#if defined' as same as doing for virCgroupForDriver.
ENOENT happens normally when a subsystem is enabled with any other
subsystems and the directory of the target group has already removed
in a prior loop. In that case, the function should just return without
leaving an error message.
NB this is the same behavior as before introducing virCgroupRemoveRecursively.
When configuring serial, parallel, console or channel devices
with a file, dev or pipe backend type, it is necessary to label
the file path in the security drivers. For char devices of type
file, it is neccessary to pre-create (touch) the file if it does
not already exist since QEMU won't be allowed todo so itself.
dev/pipe configs already require the admin to pre-create before
starting the guest.
* src/qemu/qemu_security_dac.c: set file ownership for character
devices
* src/security/security_selinux.c: Set file labeling for character
devices
* src/qemu/qemu_driver.c: Add character devices to cgroup ACL
Through conversation with Kumar L Srikanth-B22348, I found
that the function of getting memory usage (e.g., virsh dominfo)
doesn't work for lxc with ns subsystem of cgroup enabled.
This is because of features of ns and memory subsystems.
Ns creates child cgroup on every process fork and as a result
processes in a container are not assigned in a cgroup for
domain (e.g., libvirt/lxc/test1/). For example, libvirt_lxc
and init (or somewhat specified in XML) are assigned into
libvirt/lxc/test1/8839/ and libvirt/lxc/test1/8839/8849/,
respectively. On the other hand, memory subsystem accounts
memory usage within a group of processes by default, i.e.,
it does not take any child (and descendant) groups into
account. With the two features, virsh dominfo which just
checks memory usage of a cgroup for domain always returns
zero because the cgroup has no process.
Setting memory.use_hierarchy of a group allows to account
(and limit) memory usage of every descendant groups of the group.
By setting it of a cgroup for domain, we can get proper memory
usage of lxc with ns subsystem enabled. (To be exact, the
setting is required only when memory and ns subsystems are
enabled at the same time, e.g., mount -t cgroup none /cgroup.)
As same as normal directories, a cgroup cannot be removed if it
contains sub groups. This patch changes virCgroupRemove to remove
all descendant groups (subdirectories) of a target group before
removing the target group.
The handling is required when we run lxc with ns subsystem of cgroup.
Ns subsystem automatically creates child cgroups on every process
forks, but unfortunately the groups are not removed on process exits,
so we have to remove them by ourselves.
With this patch, such child (and descendant) groups are surely removed
at lxc shutdown, i.e., lxcVmCleanup which calls virCgroupRemove.
The switch from %lli to %lld in virCgroupGetValueI64 is intended,
as virCgroupGetValueU64 uses base 10 too, and virCgroupSetValueI64
uses %lld to format the number to string.
Parsing is stricter now and doesn't accept trailing characters
after the actual value anymore.
Invoking virDomainSetMemory() on lxc driver results in libvirtd
segfault when cgroups has not been configured on the host.
Ensure driver->cgroup is non-null before invoking
virCgroupForDomain(). To prevent similar segfaults in the future,
ensure driver parameter to virCgroupForDomain() is non-null before
dereferencing.
When getting the driver/domain cgroup it is possible to specify
whether it should be auto created. If auto-creation was turned
off, libvirt still mistakenly created its own top level cgroup
* src/util/cgroup.c: Honour autocreate flag for top level cgroup