Two places would call to qemuPrepareCpumap() with priv->autoNodeset to
convert it to a cpuset. Remove the function and use the prepared cpuset
automatically.
When the default cpuset or automatic numa placement is used libvirt
would place the whole parent cgroup in the specified cpuset. This then
disallowed to re-pin the vcpus to a different cpu.
This patch pins only the vcpu threads to the default cpuset and thus
allows to re-pin them later.
The following config would fail to start:
<domain type='kvm'>
...
<vcpu placement='static' cpuset='0-1' current='2'>4</vcpu>
<cputune>
<vcpupin vcpu='0' cpuset='2-3'/>
...
This is a regression since a39f69d2b.
We've never set the cpuset.memory_migrate value to anything, keeping it
on default. However, we allow changing cpuset.mems on live domain.
That setting, however, don't have any consequence on a domain unless
it's going to allocate new memory.
I managed to make 'virsh numatune' move all the memory to any node I
wanted even without disabling libnuma's numa_set_membind(), so this
should be safe to use with it as well.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1198497
Signed-off-by: Martin Kletzander <mkletzan@redhat.com>
As pointed out by jtomko in his review of the IOThreads pinning code:
http://www.redhat.com/archives/libvir-list/2015-March/msg00495.html
there are some comments sprinkled in indicating IOThreads were using
the same structure as the VcpuPin code...
This is the first patch of a few that will change the virDomainVcpuPin*
structures and code to just virDomainPin* - starting with the data
structure naming...
There was a mess in the way how we store unlimited value for memory
limits and how we handled values provided by user. Internally there
were two possible ways how to store unlimited value: as 0 value or as
VIR_DOMAIN_MEMORY_PARAM_UNLIMITED. Because we chose to store memory
limits as unsigned long long, we cannot use -1 to represent unlimited.
It's much easier for us to say that everything greater than
VIR_DOMAIN_MEMORY_PARAM_UNLIMITED means unlimited and leave 0 as valid
value despite that it makes no sense to set limit to 0.
Remove unnecessary function virCompareLimitUlong. The update of test
is to prevent the 0 to be miss-used as unlimited in future.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1146539
Signed-off-by: Pavel Hrdina <phrdina@redhat.com>
If 'virNumaGetHostNodeset()' fails then the error path will try to free
uninitialized pointer mem_mask. Introduced by commit af2a1f058.
Signed-off-by: Pavel Hrdina <phrdina@redhat.com>
systemd-machined introduced a new method CreateMachineWithNetwork
that obsoletes CreateMachine. It expects to be given a list of
VETH/TAP device indexes for the host side device(s) associated
with a container/machine.
This falls back to the old CreateMachine method when the new
one is not supported.
Commit af2a1f05 tried clearly separating each condition in
qemuRestoreCgroupState() for the sake of readability, however somehow
one condition body was missing. That means that the body of the next
condition got executed only if both of there were true, which is
impossible, thus resulting in a dead code and a logic error.
Signed-off-by: Martin Kletzander <mkletzan@redhat.com>
Instead of setting the value of cpuset.mems once when the domain starts
and then re-calculating the value every time we need to change the child
cgroup values, leave the cgroup alone and rather set the child data
every time there is new cgroup created. We don't leave any task in the
parent group anyway. This will ease both current and future code.
Signed-off-by: Martin Kletzander <mkletzan@redhat.com>
If the memory mode is specified as 'strict' and with one node, we
get the following error when starting domain.
error: Unable to write to '$cgroup_path/cpuset.mems': Device or resource busy
XML is configured with numatune as follows:
<numatune>
<memory mode='strict' nodeset='0'/>
</numatune>
It's broken by Commit 411cea638f
which moved qemuSetupCgroupForEmulator() before setting cpuset.mems
in qemuSetupCgroupPostInit.
Directory '$cgroup_path/emulator/' is created in qemuSetupCgroupForEmulator.
But '$cgroup_path/emulator/cpuset.mems' it not set and has a default value
(all nodes, such as 0-1). Then we setup '$cgroup_path/cpuset.mems' to the
nodemask (in this case it's '0') in qemuSetupCgroupPostInit. It must fail.
This patch makes '$cgroup_path/emulator/cpuset.mems' is set before
'$cgroup_path/cpuset.mems'. The action is similar with that in
qemuDomainSetNumaParamsLive.
Signed-off-by: Wang Rui <moon.wangrui@huawei.com>
If the memory mode in numatune is specified as 'preferred' with one node
(such as nodeset='0'), domain's memory is not all in node 0 absolutely.
Assumption that node 0 doesn't have enough memory, memory can be allocated
on node 1 when qemu process startup. Then if we set cpuset.mems to '0',
it may invoke OOM.
Commit 1a7be8c600 changed the former logic of
checking memory mode in virDomainNumatuneGetNodeset. This patch adds the
check as before.
Signed-off-by: Wang Rui <moon.wangrui@huawei.com>
If we don't properly clean up all processes in the
machine-<vmname>.scope systemd won't remove the cgroup and subsequent vm
starts fail with
'CreateMachine: File exists'
Additional processes can e.g. be added via
echo $PID > /sys/fs/cgroup/systemd/machine.slice/machine-${VMNAME}.scope/tasks
but there are other cases like
http://bugs.debian.org/761521
Invoke TerminateMachine to be on the safe side since systemd tracks the
cgroup anyway. This is a noop if all processes have terminated already.
For the new VIR_DOMAIN_EVENT_ID_TUNABLE event we have a bunch of
constants added
VIR_DOMAIN_EVENT_CPUTUNE_<blah>
VIR_DOMAIN_EVENT_BLKDEVIOTUNE_<blah>
This naming convention is bad for two reasons
- There is no common prefix unique for the events to both
relate them, and distinguish them from other event
constants
- The values associated with the constants were chosen
to match the names used with virConnectGetAllDomainStats
so having EVENT in the constant name is not applicable in
that respect
This patch proposes renaming the constants to
VIR_DOMAIN_TUNABLE_CPU_<blah>
VIR_DOMAIN_TUNABLE_BLKDEV_<blah>
ie, given them a common VIR_DOMAIN_TUNABLE prefix.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Now we have universal tunable event so we can use it for reporting
changes to user. The cputune values will be prefixed with "cputune" to
distinguish it from other tunable events.
Signed-off-by: Pavel Hrdina <phrdina@redhat.com>
In order to support cpuset setting, introduce qemuSetupCgroupIOThreadsPin
and qemuSetupCgroupForIOThreads to mimic the existing Vcpu API's.
These will support having an 'iotrhreadpin' element in the 'cpuset' in
order to pin named IOThreads to specific CPU's. The IOThread pin names
will follow the IOThread naming scheme starting at 1 (eg "iothread1")
up through an including the def->iothreads value.
Libvirt documents that the default entropy source for the 'random'
backend of a RNG device is /dev/random. Instead of storing and
propagating NULL across our code and checking it in multiple places fill
the default in the post parse callback and use that in the other places.
The "random" backend for virtio-rng can be started with no path
specified which equals to /dev/random. The cgroup code didn't consider
this and called few of the functions with NULL resulting into:
$ virsh start rng-vm
error: Failed to start domain rng-vm
error: Path '(null)' is not accessible: Bad address
Problem introduced by commit c6320d3463
Create the structures and API's to hold and manage the iSCSI host device.
This extends the 'scsi_host' definitions added in commit id '5c811dce'.
A future patch will add the XML parsing, but that code requires some
infrastructure to be in place first in order to handle the differences
between a 'scsi_host' and an 'iSCSI host' device.
Split virDomainHostdevSubsysSCSI further. In preparation for having
either SCSI or iSCSI data, create a union in virDomainHostdevSubsysSCSI
to contain just a virDomainHostdevSubsysSCSIHost to describe the
'scsi_host' host device
When domain is started with numatune memory mode strict and the
nodeset does not include host NUMA node with DMA and DMA32 zones, KVM
initialization fails. This is because cgroup restrict even kernel
allocations. We are already doing numa_set_membind() which does the
same thing, only it does not restrict kernel allocations.
This patch leaves the userspace numa_set_membind() in place and moves
the cpuset.mems setting after the point where monitor comes up, but
before vcpu and emulator sub-groups are created.
Signed-off-by: Martin Kletzander <mkletzan@redhat.com>
There were numerous places where numatune configuration (and thus
domain config as well) was changed in different ways. On some
places this even resulted in persistent domain definition not to be
stable (it would change with daemon's restart).
In order to uniformly change how numatune config is dealt with, all
the internals are now accessible directly only in numatune_conf.c and
outside this file accessors must be used.
Signed-off-by: Martin Kletzander <mkletzan@redhat.com>
Since there was already public virDomainNumatune*, I changed the
private virNumaTune to match the same, so all the uses are unified and
public API is kept:
s/vir\(Domain\)\?Numa[tT]une/virDomainNumatune/g
then shrunk long lines, and mainly functions, that were created after
that:
sed -i 's/virDomainNumatuneMemPlacementMode/virDomainNumatunePlacement/g'
And to cope with the enum name, I haad to change the constants as
well:
s/VIR_NUMA_TUNE_MEM_PLACEMENT_MODE/VIR_DOMAIN_NUMATUNE_PLACEMENT/g
Last thing I did was at least a little shortening of already long
name:
s/virDomainNumatuneDef/virDomainNumatune/g
Signed-off-by: Martin Kletzander <mkletzan@redhat.com>
When creating cgroups for vcpu and emulator threads whilst starting a
domain, we explicitly skip creating those cgroups in case priv->cgroup
is NULL (cgroups not supported) because SetAffinity() serves the same
purpose. If the host supports only some cgroups (the ones we need are
either unmounted or disabled in qemu.conf), we error out with weird
message even though we could continue starting the domain.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1097028
Signed-off-by: Martin Kletzander <mkletzan@redhat.com>
Add functions that will allow to set all the required cgroup stuff on
individual images taking a virStorageSourcePtr. Also convert functions
designed to setup whole backing chain to take advantage of the change.
In the future we might need to track state of individual images. Move
the readonly and shared flags to the virStorageSource struct so that we
can keep them in a per-image basis.
In the f56c773bf we've made the substitution but forgot to fix one
comment which is still referring to the old name. This may be
potentially misleading.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Since it is an abbreviation, USB should always be fully
capitalized or full lower case, never Usb.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Since it is an abbreviation, SCSI should always be fully
capitalized or full lower case, never Scsi.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Currently, the Linux kernel treats values of '0' and '1' as
the minimum of 2. Values larger than the maximum are changed
to the maximum.
Re-reading the shares value after setting it reflects this in
the live domain XML.
Currently, <cputune><shares>0</shares></cputune> is treated
as if it were not specified.
Treat is as a valid value if it was explicitly specified
and write it to the cgroups.
Any source file which calls the logging APIs now needs
to have a VIR_LOG_INIT("source.name") declaration at
the start of the file. This provides a static variable
of the virLogSource type.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
To support passing the path of the test data to the utils, one
more argument is added to virSCSIDeviceGetSgName,
virSCSIDeviceGetDevName, and virSCSIDeviceNew, and the related
code is changed accordingly.
Later tests for the scsi utils will be based on this patch.
Signed-off-by: Osier Yang <jyang@redhat.com>
Creating a qemu VM with /dev/hwrng as backend RNG device throws the
following error - "Could not open '/dev/hwrng': Permission denied"
This patch fixes the issue
Signed-off-by: Pradipta Kr. Banerjee <bpradip@in.ibm.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Unlike the host devices of other types, SCSI host device XML supports
"shareable" tag. This patch introduces it for the virSCSIDevice struct
for a later patch use (to detect if the SCSI device is shareable when
preparing the SCSI host device in QEMU driver).
This patch introduces virCgroupSetBlkioDeviceReadIops,
virCgroupSetBlkioDeviceWriteIops,
virCgroupSetBlkioDeviceReadBps and
virCgroupSetBlkioDeviceWriteBps,
we can use these interfaces to set up throttle
blkio cgroup for domain.
This patch also adds the new throttle blkio cgroup
elements to the test xml.
Signed-off-by: Guan Qiang <hzguanqiang@corp.netease.com>
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Most of our code base uses space after comma but not before;
fix the remaining uses before adding a syntax check.
* src/qemu/qemu_cgroup.c: Consistently use commas.
* src/qemu/qemu_command.c: Likewise.
* src/qemu/qemu_conf.c: Likewise.
* src/qemu/qemu_driver.c: Likewise.
* src/qemu/qemu_monitor.c: Likewise.
Signed-off-by: Eric Blake <eblake@redhat.com>
On my machine, a guest fails to boot if it has a sound card, but not
graphical device/display is configured, because pulseaudio fails to
initialize since it can't access $HOME.
A workaround is removing the audio device, however on ARM boards there
isn't any option to do that, so -nographic always fails.
Set QEMU_AUDIO_DRV=none if no <graphics> are configured. Unfortunately
this has massive test suite fallout.
Add a qemu.conf parameter nographics_allow_host_audio, that if enabled
will pass through QEMU_AUDIO_DRV from sysconfig (similar to
vnc_allow_host_audio)
Since 16bcb3 we have a regression. The hard_limit is set
unconditionally. By default the limit is zero. Hence, if user hasn't
configured any, we set the zero in cgroup subsystem making the kernel
kill the corresponding qemu process immediately. The proper fix is to
set hard_limit iff user has configured any.
This function is to guess the correct limit for maximal memory
usage by qemu for given domain. This can never be guessed
correctly, not to mention all the pains and sleepless nights this
code has caused. Once somebody discovers algorithm to solve the
Halting Problem, we can compute the limit algorithmically. But
till then, this code should never see the light of the release
again.
If upgrading from a libvirt that is older than 1.0.5, we can
not assume that vm->def->resource is non-NULL. This bogus
assumption caused libvirtd to crash
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Make the virCgroupNewMachine method try to use systemd-machined
first. If that fails, then fallback to using the traditional
cgroup setup code path.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Instead of requiring drivers to use a combination of calls
to virCgroupNewDetect and virCgroupIsValidMachine, combine
the two into virCgroupNewDetectMachine
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Currently the QEMU driver creates the VM's cgroup prior to
forking, and then uses a virCommand hook to move the child
into the cgroup. This won't work with systemd whose APIs
do the creation of cgroups + attachment of processes atomically.
Fortunately we have a handshake taking place between the
QEMU driver and the child process prior to QEMU being exec()d,
which was introduced to allow setup of disk locking. By good
fortune this synchronization point can be used to enable the
QEMU driver to do atomic setup of cgroups removing the use
of the hook script.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Use the new virCgroupNewDetect function to determine cgroup
placement of existing running VMs. This will allow the legacy
cgroups creation APIs to be removed entirely
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Convert the remaining methods in vircgroup.c to report errors
instead of returning errno values.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Convert the type of loop iterators named 'i', 'j', k',
'ii', 'jj', 'kk', to be 'size_t' instead of 'int' or
'unsigned int', also santizing 'ii', 'jj', 'kk' to use
the normal 'i', 'j', 'k' naming
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
I realized after the fact that it's probably better in the long run to
give this function a name that matches the name of the link used in
sysfs to hold the group (iommu_group).
I'm changing it now because I'm about to add several more functions
that deal with iommu groups.
Change bbe97ae968 caused the
QEMU driver to ignore ENOENT errors from cgroups, in order
to cope with missing /proc/cgroups. This is not good though
because many other things can cause ENOENT and should not
be ignored. The callers expect to see ENXIO when cgroups
are not present, so adjust the code to report that errno
when /proc/cgroups is missing
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Found that I was unable to start existing domains after updating
to a kernel with no cgroups support
# zgrep CGROUP /proc/config.gz
# CONFIG_CGROUPS is not set
# virsh start test
error: Failed to start domain test
error: Unable to initialize /machine cgroup: Cannot allocate memory
virCgroupPartitionNeedsEscaping() correctly returns errno (ENOENT) when
attempting to open /proc/cgroups on such a system, but it was being
dropped in virCgroupSetPartitionSuffix().
Change virCgroupSetPartitionSuffix() to propagate errors returned by
its callees. Also check for ENOENT in qemuInitCgroup() when determining
if cgroups support is available.
This adds the scsi-generic device into the device controller's
whitelist, so that it's allowed to used by the qemu process.
Signed-off-by: Han Cheng <hanc.fnst@cn.fujitsu.com>
Signed-off-by: Osier Yang <jyang@redhat.com>
I must have looked at this a couple dozen times before I noticed it
had "!=" instead of "==". Not doing this setup prevented qemu from
doing anything with the vfio group device.
The source code base needs to be adapted as well. Some files
include virutil.h just for the string related functions (here,
the include is substituted to match the new file), some include
virutil.h without any need (here, the include is removed), and
some require both.
The USB-specific cgroup setup had been inserted inline in
qemuDomainAttachHostUsbDevice and qemuSetupCgroup, but now there is a
common cgroup setup function called for all hostdevs, so it makes sens
to put the usb-specific setup there and just rely on that function
being called.
The one thing I'm uncertain of here (and a reason for not pushing
until after release) is that previously hostdev->missing was checked
only when starting a domain (and cgroup setup for the device skipped
if missing was true), but with this consolidation, it is now checked
in the case of hotplug as well. I don't know if this will have any
practical effect (does it make sense to hotplug a "missing" usb
device?)
PCIO device assignment using VFIO requires read/write access by the
qemu process to /dev/vfio/vfio, and /dev/vfio/nn, where "nn" is the
VFIO group number that the assigned device belongs to (and can be
found with the function virPCIDeviceGetVFIOGroupDev)
/dev/vfio/vfio can be accessible to any guest without danger
(according to vfio developers), so it is added to the static ACL.
The group device must be dynamically added to the cgroup ACL for each
vfio hostdev in two places:
1) for any devices in the persistent config when the domain is started
(done during qemuSetupCgroup())
2) at device attach time for any hotplug devices (done in
qemuDomainAttachHostDevice)
The group device must be removed from the ACL when a device it
"hot-unplugged" (in qemuDomainDetachHostDevice())
Note that USB devices are already doing their own cgroup setup and
teardown in the hostdev-usb specific function. I chose to make the new
functions generic and call them in a common location though. We can
then move the USB-specific code (which is duplicated in two locations)
to this single location. I'll be posting a followup patch to do that.
The change in commit aed4986322
was incomplete, missing a couple of cases of /system. This
caused failure to start VMs.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
After discussions with systemd developers it was decided that
a better default policy for resource partitions is to have
3 default partitions at the top level
/system - system services
/machine - virtual machines / containers
/user - user login session
This ensures that the default policy isolates guest from
user login sessions & system services, so a mis-behaving
guest can't consume 100% of CPU usage if other things are
contending for it.
Thus we change the default partition from /system to
/machine
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
The virCgroupNewDriver method had a 'bool privileged' param.
If a false value was ever passed in, it would simply not
work, since non-root users don't have any privileges to create
new cgroups. Just delete this broken code entirely and make
the QEMU driver skip cgroup setup in non-privileged mode
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Historically QEMU/LXC guests have been placed in a cgroup layout
that is
$LOCATION-OF-LIBVIRTD/libvirt/{qemu,lxc}/$VMNAME
This is bad for a number of reasons
- The cgroup hierarchy gets very deep which seriously
impacts kernel performance due to cgroups scalability
limitations.
- It is hard to setup cgroup policies which apply across
services and virtual machines, since all VMs are underneath
the libvirtd service.
To address this the default cgroup location is changed to
be
/system/$VMNAME.{lxc,qemu}.libvirt
This puts virtual machines at the same level in the hierarchy
as system services, allowing consistent policy to be setup
across all of them.
This also honours the new resource partition location from the
XML configuration, for example
<resource>
<partition>/virtualmachines/production</partitions>
</resource>
will result in the VM being placed at
/virtualmachines/production/$VMNAME.{lxc,qemu}.libvirt
NB, with the exception of the default, /system, path which
is intended to always exist, libvirt will not attempt to
auto-create the partitions in the XML. It is the responsibility
of the admin/app to configure the partitions. Later libvirt
APIs will provide a way todo this.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
A resource partition is an absolute cgroup path, ignoring the
current process placement. Expose a virCgroupNewPartition API
for constructing such cgroups
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Rename all the virCgroupForXXX methods to use the form
virCgroupNewXXX since they are all constructors. Also
make sure the output parameter is the last one in the
list, and annotate all pointers as non-null. Fix up
all callers, and make sure they use true/false not 0/1
for the boolean parameters
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Instead of calling virCgroupForDomain every time we need
the virCgrouPtr instance, just do it once at Vm startup
and cache a reference to the object in qemuDomainObjPrivatePtr
until shutdown of the VM. Removing the virCgroupPtr from
the QEMU driver state also means we don't have stale mount
info, if someone mounts the cgroups filesystem after libvirtd
has been started
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Some refactoring for virDomainChrSourceDef type of devices so
we can use common code.
Signed-off-by: Stefan Berger <stefanb@linux.vnet.ibm.com>
Reviewed-by: Corey Bryant <coreyb@linux.vnet.ibm.com>
Tested-by: Corey Bryant <coreyb@linux.vnet.ibm.com>
The virCgroupMounted method is badly named, since a controller can be
mounted, but disabled in the current object. Rename the method to be
virCgroupHasController. Also make it tolerant to a NULL virCgroupPtr
and out-of-range controller index, to avoid duplication of these
checks in all callers
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Currently when getting an instance of virCgroupPtr we will
create the path in all cgroup controllers. Only at the virt
driver layer are we attempting to filter controllers. This
is bad because the mere act of creating the dirs in the
controllers can have a functional impact on the kernel,
particularly for performance.
Update the virCgroupForDriver() method to accept a bitmask
of controllers to use. Only create dirs in the controllers
that are requested. When creating cgroups for domains,
respect the active controller list from the parent cgroup
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Intend to reduce the redundant code,use virNumaSetupMemoryPolicy
to replace virLXCControllerSetupNUMAPolicy and
qemuProcessInitNumaMemoryPolicy.
This patch also moves the numa related codes to the
file virnuma.c and virnuma.h
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
The QEMU driver has a list of devices nodes that are whitelisted
for all guests. The kernel has recently started returning an
error if you try to whitelist a device which does not exist.
This causes a warning in libvirt logs and an audit error for
any missing devices. eg
2013-02-27 16:08:26.515+0000: 29625: warning : virDomainAuditCgroup:451 : success=no virt=kvm resrc=cgroup reason=allow vm="vm031714" uuid=9d8f1de0-44f4-a0b1-7d50-e41ee6cd897b cgroup="/sys/fs/cgroup/devices/libvirt/qemu/vm031714/" class=path path=/dev/kqemu rdev=? acl=rw
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
The code for putting the emulator threads in a separate cgroup
would spam the logs with warnings
2013-02-27 16:08:26.731+0000: 29624: warning : virCgroupMoveTask:887 : no vm cgroup in controller 3
2013-02-27 16:08:26.731+0000: 29624: warning : virCgroupMoveTask:887 : no vm cgroup in controller 4
2013-02-27 16:08:26.732+0000: 29624: warning : virCgroupMoveTask:887 : no vm cgroup in controller 6
This is because it has only created child cgroups for 3 of the
controllers, but was trying to move the processes from all the
controllers. The fix is to only try to move threads in the
controllers we actually created. Also remove the warning and
make it return a hard error to avoid such lazy callers in the
future.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
https://bugzilla.redhat.com/show_bug.cgi?id=896685 points out
a regression caused by commit 38c4a9c - libvirt only labels
the backing chain if the backing chain cache is populated, but
the code to populate the cache was only conditionally performed
if cgroup labeling was necessary.
* src/qemu/qemu_cgroup.c (qemuSetupCgroup): Hoist cache setup...
* src/qemu/qemu_process.c (qemuProcessStart): ...earlier into
caller, where it is now unconditional.
When iterating over USB host devices to setup cgroups, the
usbDevice object was leaked in both LXC and QEMU driers
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Currently the virQEMUDriverPtr struct contains an wide variety
of data with varying access needs. Move all the static config
data into a dedicated virQEMUDriverConfigPtr object. The only
locking requirement is to hold the driver lock, while obtaining
an instance of virQEMUDriverConfigPtr. Once a reference is held
on the config object, it can be used completely lockless since
it is immutable.
NB, not all APIs correctly hold the driver lock while getting
a reference to the config object in this patch. This is safe
for now since the config is never updated on the fly. Later
patches will address this fully.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
While OOM can have knock-on effects that trash a system, generally
the first symptom is one of memory thrashing.
* src/qemu/qemu_cgroup.c (qemuSetupCgroup): Reword slightly.
Currently, if there's no hard memory limit defined for a domain,
libvirt tries to calculate one, based on domain definition and magic
equation and set it upon the domain startup. The rationale behind was,
if there's a memory leak or exploit in qemu, we should prevent the
host system trashing. However, the equation was too tightening, as it
didn't reflect what the kernel counts into the memory used by a
process. Since many hosts do have a swap, nobody hasn't noticed
anything, because if hard memory limit is reached, process can
continue allocating memory on a swap. However, if there is no swap on
the host, the process gets killed by OOM killer. In our case, the qemu
process it is.
To prevent this, we need to relax the hard RSS limit. Moreover, we
should reflect more precisely the kernel way of accounting the memory
for process. That is, even the kernel caches are counted within the
memory used by a process (within cgroups at least). Hence the magic
equation has to be changed:
limit = 1.5 * (domain memory + total video memory) + (32MB for cache
per each disk) + 200MB
To bring in line with new naming practice, rename the=
src/util/cgroup.{h,c} files to vircgroup.{h,c}
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
When LXC labels USB devices during hotplug, it is running in
host context, so it needs to pass in a vroot path to the
container root.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Remove the obsolete 'qemud' naming prefix and underscore
based type name. Introduce virQEMUDriverPtr as the replacement,
in common with LXC driver naming style
The libvirt coding standard is to use 'function(...args...)'
instead of 'function (...args...)'. A non-trivial number of
places did not follow this rule and are fixed in this patch.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
When the cpu placement model is "auto", it sets the affinity for
domain process with the advisory nodeset from numad, however,
creating cgroup for the domain process (called emulator thread
in some contexts) later overrides that with pinning it to all
available pCPUs.
How to reproduce:
* Configure the domain with "auto" placement for <vcpu>, e.g.
<vcpu placement='auto'>4</vcpu>
* % virsh start dom
* % cat /proc/$dompid/status
Though the emulator cgroup cause conflicts, but we can't simply
prohibit creating it, as other tunables are still useful, such
as "emulator_period", which is used by API
virDomainSetSchedulerParameter. So this patch doesn't prohibit
creating the emulator cgroup, but inherit the nodeset from numad,
and reset the affinity for domain process.
* src/qemu/qemu_cgroup.h: Modify definition of qemuSetupCgroupForEmulator
to accept the passed nodenet
* src/qemu/qemu_cgroup.c: Set the affinity with the passed nodeset
We used to walk the backing file chain at least twice per disk,
once to set up cgroup device whitelisting, and once to set up
security labeling. Rather than walk the chain every iteration,
which possibly includes calls to fork() in order to open root-squashed
NFS files, we can exploit the cache of the previous patch.
* src/conf/domain_conf.h (virDomainDiskDefForeachPath): Alter
signature.
* src/conf/domain_conf.c (virDomainDiskDefForeachPath): Require caller
to supply backing chain via disk, if recursion is desired.
* src/security/security_dac.c
(virSecurityDACSetSecurityImageLabel): Adjust caller.
* src/security/security_selinux.c
(virSecuritySELinuxSetSecurityImageLabel): Likewise.
* src/security/virt-aa-helper.c (get_files): Likewise.
* src/qemu/qemu_cgroup.c (qemuSetupDiskCgroup)
(qemuTeardownDiskCgroup): Likewise.
(qemuSetupCgroup): Pre-populate chain.
According to our recent changes (clarifications), we should be pinning
qemu's emulator processes using the <vcpu> 'cpuset' attribute in case
there is no <emulatorpin> specified. This however doesn't work
entirely as expected and this patch should resolve all the remaining
issues.
https://www.gnu.org/licenses/gpl-howto.html recommends that
the 'If not, see <url>.' phrase be a separate sentence.
* tests/securityselinuxhelper.c: Remove doubled line.
* tests/securityselinuxtest.c: Likewise.
* globally: s/; If/. If/
This is another fix for the emulator-pin series. When going through
the cputune pinning settings, the current code is trying to pin all
the CPUs, even when not all of them are specified. This causes error
in the subsequent function which, of course, cannot find the cpu to
pin. Since it's enough to pass the correct VCPU ID to the function,
the fix is trivial.
When domain XML contains any of the elements for setting up CPU
scheduling parameters (period, quota, emulator_period, or
emulator_quota) we need cpu cgroup to enforce the configuration.
However, the existing code would just ignore silently such settings if
either cgroups were not available at all cpu cgroup was not available.
Moreover, APIs for manipulating CPU scheduler parameters were already
failing if cpu cgroup was not available. This patch makes cpu cgroup
mandatory for all domains that use CPU scheduling elements in their XML.
If cgroups are enabled in general but cpu cgroup is disabled in
qemu.conf or not mounted at all, libvirt would refuse to start any
domain even though scheduler parameters are not set in domain XML.
This patch makes cpu cgroup mandatory only for domains that actually
want to use it.
Commit 4b03d59167 changed the pinning
behavior in a way that makes some machines non-startable.
The comment mentioning that we cannot control each vcpu when there is
not VCPU<-> PID mapping available is true, however, this isn't
necessarily an error, because this can be caused by old QEMU without
support for "query-cpus" command as well as a software emulated
machines that don't create more than one process.
This patch introduces support of setting emulator's period and
quota to limit cpu bandwidth when the vm starts. Also updates
XML Schema for new entries and docs.
This patch changes the behaviour of xml element cputune.period
and cputune.quota to limit cpu bandwidth only for vcpus, and no
longer limit cpu bandwidth for the whole guest.
The reasons to do this are:
- This matches docs of cputune.period and cputune.quota.
- The other parts excepting vcpus are treated as "emulator",
and there are separate period/quota settings for emulator
in the subsequent patches
Introduce qemuSetupCgroupEmulatorPin() function to add emulator
threads pin info to cpuset cgroup, the same as vcpupin.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Hu Tao <hutao@cn.fujitsu.com>
vcpu threads pin are implemented using sched_setaffinity(), but
not controlled by cgroup. This patch does the following things:
1) enable cpuset cgroup
2) reflect all the vcpu threads pin info to cgroup
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Hu Tao <hutao@cn.fujitsu.com>
Create a new cgroup and move all emulator threads to the new cgroup.
And then we can do the other things:
1. limit only vcpu usage rather than the whole qemu
2. limit for emulator threads(include vhost-net threads)
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Hu Tao <hutao@cn.fujitsu.com>
If there's a memory leak in qemu or qemu is exploited the host's
system will sooner or later start trashing instead of killing
the bad process. This however has impact on performance and other
guests as well. Therefore we should set a reasonable RSS limit
even when user hasn't set any. It's better to be secure by default.
Any time we have a string with no % passed through gettext, a
translator can inject a % to cause a stack overread. When there
is nothing to format, it's easier to ask for a string that cannot
be used as a formatter, by using a trivial "%s" format instead.
In the past, we have used --disable-nls to catch some of the
offenders, but that doesn't get run very often, and many more
uses have crept in. Syntax check to the rescue!
The syntax check can catch uses such as
virReportError(code,
_("split "
"string"));
by using a sed script to fold context lines into one pattern
space before checking for a string without %.
This patch is just mechanical insertion of %s; there are probably
several messages touched by this patch where we would be better
off giving the user more information than a fixed string.
* cfg.mk (sc_prohibit_diagnostic_without_format): New rule.
* src/datatypes.c (virUnrefConnect, virGetDomain)
(virUnrefDomain, virGetNetwork, virUnrefNetwork, virGetInterface)
(virUnrefInterface, virGetStoragePool, virUnrefStoragePool)
(virGetStorageVol, virUnrefStorageVol, virGetNodeDevice)
(virGetSecret, virUnrefSecret, virGetNWFilter, virUnrefNWFilter)
(virGetDomainSnapshot, virUnrefDomainSnapshot): Add %s wrapper.
* src/lxc/lxc_driver.c (lxcDomainSetBlkioParameters)
(lxcDomainGetBlkioParameters): Likewise.
* src/conf/domain_conf.c (virSecurityDeviceLabelDefParseXML)
(virDomainDiskDefParseXML, virDomainGraphicsDefParseXML):
Likewise.
* src/conf/network_conf.c (virNetworkDNSHostsDefParseXML)
(virNetworkDefParseXML): Likewise.
* src/conf/nwfilter_conf.c (virNWFilterIsValidChainName):
Likewise.
* src/conf/nwfilter_params.c (virNWFilterVarValueCreateSimple)
(virNWFilterVarAccessParse): Likewise.
* src/libvirt.c (virDomainSave, virDomainSaveFlags)
(virDomainRestore, virDomainRestoreFlags)
(virDomainSaveImageGetXMLDesc, virDomainSaveImageDefineXML)
(virDomainCoreDump, virDomainGetXMLDesc)
(virDomainMigrateVersion1, virDomainMigrateVersion2)
(virDomainMigrateVersion3, virDomainMigrate, virDomainMigrate2)
(virStreamSendAll, virStreamRecvAll)
(virDomainSnapshotGetXMLDesc): Likewise.
* src/nwfilter/nwfilter_dhcpsnoop.c (virNWFilterSnoopReqLeaseDel)
(virNWFilterDHCPSnoopReq): Likewise.
* src/openvz/openvz_driver.c (openvzUpdateDevice): Likewise.
* src/openvz/openvz_util.c (openvzKBPerPages): Likewise.
* src/qemu/qemu_cgroup.c (qemuSetupCgroup): Likewise.
* src/qemu/qemu_command.c (qemuBuildHubDevStr, qemuBuildChrChardevStr)
(qemuBuildCommandLine): Likewise.
* src/qemu/qemu_driver.c (qemuDomainGetPercpuStats): Likewise.
* src/qemu/qemu_hotplug.c (qemuDomainAttachNetDevice): Likewise.
* src/rpc/virnetsaslcontext.c (virNetSASLSessionGetIdentity):
Likewise.
* src/rpc/virnetsocket.c (virNetSocketNewConnectUNIX)
(virNetSocketSendFD, virNetSocketRecvFD): Likewise.
* src/storage/storage_backend_disk.c
(virStorageBackendDiskBuildPool): Likewise.
* src/storage/storage_backend_fs.c
(virStorageBackendFileSystemProbe)
(virStorageBackendFileSystemBuild): Likewise.
* src/storage/storage_backend_rbd.c
(virStorageBackendRBDOpenRADOSConn): Likewise.
* src/storage/storage_driver.c (storageVolumeResize): Likewise.
* src/test/test_driver.c (testInterfaceChangeBegin)
(testInterfaceChangeCommit, testInterfaceChangeRollback):
Likewise.
* src/vbox/vbox_tmpl.c (vboxListAllDomains): Likewise.
* src/xenxs/xen_sxpr.c (xenFormatSxprDisk, xenFormatSxpr):
Likewise.
* src/xenxs/xen_xm.c (xenXMConfigGetUUID, xenFormatXMDisk)
(xenFormatXM): Likewise.
Per the FSF address could be changed from time to time, and GNU
recommends the following now: (http://www.gnu.org/licenses/gpl-howto.html)
You should have received a copy of the GNU General Public License
along with Foobar. If not, see <http://www.gnu.org/licenses/>.
This patch removes the explicit FSF address, and uses above instead
(of course, with inserting 'Lesser' before 'General').
Except a bunch of files for security driver, all others are changed
automatically, the copyright for securify files are not complete,
that's why to do it manually:
src/security/security_selinux.h
src/security/security_driver.h
src/security/security_selinux.c
src/security/security_apparmor.h
src/security/security_apparmor.c
src/security/security_driver.c
The only useful translation of "%s" as a format string is "%s" (I
suppose you could claim "%1$s" is also valid, but why bother). So
it is not worth translating; fixing this exposes some instances
where we were failing to translate real error messages. This makes
the fix of commit 097da1ab more generic, as well as ensuring no
future regressions.
* cfg.mk (sc_prohibit_useless_translation): New rule.
* src/lxc/lxc_driver.c (lxcSetVcpuBWLive): Fix offender.
* src/openvz/openvz_conf.c (openvzReadFSConf): Likewise.
* src/qemu/qemu_cgroup.c (qemuSetupCgroupForVcpu): Likewise.
* src/qemu/qemu_driver.c (qemuSetVcpusBWLive): Likewise.
* src/xenapi/xenapi_utils.c (xenapiSessionErrorHandle): Likewise.
Like for 'static' placement, when the memory policy mode is
'strict', set the memory policy by writing the advisory nodeset
returned from numad to cgroup file cpuset.mems,