This is from a bug report and conversation on IRC, where Soren reported that while a filter update is occurring on one or more VMs (because a rule referenced by the filter was edited, for example), a deadlock can occur when a VM referencing that filter is started.
The problem is caused by the two locking sequences of
qemu_driver, qemu_domain, filter    # for the VM start operation
filter, qemu_driver, qemu_domain    # for the filter update operation
that obviously don't lock in the same order. The problem is the 2nd lock sequence: here the qemu_driver lock is grabbed in qemu_driver:qemudVMFilterRebuild().
The following solution is based on the idea of trying to re-arrange the 2nd sequence of locks as follows:
qemu_driver, filter, qemu_driver, qemu_domain
and making the qemu_driver recursively lockable so that a second lock can occur; this would then lead to the following net locking sequence
qemu_driver, filter, qemu_domain
where the 2nd qemu_driver lock has been (logically) eliminated.
The 2nd part of the idea is that the lock sequences (filter, qemu_domain) and (qemu_domain, filter) become interchangeable if all code paths where the filter AND the qemu_domain are locked have a preceding qemu_driver lock that blocks their concurrent execution.
So, the following code paths exist towards qemu_driver:qemudVMFilterRebuild(), where we now want to put a qemu_driver lock in front of the filter lock:
-> nwfilterUndefine()   [ locks the filter ]
     -> virNWFilterTestUnassignDef()
          -> virNWFilterTriggerVMFilterRebuild()
               -> qemudVMFilterRebuild()

-> nwfilterDefine()
     -> virNWFilterPoolAssignDef()   [ locks the filter ]
          -> virNWFilterTriggerVMFilterRebuild()
               -> qemudVMFilterRebuild()

-> nwfilterDriverReload()
     -> virNWFilterPoolLoadAllConfigs()
          -> virNWFilterPoolObjLoad()
               -> virNWFilterPoolAssignDef()   [ locks the filter ]
                    -> virNWFilterTriggerVMFilterRebuild()
                         -> qemudVMFilterRebuild()

-> nwfilterDriverStartup()
     -> virNWFilterPoolLoadAllConfigs()
          -> virNWFilterPoolObjLoad()
               -> virNWFilterPoolAssignDef()   [ locks the filter ]
                    -> virNWFilterTriggerVMFilterRebuild()
                         -> qemudVMFilterRebuild()
Qemu is not the only driver using the nwfilter driver; the UML driver calls into it as well. Therefore qemudVMFilterRebuild() can be exchanged with umlVMFilterRebuild(), and the qemu_driver lock can equally be a uml_driver lock. Further, since UML and QEMU domains can be running on the same machine, triggering a rebuild of a filter can touch both types of drivers and their domains.
In the patch below I am now extending each nwfilter callback driver with functions for locking and unlocking the (VM) driver (UML, QEMU) and introducing new functions for locking all registered callback drivers and unlocking them. Then I distribute the lock-all-cbdrivers/unlock-all-cbdrivers calls into the above call paths. The last shown call path, starting with nwfilterDriverStartup(), is problematic since the nwfilter driver is initialized before the QEMU and UML drivers are, so a lock in that path would mean attempting to lock a NULL pointer -- however, virNWFilterTriggerVMFilterRebuild() is never reached on that path, so we never lock either the qemu_driver or the uml_driver there. Therefore, only the first 3 paths now receive calls to lock and unlock all callback drivers. Now that the locks are distributed where it matters, I can remove the qemu_driver and uml_driver locks from qemudVMFilterRebuild() and umlVMFilterRebuild() and no longer require the recursive locks.
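As a rough sketch of the mechanism (names here are illustrative, not necessarily the exact ones from the patch), each callback driver registers lock/unlock functions for its VM driver, and the nwfilter code locks them all before taking the filter lock:

typedef struct _virNWFilterCallbackDriver virNWFilterCallbackDriver;
struct _virNWFilterCallbackDriver {
    const char *name;
    int  (*vmFilterRebuild)(virConnectPtr conn);  /* illustrative signature */
    void (*vmDriverLock)(void);    /* e.g. locks the qemu or uml driver */
    void (*vmDriverUnlock)(void);
};

#define MAX_CB_DRIVERS 10
static virNWFilterCallbackDriver *cbDrivers[MAX_CB_DRIVERS];
static int nCbDrivers;

void virNWFilterCallbackDriversLock(void)
{
    int i;
    for (i = 0; i < nCbDrivers; i++)
        cbDrivers[i]->vmDriverLock();
}

void virNWFilterCallbackDriversUnlock(void)
{
    int i;
    for (i = 0; i < nCbDrivers; i++)
        cbDrivers[i]->vmDriverUnlock();
}

/* nwfilterUndefine()/nwfilterDefine()/nwfilterDriverReload() then call
 * virNWFilterCallbackDriversLock() before locking the filter and
 * virNWFilterCallbackDriversUnlock() after releasing it. */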
For now I want to put this out as an RFC patch. I have tested it by 'stretching' the critical section after the define/undefine functions lock the filter, so that I can easily execute another VM operation (suspend, start) concurrently. That code is in this patch and you can deactivate it if you want. It seems to work fine and operations are blocked while the update is being done.
I still want to verify the other assumption above, that locking the filter and a qemu_domain always has a preceding qemu_driver lock.
Make use of the existing <filesystem> element to support plan9fs
filesystem passthrough in the QEMU driver:
<filesystem type='mount'>
<source dir='/export/to/guest'/>
<target dir='/import/from/host'/>
</filesystem>
NB, the target is not actually a directory; it is merely an arbitrary
string tag that is exported to the guest as a hint for where to mount
it.
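For illustration (not part of the patch text itself), such an element maps onto qemu's virtio-9p options roughly like this, and the guest then mounts the export by its tag:

-fsdev local,id=fsdev0,path=/export/to/guest
-device virtio-9p-pci,fsdev=fsdev0,mount_tag=/import/from/host

# inside the guest:
mount -t 9p -o trans=virtio /import/from/host /mnt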
Add parsing code for the memory tunables in the domain XML file;
also change the internal define structures used for domain memory
information.
Add a new specific test.
Public API to set/get the memory tunables supported by the hypervisors.
dv:
* some cleanups in libvirt.c
* adding extra checks to the new entry points in libvirt.c
v4:
* Move exporting public API to this patch
* Add unsigned int flags to the public api for future extensions
v3:
* Add domainGetMemoryParameters and NULL in all the driver interfaces
v2:
* Initialize domainSetMemoryParameters to NULL in all the driver
interface structures.
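For reference, a sketch of the two new entry points (signatures as proposed in this series; the final form may differ):

int virDomainSetMemoryParameters(virDomainPtr domain,
                                 virMemoryParameterPtr params,
                                 int nparams,
                                 unsigned int flags);

int virDomainGetMemoryParameters(virDomainPtr domain,
                                 virMemoryParameterPtr params,
                                 int *nparams,
                                 unsigned int flags);

/* Calling the getter with params == NULL typically serves to discover
 * how many parameters the hypervisor supports via *nparams. */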
A QEMU guest can have up to VIR_DOMAIN_BOOT_LAST boot entries
defined. When building the QEMU arg, each entry takes a
single byte. This means the array must be declared to be
VIR_DOMAIN_BOOT_LAST+1 bytes in length to allow for the
trailing NUL.
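In essence (a sketch of the relevant logic; each boot device becomes one letter of the -boot argument):

char boot[VIR_DOMAIN_BOOT_LAST+1]; /* was VIR_DOMAIN_BOOT_LAST: no room for NUL */
int i;

for (i = 0; i < def->os.nBootDevs; i++) {
    switch (def->os.bootDevs[i]) {
    case VIR_DOMAIN_BOOT_FLOPPY: boot[i] = 'a'; break;
    case VIR_DOMAIN_BOOT_DISK:   boot[i] = 'c'; break;
    case VIR_DOMAIN_BOOT_CDROM:  boot[i] = 'd'; break;
    case VIR_DOMAIN_BOOT_NET:    boot[i] = 'n'; break;
    }
}
boot[i] = '\0'; /* with nBootDevs == VIR_DOMAIN_BOOT_LAST this writes one
                   byte past the old array size: the off-by-one */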
* src/qemu/qemu_conf.c: Fix off-by-1 boot arg array size
QMP in QEMU 0.13 has been fixed to enforce type correctness,
this means that boolean types must be true or false, not
integers.
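For example, where a client could previously get away with sending integers for booleans (illustrative command, not one specific to this patch):

{ "execute": "migrate", "arguments": { "blk": 0, "inc": 0, "uri": "tcp:dest:4444" } }

QEMU 0.13 rejects this; the values must be real JSON booleans:

{ "execute": "migrate", "arguments": { "blk": false, "inc": false, "uri": "tcp:dest:4444" } }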
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Other drivers will need this same functionality, so move it up to
conf/domain_conf.c and give it a more general name.
Signed-off-by: Soren Hansen <soren@linux2go.dk>
Previously QEMU enabled KQEMU by default and had -no-kqemu.
0.11.x switched to requiring -enable-kqemu. 0.12.x dropped
kqemu entirely. This patch adds support for -enable-kqemu
so 0.11.x works. It replaces a huge set of if() with a
switch() to make the code a bit more readable.
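The version logic in essence -- a minimal sketch with illustrative flag names (the actual patch expresses this as a switch in qemu_conf.c), using the driver's major*1,000,000 + minor*1,000 + micro version encoding:

if (version < 11000) {
    /* pre-0.11: kqemu is enabled by default; -no-kqemu turns it off */
    flags |= QEMUD_CMD_FLAG_KQEMU;          /* illustrative flag name */
} else if (version < 12000) {
    /* 0.11.x: kqemu must be requested explicitly with -enable-kqemu */
    flags |= QEMUD_CMD_FLAG_ENABLE_KQEMU;   /* illustrative flag name */
}
/* 0.12 onwards: kqemu was dropped entirely, so no flag is set */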
* src/qemu/qemu_conf.c, src/qemu/qemu_conf.h: Support
-enable-kqemu
We already filled the PCI address structure when we checked whether it's
free or not, so let's just use the structure here instead of filling it
again.
The current version of the qemu managed save implementation
is subject to a race where the domain shuts down between
the time that we start the command and the time that we
actually try to do the save. Close this race by making
qemuDomainSaveFlags() expect both the driver and the passed-in
vm object to be locked before executing.
Signed-off-by: Chris Lalancette <clalance@redhat.com>
When reconnecting to existing VMs, we re-reserved only those PCI
addresses which were explicitly mentioned in domain XML. Since some
addresses are always reserved (e.g., 0:0:0 and 0:0:1), we need to handle
those too.
Also, all this should only be done if the -device flag is supported by qemu.
In this patch I am extending and fixing the nwfilter module's reload support to stop all ongoing threads (for learning IP addresses of interfaces) and rebuild the filtering rules of all interfaces of all VMs when libvirtd is started. Now libvirtd rebuilds the filters upon the SIGHUP signal and upon libvirtd restart.
About the patch: the nwfilter functions require a virConnectPtr, so I am opening a connection in qemudStartup(). The connection later needs to be closed outside the section where the driver lock is held, since otherwise we end up in a deadlock due to virConnectClose() trying to lock the driver as well.
I have tested this now for a while with several machines running and needing the IP address learner thread(s). The rebuilding of the firewall rules seems to work fine following libvirtd restart or a SIGHUP. Also the termination of libvirtd worked fine.
Since the qemu process is running as qemu:qemu, it can't actually
look at the unix socket in /var/run/libvirt/qemu which is owned by
root and has permission 700. Move the unix socket to
/var/lib/libvirt/qemu, which is already owned by qemu:qemu.
Thanks to Justin Clift for testing this out for me.
Signed-off-by: Chris Lalancette <clalance@redhat.com>
The problem is that on the source of the migration, libvirtd
is responsible for creating the unix socket over which the data
will flow. Since libvirtd is running as root, this file will
be created as root. When the qemu process running as qemu:qemu
goes to access the unix file to write data to it, it will get
permission denied and fail. Make sure to change the owner
of the unix file to qemu:qemu.
Thanks to Justin Clift for testing this patch out for me.
Signed-off-by: Chris Lalancette <clalance@redhat.com>
Basically a followup of the previous patch about balloon deactivation:
if deactivated, do not ask qemu for balloon information, as we will
just get an error back.
This can make a huge difference in the time needed for domain
information or list when a machine is loaded and the balloon has been
deactivated in the guests.
* src/qemu/qemu_driver.c: do not get the balloon info if the balloon
support is disabled
The balloon device is automatically added to qemu guests if supported,
but it may be useful to deactivate it. The simplest way to keep the
existing behaviour unchanged is to allow
<memballoon type="none"/>
as an extra option to deactivate it (the device is automatically added
if the memballoon construct is missing for the domain).
The following simple patch just adds the extra option; it does not
change the default behaviour but avoids creating a balloon device if
type="none" is used.
* docs/schemas/domain.rng: add the extra type attribute value
* src/conf/domain_conf.c src/conf/domain_conf.h: add the extra enum
value
* src/qemu/qemu_conf.c: if enum is NONE, don't activate the device,
i.e. don't pass the args to qemu/kvm
device_del command is not synchronous for PCI devices, it merely asks
the guest to release the device and returns. If the host wants to use
that device before the guest actually releases it, we are in big
trouble. To avoid this, we already added a loop which waits up to 10
seconds until the device is actually released before we do anything else
with that device. But we only added this loop for managed PCI devices,
before we try to reattach them to the host.
However, we need to wait even for non-managed devices. We don't reattach
them automatically, but we still want to prevent the host from using them.
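The loop in essence -- a minimal sketch with a hypothetical helper standing in for the real device-state check:

int retries = 100;

/* wait up to 10 seconds (100 x 100ms) for the guest to release the device */
while (pciDeviceInUseByGuest(dev) && retries) {   /* hypothetical helper */
    usleep(100 * 1000);
    retries--;
}
if (!retries) {
    /* the guest never released the device: report an error rather than
     * letting the host touch a device the guest still owns */
}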
This was revealed thanks to sVirt: when we relabel sysfs files
corresponding to the PCI device before the guest finished releasing the
device, qemu is no longer allowed to access those files and if it wants
(as a result of guest's request) to write anything to them, it just
exits, which kills the guest.
This is not a proper fix and needs some further work both on libvirt and
qemu side in the future.
Fix the error checking to use the return value from brAddTap() instead
of checking the current errno value, which might have been changed by
cleanup calls inside of brAddTap().
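A sketch of the corrected pattern (argument list abbreviated; see the real brAddTap() prototype in bridge.h):

int err = brAddTap(brctl, bridge, &ifname, &tapfd);   /* args abbreviated */
if (err) {
    /* use the returned errno value: the global errno may already have
     * been clobbered by cleanup calls inside brAddTap() */
    virReportSystemError(err,
                         _("Failed to add tap interface to bridge '%s'"),
                         bridge);
    goto error;
}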
Signed-off-by: Doug Goldstein <cardoe@gentoo.org>
Added a more detailed error message when adding a tap device fails and
the kernel is missing tun support.
Signed-off-by: Doug Goldstein <cardoe@gentoo.org>
The followup on the boot=on problem: basically it's not needed to
specify it when booting from IDE devices when using KVM.
* src/qemu/qemu_conf.c: do not use boot=on for IDE devices
* tests/qemuxml2argvdata/qemuxml2argv*.args: this changes the output
for 5 of the tests
Patch revamped by Eric Blake <eblake@redhat.com> from Jiri
Denemark's <jdenemar@redhat.com> original patch.
When attaching a PCI device which doesn't explicitly set its PCI
address, libvirt allocates the address automatically. The problem is
that when checking which PCI address is unused, we only check for those
with slot number higher than the highest slot number ever used.
Thus attaching/detaching such a device several times in a row (31 is the
theoretical limit; fewer than 30 tries are enough in practice) makes any
further device attachment fail. Furthermore, attaching a device with a
predefined PCI address of 0:0:31 immediately forbids attachment of any
PCI device without an explicit address.
This patch changes the logic so that we always check all PCI addresses
before we say there is no PCI address available.
Modifications from v1: revert back to remembering the last slot
reserved, but allow wraparound to not be limited by the end.
In this way, slots are still assigned in the same order as
before the patch, rather than filling in the gaps closest to
0 and risking making windows guests mad.
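A sketch of the resulting scan (field names illustrative): start just past the last reserved slot, wrap around at the top, and fail only once every slot has been tried:

int i = addrs->lastslot;                  /* illustrative field names */
for (;;) {
    if (++i > QEMU_PCI_ADDRESS_LAST_SLOT) /* wrap past the last slot */
        i = 0;
    if (!addrs->used[i]) {
        addrs->lastslot = i;              /* next search starts here */
        return i;                         /* free slot found */
    }
    if (i == addrs->lastslot)
        return -1;                        /* full cycle, nothing free */
}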
* src/qemu/qemu_conf.c: fix the PCI reservation code to do a round-robin
check of all available PCI slots before failing.
Basically the 'boot=on' boot selection option is something present in
KVM but not in upstream QEMU; as a result, if we boot a QEMU domain
without KVM acceleration we must disable boot=on ... even if the front
end kvm binary exposes that capability in the help page.
* src/qemu/qemu_conf.c: in qemudBuildCommandLine if -no-kvm
is passed, then deactivate QEMUD_CMD_FLAG_DRIVE_BOOT
ADD_ARG_LIT should only be used for literal arguments,
since it duplicates the memory. Since virBufferContentAndReset
is already allocating memory, we should only use ADD_ARG.
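The distinction in essence (sketch; the real macros live in qemu_conf.c):

/* ADD_ARG_LIT(x)  ~  argv[argc++] = strdup(x);  -- duplicates x      */
/* ADD_ARG(x)      ~  argv[argc++] = x;          -- takes ownership   */

char *opt = virBufferContentAndReset(&buf);   /* already heap-allocated */
ADD_ARG(opt);      /* correct: the argv takes ownership of opt */
/* ADD_ARG_LIT(opt) would strdup() it and leak the original */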
Signed-off-by: Chris Lalancette <clalance@redhat.com>
If detecting the FLR flag of a pci device fails, then we
could run into the situation of trying to close a file
descriptor twice, once in pciInitDevice() and once in pciFreeDevice().
Fix that by removing the pciCloseConfig() in pciInitDevice() and
just letting pciFreeDevice() handle it.
Thanks to Chris Wright for pointing out this problem.
While we are at it, fix an error check. While it would actually
work as-is (since success returns 0), it's still more clear to
check for < 0 (as the rest of the code does).
Signed-off-by: Chris Lalancette <clalance@redhat.com>
All <console> devices now export a <target> type attribute. QEMU defaults
to 'serial', UML defaults to 'uml', xen can be either 'serial' or 'xen'
depending on fullvirt. Understandably there is lots of test fallout.
This will be used to differentiate between a serial vs. virtio console for
QEMU.
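For example (illustrative XML), a QEMU guest console now carries the target type explicitly:

<console type='pty'>
  <target type='serial' port='0'/>
</console>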
Signed-off-by: Cole Robinson <crobinso@redhat.com>
targetType only tracks the actual <target> format we are parsing. Currently
we only fill in this value for channel devices.
Signed-off-by: Cole Robinson <crobinso@redhat.com>
There is actually a difference between the character device type (serial,
parallel, channel, ...) and the target type (virtio, guestfwd). Currently
they are awkwardly conflated.
Start to pull them apart by renaming targetType -> deviceType. This is
an entirely mechanical change.
Signed-off-by: Cole Robinson <crobinso@redhat.com>
QEMU has had two different syntaxes for its disk cache options:
Old: on|off
New: writeback|writethrough|none
QEMU recently added another 'unsafe' option, which broke the
libvirt check. We can avoid this and future breakage if we
do a negative check for the old syntax instead of a positive
check for the new syntax.
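Sketched against qemu's -help output (detection strings illustrative):

/* before: whitelist the new syntax -- breaks when qemu adds a mode */
if (strstr(help, "cache=writethrough|writeback|none"))
    flags |= QEMUD_CMD_FLAG_DRIVE_CACHE_V2;

/* after: test for the old on|off syntax instead -- future cache modes
 * such as 'unsafe' then pass through without a libvirt change */
if (!strstr(help, "cache=on|off"))
    flags |= QEMUD_CMD_FLAG_DRIVE_CACHE_V2;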
* src/qemu/qemu_conf.c: Invert cache option check
Add a new element to the <os> block:
<bootmenu enable="yes|no"/>
which maps to -boot menu=on|off on the QEMU command line.
I decided to use an explicit 'enable' attribute rather than just make the
bootmenu element boolean. This allows us to treat lack of a bootmenu element
as 'use hypervisor default'.
When doing a PCI secondary bus reset, we must be sure that there are no
active devices on the same bus segment. The active device tracking is
designed to only track host devices that are actively in use by guests.
This ignores host devices that are actively in use by the host. So the
current logic will reset host devices.
Switch this logic around and allow a secondary bus reset when we are
assigning all devices behind a bridge to the same guest, either at guest
startup or as a result of a single attach-device command.
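The interface change, sketched (see src/util/pci.h for the final form):

/* A secondary bus reset is allowed only if every other device on the
 * bus is either inactive or listed in inactiveDevs, i.e. it is about
 * to be assigned to the same guest anyway. */
int pciResetDevice(pciDevice *dev,
                   pciDeviceList *activeDevs,
                   pciDeviceList *inactiveDevs);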
* src/util/pci.h: change signature of pciResetDevice to add an
inactive devices list
* src/qemu/qemu_driver.c src/xen/xen_driver.c: use (or not) the new
functionality of pciResetDevice() depending on the place of use
* src/util/pci.c: implement the interface and logic changes
- src/qemu/qemu_driver.c: Eliminate code duplication by using the new
helpers qemuPrepareHostdevPCIDevices and qemuDomainReAttachHostdevDevices.
This reduces the number of open-coded calls to pciResetDevice().
- src/qemu/qemu_driver.c: These new helpers take hostdev list and count
directly rather than getting them indirectly from domain definition.
This will allow reuse for the attach-device case.
- src/qemu/qemu_driver.c: Update qemuGetPciHostDeviceList to take a
hostdev list and count directly, rather than getting this indirectly
from domain definition. This will allow reuse for the attach-device case.
Thanks to DV for knocking together the Relax-NG changes
quickly for me.
Changes since v1:
- Change the domain.rng to correspond to the new schema
- Don't allocate caps->ns in testQemuCapsInit since it is a static table
Changes since v2:
- Change domain.rng to add restrictions on allowed environment names
Changes since v3:
- Remove a bogus comment in the tests
Signed-off-by: Chris Lalancette <clalance@redhat.com>
Implement the qemu driver's virDomainQemuMonitorCommand
and hook it into the API entry point.
Changes since v1:
- Rename the (external) qemuMonitorCommand to qemuDomainMonitorCommand
- Add virCheckFlags to qemuDomainMonitorCommand
Changes since v2:
- Drop ATTRIBUTE_UNUSED from the flags
Changes since v3:
- Add a flag to priv so we only print out monitor command warning once. Note
that this has not been plumbed into qemuDomainObjPrivateXMLFormat or
qemuDomainObjPrivateXMLParse, which means that if you run a monitor command,
restart libvirtd, and then run another monitor command, you may get
an erroneous VIR_INFO. It's a pretty minor matter, and I didn't think it
warranted the additional code.
- Add BeginJob/EndJob calls around EnterMonitor/ExitMonitor
Signed-off-by: Chris Lalancette <clalance@redhat.com>