If you try to execute two concurrent migrations p2p
from A->B and B->A, the two libvirtd's will deadlock
trying to perform the migrations. The reason for this is
that in p2p migration, the libvirtd's are responsible for
making the RPC Prepare, Migrate, and Finish calls. However,
they are currently holding the driver lock while doing so,
which basically guarantees deadlock in this scenario.
This patch fixes the situation by adding
qemuDomainObjEnterRemoteWithDriver and
qemuDomainObjExitRemoteWithDriver helper methods. The Enter
take an additional object reference, then drops both the
domain object lock and the driver lock. The Exit takes
both the driver and domain object lock, then drops the
reference. Adding calls to these Enter and Exit helpers
around remote calls in the various migration methods
seems to fix the problem for me in testing.
This should make the situation safe. The additional domain
object reference ensures that the domain object won't disappear
while this operation is happening. The BeginJob that is called
inside of qemudDomainMigratePerform ensures that we can't execute a
second migrate (or shutdown, or save, etc) job while the
migration is active. Finally, the additional check on the state
of the vm after we reacquire the locks ensures that we can't
be surprised by an external event (domain crash, etc).
Signed-off-by: Chris Lalancette <clalance@redhat.com>
Originally the storage volume files were opened with O_DSYNC to make
sure they were flushed to disk immediately. It turned out that this
was extremely slow in some cases, so the O_DSYNC was removed in favor
of just calling fsync() after all the data had been written. However,
this call to fsync was inside the block that is executed to zero-fill
the end of the volume file. In cases where the new volume is copied
from an old volume, and they are the same length, this fsync would
never take place.
Now the fsync is *always* done, unless there is an error (in which
case it isn't important, and is most likely inappropriate.
A missing set of braces around an error condition caused us to skip
zero'ing out the remainder of a new volume file if the new volume was
longer than the original (the goto was supposed to be taken only in
the case of error, but was always being taken).
The storage volume lookup code was probing for the backing store
format, instead of using the format extracted from the file
itself. This meant it could report in accurate information. If
a format is included in the file, then use that in preference,
with probing as a fallback.
* src/storage/storage_backend_fs.c: Use extracted backing store
format
When creating qcow2 files with a backing store, it is important
to set an explicit format to prevent QEMU probing. The storage
backend was only doing this if it found a 'kvm-img' binary. This
is wrong because plenty of kvm-img binaries don't support an
explicit format, and plenty of 'qemu-img' binaries do support
a format. The result was that most qcow2 files were not getting
a backing store format.
This patch runs 'qemu-img -h' to check for the two support
argument formats
'-o backing_format=raw'
'-F raw'
and use whichever option it finds
* src/storage/storage_backend.c: Query binary to determine
how to set the backing store format
Record a default driver name/type in capabilities struct. Use this
when parsing disks if value is not set in XML config.
* src/conf/capabilities.h: Record default driver name/type for disks
* src/conf/domain_conf.c: Fallback to default driver name/type
when parsing disks
* src/qemu/qemu_driver.c: Set default driver name/type to raw
Disk format probing is now disabled by default. A new config
option in /etc/qemu/qemu.conf will re-enable it for existing
deployments where this causes trouble
The implementation of security driver callbacks often needs
to access the security driver object. Currently only a handful
of callbacks include the driver object as a parameter. Later
patches require this is many more places.
* src/qemu/qemu_driver.c: Pass in the security driver object
to all callbacks
* src/qemu/qemu_security_dac.c, src/qemu/qemu_security_stacked.c,
src/security/security_apparmor.c, src/security/security_driver.h,
src/security/security_selinux.c: Add a virSecurityDriverPtr
param to all security callbacks
Update the QEMU cgroups code, QEMU DAC security driver, SELinux
and AppArmour security drivers over to use the shared helper API
virDomainDiskDefForeachPath().
* src/qemu/qemu_driver.c, src/qemu/qemu_security_dac.c,
src/security/security_selinux.c, src/security/virt-aa-helper.c:
Convert over to use virDomainDiskDefForeachPath()
There is duplicated code which iterates over disk backing stores
performing some action. Provide a convenient helper for doing
this to eliminate duplication & risk of mistakes with disk format
probing
* src/conf/domain_conf.c, src/conf/domain_conf.h,
src/libvirt_private.syms: Add virDomainDiskDefForeachPath()
Require the disk image to be passed into virStorageFileGetMetadata.
If this is set to VIR_STORAGE_FILE_AUTO, then the format will be
resolved using probing. This makes it easier to control when
probing will be used
* src/qemu/qemu_driver.c, src/qemu/qemu_security_dac.c,
src/security/security_selinux.c, src/security/virt-aa-helper.c:
Set VIR_STORAGE_FILE_AUTO when calling virStorageFileGetMetadata.
* src/storage/storage_backend_fs.c: Probe for disk format before
calling virStorageFileGetMetadata.
* src/util/storage_file.h, src/util/storage_file.c: Remove format
from virStorageFileMeta struct & require it to be passed into
method.
The virStorageFileGetMetadataFromFD did two jobs in one. First
it probed for storage type, then it extracted metadata for the
type. It is desirable to be able to separate these jobs, allowing
probing without querying metadata, and querying metadata without
probing.
To prepare for this, split out probing code into a new pair of
methods
virStorageFileProbeFormatFromFD
virStorageFileProbeFormat
* src/util/storage_file.c, src/util/storage_file.h,
src/libvirt_private.syms: Introduce virStorageFileProbeFormat
and virStorageFileProbeFormatFromFD
Instead of including a field in FileTypeInfo struct for the
disk format, rely on the array index matching the format.
Use verify() to assert the correct number of elements in the
array.
* src/util/storage_file.c: remove type field from FileTypeInfo
When QEMU opens a backing store for a QCow2 file, it will
normally auto-probe for the format of the backing store,
rather than assuming it has the same format as the referencing
file. There is a QCow2 extension that allows an explicit format
for the backing store to be embedded in the referencing file.
This closes the auto-probing security hole in QEMU.
This backing store format can be useful for libvirt users
of virStorageFileGetMetadata, so extract this data and report
it.
QEMU does not require disk image backing store files to be in
the same format the file linkee. It will auto-probe the disk
format for the backing store when opening it. If the backing
store was intended to be a raw file this could be a security
hole, because a guest may have written data into its disk that
then makes the backing store look like a qcow2 file. If it can
trick QEMU into thinking the raw file is a qcow2 file, it can
access arbitrary files on the host by adding further backing
store links.
To address this, callers of virStorageFileGetMeta need to be
told of the backing store format. If no format is declared,
they can make a decision whether to allow format probing or
not.
IPtables will seek to preserve the source port unchanged when
doing masquerading, if possible. NFS has a pseudo-security
option where it checks for the source port <= 1023 before
allowing a mount request. If an admin has used this to make the
host OS trusted for mounts, the default iptables behaviour will
potentially allow NAT'd guests access too. This needs to be
stopped.
With this change, the iptables -t nat -L -n -v rules for the
default network will be
Chain POSTROUTING (policy ACCEPT 95 packets, 9163 bytes)
pkts bytes target prot opt in out source destination
14 840 MASQUERADE tcp -- * * 192.168.122.0/24 !192.168.122.0/24 masq ports: 1024-65535
75 5752 MASQUERADE udp -- * * 192.168.122.0/24 !192.168.122.0/24 masq ports: 1024-65535
0 0 MASQUERADE all -- * * 192.168.122.0/24 !192.168.122.0/24
* src/network/bridge_driver.c: Add masquerade rules for TCP
and UDP protocols
* src/util/iptables.c, src/util/iptables.c: Add source port
mappings for TCP & UDP protocols when masquerading.
There are many naming conventions for partitions associated with a
block device. Some of the major ones are:
/dev/foo -> /dev/foo1
/dev/foo1 -> /dev/foo1p1
/dev/mapper/foo -> /dev/mapper/foop1
/dev/disk/by-path/foo -> /dev/disk/by-path/foo-part1
The universe of possible conventions isn't clear. Rather than trying
to understand all possible conventions, this patch divides devices
into two groups, device mapper devices and everything else. Device
mapper devices seem always to follow the convention of device ->
devicep1; everything else is canonicalized.
* src/uml/uml_driver.c (umlMonitorCommand): Correct flaw that would
cause unconditional "incomplete reply ..." failure, since "nbytes"
was always 0 or 1.
* src/qemu/qemu_driver.c (qemuConnectMonitor): Correct erroneous
parenthesization in two expressions. Without this fix, failure
to set or clear SELinux security context in the monitor would go
undiagnosed. Also correct a diagnostic and split some long lines.
When comparing a CPU without <model> element, such as
<cpu>
<topology sockets='1' cores='1' threads='1'/>
</cpu>
libvirt would happily crash without warning.
When autodetecting whether XML describes guest or host CPU, the presence
of <arch> element is checked. If it's present, we treat the XML as host
CPU definition. Which is right, since guest CPU definitions do not
contain <arch> element. However, if at the same time the root <cpu>
element contains `match' attribute, we would silently ignore it and
still treat the XML as host CPU. We should rather refuse such invalid
XML.
When a CPU to be compared with host CPU describes a host CPU instead of
a guest CPU, the result is incorrect. This is because instead of
treating additional features in host CPU description as required, they
were treated as if they were mentioned with all possible policies at the
same time.
In case qemu supports -nodefconfig, libvirt adds uses it when launching
new guests. Since this option may affect CPU models supported by qemu,
we need to use it when probing for available models.
An indentation mistake meant that a check for return status
was not properly performed in all cases. This could result
in a crash on NULL pointer in a following line.
* src/qemu/qemu_monitor_json.c: Fix check for return status
when processing JSON for blockstats
By specifying <vendor> element in CPU requirements a guest can be
restricted to run only on CPUs by a given vendor. Host CPU vendor is
also specified in capabilities XML.
The vendor is checked when migrating a guest but it's not forced, i.e.,
guests configured without <vendor> element can be freely migrated.
All features in the baseline CPU definition were always created with
policy='require' even though an arch driver returned them with different
policy settings.
This allows the user to give an explicit path to configure
./configure --with-vbox=/path/to/virtualbox
instead of having the VirtualBox driver probe a set of possible
paths at runtime. If no explicit path is specified then configure
probes the set of "known" paths.
https://bugzilla.redhat.com/show_bug.cgi?id=609185
We only use libpciaccess for resolving device product/vendor. If
initializing the library fails (say if using qemu:///session), don't
warn so loudly, and carry on as usual.
Any error message raised after the process has forked needs
to be followed by virDispatchError, otherwise we have no chance of
ever seeing it. This was selectively done for hook functions in the past,
but really applies to all post-fork errors.
As pointed out by Eric Blake, using dirent->d_type breaks
compilation on MinGW. This patch addresses this by using
'#if defined' as same as doing for virCgroupForDriver.
Some, but not all, codepaths in the qemuMonitorOpen() method
would trigger the destroy callback. The caller does not expect
this to be invoked if construction fails, only during normal
release of the monitor. This resulted in a possible double-unref
of the virDomainObjPtr, because the caller explicitly unrefs
the virDomainObjPtr if qemuMonitorOpen() fails
* src/qemu/qemu_monitor.c: Don't invoke destroy callback from
qemuMonitorOpen() failure paths
ENOENT happens normally when a subsystem is enabled with any other
subsystems and the directory of the target group has already removed
in a prior loop. In that case, the function should just return without
leaving an error message.
NB this is the same behavior as before introducing virCgroupRemoveRecursively.
Make sure to *not* call qemuDomainPCIAddressReleaseAddr if
QEMUD_CMD_FLAG_DEVICE is *not* set (for older qemu). This
prevents a crash when trying to do device detachment from
a qemu guest.
Signed-off-by: Chris Lalancette <clalance@redhat.com>
In the current libvirt PCI code, there is no checking whether
a PCI device is in use by a guest when doing node device
detach or reattach. This causes problems when a device is
assigned to a guest, and the administrator starts issuing
nodedevice commands. Make it so that we check the list
of active devices when trying to detach/reattach, and only
allow the operation if the device is not assigned to a guest.
Signed-off-by: Chris Lalancette <clalance@redhat.com>
Fix regression introduced in commit a4a287242 - basically, the
phyp storage driver should only accept the same URIs that the
main phyp driver is willing to accept. Blindly accepting all
URIs meant that the phyp storage driver was being consulted for
'virsh -c qemu:///session pool-list --all', rather than the
qemu storage driver, then since the URI was not for phyp, attempts
to then use the phyp driver crashed because it was not initialized.
* src/phyp/phyp_driver.c (phypStorageOpen): Only accept connections
already open to a phyp driver.
This code was just recently added (by me) and didn't account for the
fact that stdin_path is sometimes NULL. If it's NULL, and
SetSecurityAllLabel fails, a segfault would result.
Because tty path is unexpectedly not saved in the live configuration
file of a domain, libvirtd cannot get the console of the domain back
after restarting.
The reason why the tty path isn't saved is that, to save the tty path,
the save function, virDomainSaveConfig, requires that the target domain
is running (pid != -1), however, lxc driver calls the function before
starting the domain to pass the configuration to controller.
To ensure to save the tty path, the patch lets lxc driver call the save
function again after starting the domain.
The function is expected to return negative value on failure,
however, it returns positive value when either setInterfaceName
or vethInterfaceUpOrDown fails. Because the function returns
the return value of either as is, however, the two functions
may return positive value on failure.
The patch fixes the defects and add error messages.
When the saved domain image is on an NFS share, at least some part of
domainSetSecurityAllLabel will fail (for example, selinux labels can't
be modified). To allow domain restore to still work in this case, just
ignore the errors.
virStorageFileIsSharedFS would previously only work if the entire path
in question was stat'able by the uid of the libvirtd process. This
patch changes it to crawl backwards up the path retrying the statfs
call until it gets to a partial path that *can* be stat'ed.
This is necessary to use the function to learn the fstype for files
stored as a different user (and readable only by that user) on a
root-squashed remote filesystem.
Also restore the label to its original value after qemu is finished
with the file.
Prior to this patch, qemu domain restore did not function properly if
selinux was set to enforce.
If an active migration operation fails, or is cancelled by the
admin, the QEMU on the destination is shutdown and the one on
the source continues running. It is important in shutting down
the QEMU on the destination, the security drivers don't reset
the file labelling/permissions.
* src/qemu/qemu_driver.c: Don't reset labelling/permissions
on migration abort
Minor speedups by using the full power of sed.
* src/phyp/phyp_driver.c (phypGetVIOSFreeSCSIAdapter)
(phypDiskType, phypListDefinedDomains): Use fewer processes, by
folding other work into sed.
(phypGetVIOSPartitionID): Likewise. Also avoid non-portable use
of 'sed -s'.