fromConfig should be true if the caller wants
virDomainPCIAddressValidate() to loosen restrictions on its
interpretation of the pciConnectFlags. In particular, either
PCI_DEVICE or PCIE_DEVICE will be counted as equivalent to both, and
HOTPLUG will be ignored. In a few cases where libvirt was manually
overriding automatic address assignment, it was setting fromConfig to
false when validating the hardcoded manual override. This patch
changes those to fromConfig=true as a preemptive strike against any
future bugs that might otherwise surface.
Although setting virDomainPCIAddressReserveAddr()'s fromConfig=true is
correct when a PCI addres is coming from a domain's config, the *true*
purpose of the fromConfig argument is to lower restrictions on what
kind of device can plug into what kind of controller - if fromConfig
is true, then a PCIE_DEVICE can plug into a slot that is marked as
only compatible with PCI_DEVICE (and vice versa), and the HOTPLUG flag
is ignored.
For a long time there have been several calls to
virDomainPCIAddressReserveAddr() that have fromConfig incorrectly set
to false - it's correct that the addresses aren't coming from user
config, but they are coming from hardcoded exceptions in libvirt that
should, if anything, pay *even less* attention to following the
pciConnectFlags (under the assumption that the libvirt programmer knew
what they were doing).
See commit b87703cf7 for an example of an actual bug caused by the
incorrect setting of the "fromConfig" argument to
virDomainPCIAddressReserveAddr(). Although they haven't resulted in
any reported bugs, this patch corrects all the other incorrect
settings of fromConfig in calls to virDomainPCIAddressReserveAddr().
If a PCI device has VIR_PCI_CONNECT_AGGREGATE_SLOT set in its
pciConnectFlags, then during address assignment we allow multiple
instances of this type of device to be auto-assigned to multiple
functions on the same device. A slot is used for aggregating multiple
devices only if the first device assigned to that slot had
VIR_PCI_CONNECT_AGGREGATE_SLOT set. but any device types that have
AGGREGATE_SLOT set might be mix/matched on the same slot.
(NB: libvirt should never set the AGGREGATE_SLOT flag for a device
type that might need to be hotplugged. Currently it is only planned
for pcie-root-port and possibly other PCI controller types, and none
of those are hotpluggable anyway)
There aren't yet any devices that use this flag. That will be in a
later patch.
If there are multiple devices assigned to the different functions of a
single PCI slot, they will not work properly if the device at function
0 doesn't have its "multi" attribute turned on, so it makes sense for
libvirt to turn it on during PCI address assignment. Setting multi
then assures that the new setting is stored in the config (so it will
be used next time the domain is started), preventing any potential
problems in the case that a future change in the configuration
eliminates the devices on all non-0 functions (multi will still be set
for function 0 even though it is the only function in use on the slot,
which has no useful purpose, but also doesn't cause any problems).
(NB: If we were to instead just decide on the setting for
multifunction at runtime, a later removal of the non-0 functions of a
slot would result in a silent change in the guest ABI for the
remaining device on function 0 (although it may seem like an
inconsequential guest ABI change, it *is* a guest ABI change to turn
off the multi bit).)
setting reserveEntireSlot really accomplishes nothing - instead of
going to the trouble of computing the value for reserveEntireSlot and
then possibly setting *all* functions of the slot as in-use, we can
just set the in-use bit only for the specific function being used by a
device. Later we will know from the context (the PCI connect flags,
and whether we are reserving a specific address or asking for "the
next available") whether or not it is okay to allocate other functions
on the same slot.
Although it's not used yet, we allow specifying "-1" for the function
number when looking for the "next available slot" - this is going to
end up meaning "return the lowest available function in the slot, but
since we currently only provide a function from an otherwise unused
slot, "-1" ends up meaning "0".
When keeping track of which functions of which slots are allocated, we
will need to have more information than just the current bitmap with a
bit for each function that is currently stored for each slot in a
virDomainPCIAddressBus. To prepare for adding more per-slot info, this
patch changes "uint8_t slots" into "virDomainPCIAddressSlot slot", which
currently has a single member named "functions" that serves the same
purpose previously served directly by "slots".
This function is used only from code compiled on Linux. Therefore
on non-Linux platforms it triggers compilation error:
../../src/qemu/qemu_domain.c:209:1: error: unused function 'qemuDomainGetPreservedMounts' [-Werror,-Wunused-function]
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
For the blockjobs, where libvirt is able to track the state internally
we can fix locking of images we can remove the appropriate locks.
Also when doing a pivoting operation we should not acquire the lock on
any of those images since both are actually locked already.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1302168
Images that became the backing chain of the current image due to the
snapshot need to be unlocked in the lock manager. Also if qemu was
paused during the snapshot the current top level images need to be
released until qemu is resumed so that they can be acquired properly.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1191901
The code at first changed the definition and then rolled it back in case
of failure. This was ridiculous. Refactor the code so that the image in
the definition is changed only when the snapshot is successful.
The refactor will also simplify further fix of image locking when doing
snapshots.
Libvirt is able to properly model what happens to the backing chain
after a snapshot so there's no real need to redetect the data.
Additionally with the _REUSE_EXT flag this might end up in redetecting
wrong data if the user puts wrong backing chain reference into the
snapshot image.
Again, there is no need to create /var/lib/libvirt/$domain.*
directories in CreateNamespace(). It is sufficient to create them
as soon as we need them which is in BuildNamespace. This way we
don't leave them around for the whole lifetime of domain.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
The c1140eb9e got me thinking. We don't want to special case /dev
in qemuDomainGetPreservedMounts(), but in all other places in the
code we special case it anyway. I mean,
/var/run/libvirt/$domain.dev path is constructed separately just
so that it is not constructed here. It makes only a little sense
(if any at all).
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
If something goes wrong in this function we try a rollback. That
is unlink all the directories we created earlier. For some weird
reason unlink() was called instead of rmdir().
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
So far if qemu is spawned under separate mount namespace in order
to relabel everything it needs an access to the security driver
to run in that namespace too. This has a very nasty down side -
it is being run in a separate process, so any internal state
transition is NOT reflected in the daemon. This can lead to many
sleepless nights. Therefore, use the transaction APIs so that
libvirt developers can sleep tight again.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
The code at the very bottom of the DAC secdriver that calls
chown() should be fine with read-only data. If something needs to
be prepared it should have been done beforehand.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
virtio-pci is the way forward for aarch64 guests: it's faster
and less alien to people coming from other architectures.
Now that guest support is finally getting there (Fedora 24,
CentOS 7.3, Ubuntu 16.04 and Debian testing all support
virtio-pci out of the box), we'd like to start using it by
default instead of virtio-mmio.
Users and applications can already opt-in by explicitly using
<address type='pci'/>
inside the relevant elements, but that's kind of cumbersome and
requires all users and management applications to adapt, which
we'd really like to avoid.
What we can do instead is use virtio-mmio only if the guest
already has at least one virtio-mmio device, and use virtio-pci
in all other situations.
That means existing virtio-mmio guests will keep using the old
addressing scheme, and new guests will automatically be created
using virtio-pci instead. Users can still override the default
in either direction.
Existing tests such as aarch64-aavmf-virtio-mmio and
aarch64-virtio-pci-default already cover all possible
scenarios, so no additions to the test suites are necessary.
When coldplugging vcpus to a VM that already has a few hotpluggable
vcpus the code might generate invalid configuration as
non-hotpluggable cpus need to be clustered starting from vcpu 0.
This fix forces the added vcpus to be hotpluggable in such case.
Fixes a corner case described in:
https://bugzilla.redhat.com/show_bug.cgi?id=1370357
This patch adds support and documentation for
a generalized hardware cache event called cache_l1d
perf event.
Signed-off-by: Nitesh Konkar <nitkon12@linux.vnet.ibm.com>
When changing the metadata via virDomainSetMetadata, we now
emit an event to notify the app of changes. This is useful
when co-ordinating different applications read/write of
custom metadata.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Separate out the "policy=discard" into it's own specific
qemu command line.
We'll rename "kvm-pit-device" test case to be "kvm-pit-discard"
since it has the syntax we'd be using.
Signed-off-by: Maxim Nestratov <mnestratov@virtuozzo.com>
By a mistake, for the VIR_DOMAIN_TIMER_TICKPOLICY_DELAY qemu
command line creation, 'discard' was used instead of 'delay'
in commit id '1569fa14'.
Test "kvm-pit-delay" is fixed accordingly to show the correct
option being generated.
Remove the (now) redundant kvm-pit-device tests. As it turns
out there is no need to specify both QEMU_CAPS_NO_KVM_PIT and
QEMU_CAPS_KVM_PIT_TICK_POLICY since they are mutually exclusive
and "kvm-pit-device" becomes just the same as "kvm-pit-delay".
Signed-off-by: Maxim Nestratov <mnestratov@virtuozzo.com>
Qemu has abandoned the +/-feature syntax in favor of key=value. Some
architectures (s390) do not support +/-feature. So we update libvirt to handle
both formats.
If we detect a sufficiently new Qemu (indicated by support for qmp
query-cpu-model-expansion) we use key=value else we fall back to +/-feature.
Signed-off-by: Collin L. Walling <walling@linux.vnet.ibm.com>
Signed-off-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
When qmp query-cpu-model-expansion is available probe Qemu for its view of the
host model. In kvm environments this can provide a more complete view of the
host model because features supported by Qemu and Kvm can be considered.
Signed-off-by: Collin L. Walling <walling@linux.vnet.ibm.com>
Signed-off-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
query-cpu-model-expansion is used to get a list of features for a given cpu
model name or to get the model and features of the host hardware/environment
as seen by Qemu/kvm.
Signed-off-by: Collin L. Walling <walling@linux.vnet.ibm.com>
Signed-off-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
Just so it doesn't bite us in the future, even though it's unlikely.
And fix the comment above it as well. Commit e08ee7cd34 took the
info from the function it's calling, but that was lie itself in the
first place.
Signed-off-by: Martin Kletzander <mkletzan@redhat.com>
With my namespace patches, we are spawning qemu in its own
namespace so that we can manage /dev entries ourselves. However,
some filesystems mounted under /dev needs to be preserved in
order to be shared with the parent namespace (e.g. /dev/pts).
Currently, the list of mount points to preserve is hardcoded
which ain't right - on some systems there might be less or more
items under real /dev that on our list. The solution is to parse
/proc/mounts and fetch the list from there.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
If we restart libvirtd while VM was doing external memory snapshot, VM's
state be updated to paused as a result of running a migration-to-file
operation, and then VM will be left as paused state. In this case we must
restart the VM's CPUs to resume it.
Signed-off-by: Wang King <king.wang@huawei.com>
Commit 4b951d1e38 missed the fact that the
VM needs to be resumed after a live external checkpoint (memory
snapshot) where the cpus would be paused by the migration rather than
libvirt.
Again, not something that I'd hit, but there is a chance in
theory that this might bite us. Currently the way we decide
whether or not to create /dev entry for a device is by marching
first four characters of path with "/dev". This might be not
enough. Just imagine somebody has a disk image stored under
"/devil/path/to/disk". We ought to be matching against "/dev/".
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Not that I'd encounter any bug here, but the code doesn't look
100% correct. Imagine, somebody is trying to attach a device to a
domain, and the device's /dev entry already exists in the qemu
namespace. This is handled gracefully and the control continues
with setting up ACLs and calling security manager to set up
labels. Now, if any of these steps fail, control jump on the
'cleanup' label and unlink() the file straight away. Even when it
was not us who created the file in the first place. This can be
possibly dangerous.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
https://bugzilla.redhat.com/show_bug.cgi?id=1406837
Imagine you have a domain configured in such way that you are
assigning two PCI devices that fall into the same IOMMU group.
With mount namespace enabled what happens is that for the first
PCI device corresponding /dev/vfio/X entry is created and when
the code tries to do the same for the second mknod() fails as
/dev/vfio/X already exists:
2016-12-21 14:40:45.648+0000: 24681: error :
qemuProcessReportLogError:1792 : internal error: Process exited
prior to exec: libvirt: QEMU Driver error : Failed to make device
/var/run/libvirt/qemu/windoze.dev//vfio/22: File exists
Worse, by default there are some devices that are created in the
namespace regardless of domain configuration (e.g. /dev/null,
/dev/urandom, etc.). If one of them is set as backend for some
guest device (e.g. rng, chardev, etc.) it's the same story as
described above.
Weirdly, in attach code this is already handled.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
https://bugzilla.redhat.com/show_bug.cgi?id=1405269
If a secret was not provided for what was determined to be a LUKS
encrypted disk (during virStorageFileGetMetadata processing when
called from qemuDomainDetermineDiskChain as a result of hotplug
attach qemuDomainAttachDeviceDiskLive), then do not attempt to
look it up (avoiding a libvirtd crash) and do not alter the format
to "luks" when adding the disk; otherwise, the device_add would
fail with a message such as:
"unable to execute QEMU command 'device_add': Property 'scsi-hd.drive'
can't find value 'drive-scsi0-0-0-0'"
because of assumptions that when the format=luks that libvirt would have
provided the secret to decrypt the volume.
Access to unlock the volume will thus be left to the application.
virQEMUCapsSupportsChardev existing checks returns true
for spapr-vty alone. Instead verify spapr-vty validity
and let the logic to return true for other device types
so that virtio-console passes.
The non-pseries machines dont have spapr-vio-bus. So, the
function always returned false for them before.
Fixes - https://bugzilla.redhat.com/show_bug.cgi?id=1257813
Signed-off-by: Shivaprasad G Bhat <sbhat@linux.vnet.ibm.com>
According to commit id '0282ca45a' the 'physical' value should
essentially be the last offset of the image or the host physical
size in bytes of the image container. However, commit id '15fa84ac'
refactored the GetBlockInfo to use the same returned data as the
GetStatsBlock API for an active domain. For the 'entry->physical'
that would end up being the "actual-size" as set through the
qemuMonitorJSONBlockStatsUpdateCapacityOne (commit '7b11f5e5').
Digging deeper into QEMU code one finds that actual_size is
filled in using the same algorithm as GetBlockInfo has used for
setting the 'allocation' field when the domain is inactive.
The difference in values is seen primarily in sparse raw files
and other container type files (such as qcow2), which will return
a smaller value via the stat API for 'st_blocks'. Additionally
for container files, the 'capacity' field (populated via the
QEMU "virtual-size" value) may be slightly different (smaller)
in order to accomodate the overhead for the container. For
sparse files, the state 'st_size' field is returned.
This patch thus alters the allocation and physical values for
sparse backed storage files to be more appropriate to the API
contract. The result for GetBlockInfo is the following:
capacity: logical size in bytes of the image (how much storage
the guest will see)
allocation: host storage in bytes occupied by the image (such
as highest allocated extent if there are no holes,
similar to 'du')
physical: host physical size in bytes of the image container
(last offset, similar to 'ls')
NB: The GetStatsBlock API allows a different contract for the
values:
"block.<num>.allocation" - offset of the highest written sector
as unsigned long long.
"block.<num>.capacity" - logical size in bytes of the block device
backing image as unsigned long long.
"block.<num>.physical" - physical size in bytes of the container
of the backing image as unsigned long long.
Disk->info is not live updatable so add a check for this. Otherwise
libvirt reports success even though no data was updated.
Signed-off-by: Marc Hartmayer <mhartmay@linux.vnet.ibm.com>
Reviewed-by: Bjoern Walk <bwalk@linux.vnet.ibm.com>
Reviewed-by: Boris Fiuczynski <fiuczy@linux.vnet.ibm.com>
External disk-only snapshots with recent enough qemu don't require
libvirt to pause the VM. The logic determining when to resume cpus was
slightly flawed and attempted to resume them even if they were not
paused by the snapshot code. This normally was not a problem, but with
locking enabled the code would attempt to acquire the lock twice.
The fallout of this bug would be a error from the API, but the actual
snapshot being created. The bug was introduced with when adding support
for external snapshots with memory (checkpoints) in commit f569b87.
Resolves problems described by:
https://bugzilla.redhat.com/show_bug.cgi?id=1403691
After qemu delivers the resume event it's already running and thus it's
too late to enter lockspaces since it may already have modified the
disk. The code only creates false log entries in the case when locking
is enabled. The lockspace needs to be acquired prior to starting cpus.
Given how intrusive previous patches are, it might happen that
there's a bug or imperfection. Lets give users a way out: if they
set 'namespaces' to an empty array in qemu.conf the feature is
suppressed.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>