Refactor the TLS object adding code to make two separate API's that will
handle the add/remove of the "secret" and "tls-creds-x509" objects including
the Enter/Exit monitor commands.
Signed-off-by: John Ferlan <jferlan@redhat.com>
Since qemuDomainObjExitMonitor can also generate error messages,
let's move it inside any error message saving code on error paths
for various hotplug add activities.
Signed-off-by: John Ferlan <jferlan@redhat.com>
https://bugzilla.redhat.com/show_bug.cgi?id=1379200
When we are restoring a domain from a saved image, or just
updating its XML in the saved image - we have to make sure that
the ABI guests sees will not change. We have a function for that
which reports errors. But for some reason if this function fails,
we call it again with slightly different argument. Therefore it
might happen that we overwrite the original error and leave user
with less helpful one.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Now that we have some qemuSecurity wrappers over
virSecurityManager APIs, lets make sure everybody sticks with
them. We have them for a reason and calling virSecurityManager
API directly instead of wrapper may lead into accidentally
labelling a file on the host instead of namespace.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
The implementation matches virStringListFreeCount. The only difference
between the two functions is the ordering of their parameters.
Signed-off-by: Jiri Denemark <jdenemar@redhat.com>
The static CPU model expansion is designed to return only canonical
names of all CPU properties. To maintain backwards compatibility libvirt
is stuck with different spelling of some of the features, but we need to
use the full expansion to get the additional spellings. In addition to
returning all spelling variants for all properties the full expansion
will contain properties which are not guaranteed to be migration
compatible. Thus, we need to combine both expansions. First we need to
call the static expansion to limit the result to migratable properties.
Then we can use the result of the static expansion as an input to the
full expansion to get both canonical names and their aliases.
Signed-off-by: Jiri Denemark <jdenemar@redhat.com>
Querying "host" CPU model expansion only makes sense for KVM. QEMU 2.9.0
introduces a new "max" CPU model which can be used to ask QEMU what the
best CPU it can provide to a TCG domain is.
Signed-off-by: Jiri Denemark <jdenemar@redhat.com>
While query-cpu-model-expansion returns only boolean features on s390,
but x86_64 reports some integer and string properties which we are
interested in.
Signed-off-by: Jiri Denemark <jdenemar@redhat.com>
virQEMUCapsHasPCIMultiBus() performs a version check on
the QEMU binary to figure out whether multiple buses are
supported, so to get the correct aliases assigned when
dealing with pSeries guests we need to spoof the version
accordingly in the test suite.
Due to the extra architecture-specific logic, it's already
necessary for users to call virQEMUCapsHasPCIMultiBus(),
so the capability itself is just a pointless distraction.
Our documentation states that the chardev logging file is truncated
unless append='on' is specified. QEMU also behaves the same way and
truncates the file unless we provide the argument. The new virlogd
implementation did not honor if the argument was missing and continued
to append to the file.
Truncate the file even when the 'append' attribute is not present to
behave the same with both implementations and adhere to the docs.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1420205
After eca76884ea in case of error in qemuDomainSetPrivatePaths()
in pretended start we jump to stop. I've changed this during
review from 'cleanup' which turned out to be correct. Well, sort
of. We can't call qemuProcessStop() as it decrements
driver->nactive and we did not increment it. However, it calls
virDomainObjRemoveTransientDef() which is basically the only
function we need to call. So call that function and goto cleanup;
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
The CPU driver provides APIs to create and free virCPUDataPtr. Thus all
APIs exported from the driver should work with that rather than
requiring the caller to pass a pointer to an internal part of the
structure.
In other words
virCPUx86DataAddCPUID(cpudata, &cpuid)
is much better than the original
virCPUx86DataAddCPUID(&cpudata->data.x86, &cpuid)
Signed-off-by: Jiri Denemark <jdenemar@redhat.com>
The new API is called virCPUDataFree. Individual CPU drivers are no
longer required to implement their own freeing function unless they need
to free architecture specific data from virCPUData.
Signed-off-by: Jiri Denemark <jdenemar@redhat.com>
Our documentation of the domain capabilities XML says that the fallback
attribute of a CPU model is used to indicate whether the CPU model was
detected by libvirt itself (fallback="allow") or by asking the
hypervisor (fallback="forbid"). We need to properly set
fallback="forbid" when CPU model comes from QEMU to match the
documentation.
Signed-off-by: Jiri Denemark <jdenemar@redhat.com>
The port is stored in graphics configuration and it will
also get released in qemuProcessStop in case of error.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1397440
Signed-off-by: Pavel Hrdina <phrdina@redhat.com>
Previously the code called virStorageSourceUpdateBlockPhysicalSize which
did not do anything on empty drives since it worked only on block
devices. After the refactor in c5f6151390 it's called for all devices
and thus attempts to deref the NULL path of empty drives.
Add a check that skips the update of the physical size if the storage
source is empty.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1420718
Fix incorrect jump labels in error paths as the stop jump is only
needed if the driver has already changed the state. For example
'virAtomicIntInc(&driver->nactive)' will be 'reverted' in the
qemuProcessStop call.
Signed-off-by: Marc Hartmayer <mhartmay@linux.vnet.ibm.com>
Reviewed-by: Boris Fiuczynski <fiuczy@linux.vnet.ibm.com>
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
When a domain needs an access to some device (be it a disk, RNG,
chardev, whatever), we have to allow it in the devices CGroup (if
it is available), because by default we disallow all the devices.
But some of the functions that are responsible for setting up
devices CGroup are lacking check whether there is any CGroup
available. Thus users might be unable to hotplug some devices:
virsh # attach-device fedora rng.xml
error: Failed to attach device from rng.xml
error: internal error: Controller 'devices' is not mounted
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
One of the conditions in qemuDomainDeviceCalculatePCIConnectFlags
was missing a break that could result it in falling through to
an incorrect codepath.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
qemuDomainAssignPCIAddresses() hardcoded the assumption
that the only way to support devices on a non-zero bus is
to add one or more pci-bridges; however, since we now
support a large selection of PCI controllers that can be
used instead, the assumption is no longer true.
Moreover, this check was always redundant, because the
only sensible time to check for the availability of
pci-bridge is when building the QEMU command line, and
such a check is of course already in place.
In fact, there were *two* such checks, but since one of
the two was relying on the incorrect assumption explained
above, and it was redundant anyway, it has been dropped.
When switching over the values in the virDomainControllerModelPCI
enumeration, make sure the proper cast is in place so that the
compiler can warn us when the coverage is not exaustive.
For the same reason, fold some unstructured checks (performed by
comparing directly against some values in the enumeration) inside
an existing switch statement.
The CPU model info formating code in virQEMUCapsFormatCache will get
more complicated soon. Separating the code in
virQEMUCapsFormatHostCPUModelInfo will make the result easier to read.
Signed-off-by: Jiri Denemark <jdenemar@redhat.com>
All features the function is currently supposed to filter out are
specific to x86_64. We should avoid removing them on other
architectures. It seems to be quite unlikely other achitectures would
use the same names, but one can never be sure.
Signed-off-by: Jiri Denemark <jdenemar@redhat.com>
The functions in virCommand() after fork() must be careful with regard
to accessing any mutexes that may have been locked by other threads in
the parent process. It is possible that another thread in the parent
process holds the lock for the virQEMUDriver while fork() is called.
This leads to a deadlock in the child process when
'virQEMUDriverGetConfig(driver)' is called and therefore the handshake
never completes between the child and the parent process. Ultimately
the virDomainObjectPtr will never be unlocked.
It gets much worse if the other thread of the parent process, that
holds the lock for the virQEMUDriver, tries to lock the already locked
virDomainObject. This leads to a completely unresponsive libvirtd.
It's possible to reproduce this case with calling 'virsh start XXX'
and 'virsh managedsave XXX' in a tight loop for multiple domains.
This commit fixes the deadlock in the same way as it is described in
commit 61b52d2e38.
Signed-off-by: Marc Hartmayer <mhartmay@linux.vnet.ibm.com>
Reviewed-by: Boris Fiuczynski <fiuczy@linux.vnet.ibm.com>
With that users could access files outside /dev/shm. That itself
isn't a security problem, but might cause some errors we want to
avoid. So let's forbid slashes as we do with domain and volume names
and also mention that in the schema.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1395496
Signed-off-by: Martin Kletzander <mkletzan@redhat.com>
This follows the same check for disk, because we cannot remove iothread
if it's used by disk or by controller. It could lead to crashing QEMU.
Signed-off-by: Pavel Hrdina <phrdina@redhat.com>
If virDomainDelIOThread API was called with VIR_DOMAIN_AFFECT_LIVE
and VIR_DOMAIN_AFFECT_CONFIG and both XML were already a different
it could result in removing iothread from config XML even if there
was a disk using that iothread.
Signed-off-by: Pavel Hrdina <phrdina@redhat.com>
The situation covered by the removed code will not ever happen.
This code is called only while starting a new QEMU process where
the capabilities where already checked and while attaching to
existing QEMU process where we don't even detect the iothreads.
Signed-off-by: Pavel Hrdina <phrdina@redhat.com>
When enabling virgl, qemu opens /dev/dri/render*. So far, we are
not allowing that in devices CGroup nor creating the file in
domain's namespace and thus requiring users to set the paths in
qemu.conf. This, however, is suboptimal as it allows access to
ALL qemu processes even those which don't have virgl configured.
Now that we have a way to specify render node that qemu will use
we can be more cautious and enable just that.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
So far, qemuDomainGetHostdevPath has no knowledge of the reasong
it is called and thus reports /dev/vfio/vfio for every VFIO
backed device. This is suboptimal, as we want it to:
a) report /dev/vfio/vfio on every addition or domain startup
b) report /dev/vfio/vfio only on last VFIO device being unplugged
If a domain is being stopped then namespace and CGroup die with
it so no need to worry about that. I mean, even when a domain
that's exiting has more than one VFIO devices assigned to it,
this function does not clean /dev/vfio/vfio in CGroup nor in the
namespace. But that doesn't matter.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
So far, we are allowing /dev/vfio/vfio in the devices cgroup
unconditionally (and creating it in the namespace too). Even if
domain has no hostdev assignment configured. This is potential
security hole. Therefore, when starting the domain (or
hotplugging a hostdev) create & allow /dev/vfio/vfio too (if
needed).
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Since these two functions are nearly identical (with
qemuSetupHostdevCgroup actually calling virCgroupAllowDevicePath)
we can have one function call the other and thus de-duplicate
some code.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
There's no need for this function. Currently it is passed as a
callback to virSCSIVHostDeviceFileIterate(). However, SCSI host
devices have just one file path. Therefore we can mimic approach
used in qemuDomainGetHostdevPath() to get path and call
virCgroupAllowDevicePath() directly.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
There's no need for this function. Currently it is passed as a
callback to virSCSIDeviceFileIterate(). However, SCSI devices
have just one file path. Therefore we can mimic approach used in
qemuDomainGetHostdevPath() to get path and call
virCgroupAllowDevicePath() directly.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
There's no need for this function. Currently it is passed as a
callback to virUSBDeviceFileIterate(). However, USB devices have
just one file path. Therefore we can mimic approach used in
qemuDomainGetHostdevPath() to get path and call
virCgroupAllowDevicePath() directly.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Add a new attribute 'rendernode' to <gl> spice element.
Give it to QEMU if qemu supports it (queued for 2.9).
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
This function is returning a boolean therefore check for '< 0'
makes no sense. It should have been
'!qemuDomainNamespaceAvailable'.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
The bare fact that mnt namespace is available is not enough for
us to allow/enable qemu namespaces feature. There are other
requirements: we must copy all the ACL & SELinux labels otherwise
we might grant access that is administratively forbidden or vice
versa.
At the same time, the check for namespace prerequisites is moved
from domain startup time to qemu.conf parser as it doesn't make
much sense to allow users to start misconfigured libvirt just to
find out they can't start a single domain.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Commit 2a8d40f4ec refactored qemuMonitorJSONGetCPUx86Data and replaced
virJSONValueObjectGet(reply, "return") with virJSONValueObjectGetArray.
While the former is guaranteed to always return non-NULL pointer the
latter may return NULL if the returned JSON object is not an array.
Signed-off-by: Jiri Denemark <jdenemar@redhat.com>
mknod() is affected my the current umask, so we're not
guaranteed the newly-created device node will have the
right permissions.
Call chmod(), which is not affected by the current umask,
immediately afterwards to solve the issue.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1421036
Due to a logic error, the autofilling of USB port when a bus is
specified:
<address type='usb' bus='0'/>
does not work for non-hub devices on domain startup.
Fix the logic in qemuDomainAssignUSBPortsIterator to also
assign ports for USB addresses that do not yet have one.
https://bugzilla.redhat.com/show_bug.cgi?id=1374128
==11846== 240 bytes in 1 blocks are definitely lost in loss record 81 of 107
==11846== at 0x4C2BC75: calloc (vg_replace_malloc.c:624)
==11846== by 0x18C74242: virAllocN (viralloc.c:191)
==11846== by 0x4A05E8: qemuMonitorCPUModelInfoCopy (qemu_monitor.c:3677)
==11846== by 0x446E3C: virQEMUCapsNewCopy (qemu_capabilities.c:2171)
==11846== by 0x437335: testQemuCapsCopy (qemucapabilitiestest.c:108)
==11846== by 0x437CD2: virTestRun (testutils.c:180)
==11846== by 0x437AD8: mymain (qemucapabilitiestest.c:176)
==11846== by 0x4397B6: virTestMain (testutils.c:992)
==11846== by 0x437B44: main (qemucapabilitiestest.c:188)
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Check if virQEMUCapsNewCopy(...) has failed, thus a segmentation fault
in virQEMUCapsFilterByMachineType(...) will be avoided.
Signed-off-by: Marc Hartmayer <mhartmay@linux.vnet.ibm.com>
Reviewed-by: Bjoern Walk <bwalk@linux.vnet.ibm.com>
Using libvirt to do live migration over RDMA via IPv6 address failed.
For example:
rhel73_host1_guest1 qemu+ssh://[deba::2222]/system --verbose
root@deba::2222's password:
error: internal error: unable to execute QEMU command 'migrate': RDMA
ERROR: could not rdma_getaddrinfo address deba
As we can see, the IPv6 address used by rdma_getaddrinfo() has only
"deba" part because we didn't properly enclose the IPv6 address in []
and passed rdma:deba::2222:49152 as the migration URI in
qemuMonitorMigrateToHost.
Signed-off-by: David Dai <zdai@linux.vnet.ibm.com>
This patch add support for file memory backing on numa topology.
The specified access mode in memoryBacking can be overriden
by specifying token memAccess in numa cell.
Add new parameter memory_backing_dir where files will be stored when memoryBacking
source is selected as file.
Value is stored inside char* memoryBackingDir
Rename to avoid duplicate code. Because virDomainMemoryAccess will be
used in memorybacking for setting default behaviour.
NOTE: The enum cannot be moved to qemu/domain_conf because of headers
dependency
Just like we need wrappers over other virSecurityManager APIs, we
need one for virSecurityManagerSetImageLabel and
virSecurityManagerRestoreImageLabel. Otherwise we might end up
relabelling device in wrong namespace.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Firstly, instead of checking for next->path the
virStorageSourceIsEmpty() function should be used which also
takes disk type into account.
Secondly, not every disk source passed has the correct type set
(due to our laziness). Therefore, instead of checking for
virStorageSourceIsBlockLocal() and also S_ISBLK() the former can
be refined to just virStorageSourceIsLocalStorage().
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Again, one missed bit. This time without this commit there is no
/dev entry in the namespace of the qemu process when doing disk
snapshots or block-copy.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
These functions do not need to see the whole virDomainDiskDef.
Moreover, they are going to be called from places where we don't
have access to the full disk definition. Sticking with
virStorageSource is more than enough.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
There is no need for this. None of the namespace helpers uses it.
Historically it was used when calling secdriver APIs, but we
don't to that anymore.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Again, one missed bit. This time without this commit there is no
/dev entry in the namespace of the qemu process when attaching
vhost SCSI device.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Since we have qemuSecurity wrappers over
virSecurityManagerSetHostdevLabel and
virSecurityManagerRestoreHostdevLabel we ought to use them
instead of calling secdriver APIs directly. Without those
wrappers the labelling won't be done in the correct namespace
and thus won't apply to the nodes seen by qemu itself.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
libvirt was able to set the host_mtu option when an MTU was explicitly
given in the interface config (with <mtu size='n'/>), set the MTU of a
libvirt network in the network config (with the same named
subelement), and would automatically set the MTU of any tap device to
the MTU of the network.
This patch ties that all together (for networks based on tap devices
and either Linux host bridges or OVS bridges) by learning the MTU of
the network (i.e. the bridge) during qemuInterfaceBridgeConnect(), and
returning that value so that it can then be passed to
qemuBuildNicDevStr(); qemuBuildNicDevStr() then sets host_mtu in the
interface's commandline options.
The result is that a higher MTU for all guests connecting to a
particular network will be plumbed top to bottom by simply changing
the MTU of the network (in libvirt's config for libvirt-managed
networks, or directly on the bridge device for simple host bridges or
OVS bridges managed outside of libvirt).
One question I have about this - it occurred to me that in the case of
migrating a guest from a host with an older libvirt to one with a
newer libvirt, the guest may have *not* had the host_mtu option on the
older machine, but *will* have it on the newer machine. I'm curious if
this could lead to incompatibilities between source and destination (I
guess it all depends on whether or not the setting of host_mtu has a
practical effect on a guest that is already running - Maxime?)
Likewise, we could run into problems when migrating from a newer
libvirt to older libvirt - The guest would have been told of the
higher MTU on the newer libvirt, then migrated to a host that didn't
understand <mtu size='blah'/>. (If this really is a problem, it would
be a problem with or without the current patch).
virNetDevTapCreateInBridgePort() has always set the new tap device to
the current MTU of the bridge it's being attached to. There is one
case where we will want to set the new tap device to a different
(usually larger) MTU - if that's done with the very first device added
to the bridge, the bridge's MTU will be set to the device's MTU. This
patch allows for that possibility by adding "int mtu" to the arg list
for virNetDevTapCreateInBridgePort(), but all callers are sending -1,
so it doesn't yet have any effect.
Since the requested MTU isn't necessarily what is used in the end (for
example, if there is no MTU requested, the tap device will be set to
the current MTU of the bridge), and the hypervisor may want to know
the actual MTU used, we also return the actual MTU to the caller (if
actualMTU is non-NULL).
In order for memory locking to work, the hard limit on memory
locking (and usage) has to be set appropriately by the user.
The documentation mentions the requirement already: with this
patch, it's going to be enforced by runtime checks as well,
by forbidding a non-compliant guest from being defined as well
as edited and started.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1316774
Similarly to one of the previous commits, we need to deal
properly with symlinks in hotplug case too.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Imagine you have a disk with the following source set up:
/dev/disk/by-uuid/$uuid (symlink to) -> /dev/sda
After cbc45525cb the transitive end of the symlink chain is
created (/dev/sda), but we need to create any item in chain too.
Others might rely on that.
In this case, /dev/disk/by-uuid/$uuid comes from domain XML thus
it is this path that secdriver tries to relabel. Not the resolved
one.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
After previous commit this has become redundant step.
Also setting up devices in namespace and setting their label
later on are two different steps and should be not done at once.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
The idea is to move all the seclabel setting to security driver.
Having the relabel code spread all over the place looks very
messy.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Because of the nature of security driver transactions, it is
impossible to use them properly. The thing is, transactions enter
the domain namespace and commit all the seclabel changes.
However, in RestoreAllLabel() this is impossible - the qemu
process, the only process running in the namespace, is gone. And
thus is the namespace. Therefore we shouldn't use the transactions
as there is no namespace to enter.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
The current ordering is as follows:
1) set label
2) create the device in namespace
3) allow device in the cgroup
While this might work for now, it will definitely not work if the
security driver would use transactions as in that case there
would be no device to relabel in the domain namespace as the
device is created in the second step.
Swap steps 1) and 2) to allow security driver to use more
transactions.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Now that we have a function for properly assigning the blockdeviotune
info, let's use it instead of dropping the group name on every
assignment. Otherwise it will not work with both --live and --config
options.
Signed-off-by: Martin Kletzander <mkletzan@redhat.com>
Commit 815d98a started auto-adding one hub if there are more USB devices
than available USB ports.
This was a strange choice, since there might be even more devices.
Before USB address allocation was implemented in libvirt, QEMU
automatically added a new USB hub if the old one was full.
Adjust the logic to try adding as many hubs as will be needed
to plug in all the specified devices.
https://bugzilla.redhat.com/show_bug.cgi?id=1410188
==12618== 110 bytes in 10 blocks are definitely lost in loss record 269 of 295
==12618== at 0x4C2AE5F: malloc (vg_replace_malloc.c:297)
==12618== by 0x1CFC6DD7: vasprintf (vasprintf.c:73)
==12618== by 0x1912B2FC: virVasprintfInternal (virstring.c:551)
==12618== by 0x1912B411: virAsprintfInternal (virstring.c:572)
==12618== by 0x50B1FF: qemuAliasChardevFromDevAlias (qemu_alias.c:638)
==12618== by 0x518CCE: qemuBuildChrChardevStr (qemu_command.c:4973)
==12618== by 0x522DA0: qemuBuildShmemBackendChrStr (qemu_command.c:8674)
==12618== by 0x523209: qemuBuildShmemCommandLine (qemu_command.c:8789)
==12618== by 0x526135: qemuBuildCommandLine (qemu_command.c:9843)
==12618== by 0x48B4BA: qemuProcessCreatePretendCmd (qemu_process.c:5897)
==12618== by 0x4378C9: testCompareXMLToArgv (qemuxml2argvtest.c:498)
==12618== by 0x44D5A6: virTestRun (testutils.c:180)
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
For example when both total_bytes_sec and total_bytes_sec_max are set,
but the former gets cleaned due to new call setting, let's say,
read_bytes_sec, we end up with this weird message for the command:
$ virsh blkdeviotune fedora vda --read-bytes-sec 3000
error: Unable to change block I/O throttle
error: unsupported configuration: value 'total_bytes_sec_max' cannot be set if 'total_bytes_sec' is not set
So let's make it more descriptive. This is how it looks after the change:
$ virsh blkdeviotune fedora vda --read-bytes-sec 3000
error: Unable to change block I/O throttle
error: unsupported configuration: cannot reset 'total_bytes_sec' when 'total_bytes_sec_max' is set
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1344897
Signed-off-by: Martin Kletzander <mkletzan@redhat.com>
We were setting it based on whether it was supported and that lead to
setting it to NULL, which our JSON code caught. However it ended up
producing the following results:
$ virsh blkdeviotune fedora vda --total-bytes-sec-max 2000
error: Unable to change block I/O throttle
error: internal error: argument key 'group' must not have null value
Signed-off-by: Martin Kletzander <mkletzan@redhat.com>
Not only we should set the MTU on the host end of the device but
also let qemu know what MTU did we set.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
So far we allow to set MTU for libvirt networks. However, not all
domain interfaces have to be plugged into a libvirt network and
even if they are, they might want to have a different MTU (e.g.
for testing purposes).
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
We lacked of timestamp in tainting of guests log,
which bring troubles for finding guest issues:
such as whether a guest powerdown caused by qemu-monitor-command
or others issues inside guests.
If we had timestamp in tainting of guests log,
it would be helpful when checking guest's /var/log/messages.
Signed-off-by: Chen Hanxiao <chenhanxiao@gmail.com>
Based on work of Mehdi Abaakouk <sileht@sileht.net>.
When parsing vhost-user interface XML and no ifname is found we
can try to fill it in in post parse callback. The way this works
is we try to make up interface name from given socket path and
then ask openvswitch whether it knows the interface.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
The event needs to be emitted after the last monitor call, so that it's
not possible to find the device in the XML accidentally while the vm
object is unlocked.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1414393
Previously when QEMU failed "drive_add" due to an error opening
a file it would report
"could not open disk image"
These days though, QEMU reports
"Could not open '/tmp/virtd-test_e3hnhh5/disk1.qcow2': Permission denied"
which we were not detecting as an error condition.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Extract the call to qemuDomainSelectHotplugVcpuEntities outside of
qemuDomainSetVcpusLive and decide whether to hotplug or unplug the
entities specified by the cpumap using a boolean flag.
This will allow to use qemuDomainSetVcpusLive in cases where we prepare
the list of vcpus to enable or disable by other means.
In cases where CPU hotplug is supported by qemu force the monitor to
reject invalid or broken responses to 'query-cpus'. It's expected that
the command returns usable data in such case.
https://bugzilla.redhat.com/show_bug.cgi?id=1413922
While all the code that deals with qemu namespaces correctly
detects whether we are running as root (and turn into NO-OP for
qemu:///session) the actual unshare() call is not guarded with
such check. Therefore any attempt to start a domain under
qemu:///session shall fail as unshare() is reserved for root.
The fix consists of moving unshare() call (for which we have a
wrapper called virProcessSetupPrivateMountNS) into
qemuDomainBuildNamespace() where the proper check is performed.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
When running on s390 with a kernel that does not support cpu model checking and
with a Qemu new enough to support query-cpu-model-expansion, the gathering of qemu
capabilities will fail. Qemu responds to the query-cpu-model-expansion qmp
command with an error because the needed kernel ioct does not exist. When this
happens a guest cannot even be defined due to missing qemu capabilities data.
This patch fixes the problem by silently ignoring generic errors stemming from
calls to query-cpu-model-expansion.
Reported-by: Farhan Ali <alifm@linux.vnet.ibm.com>
Signed-off-by: Collin L. Walling <walling@linux.vnet.ibm.com>
Signed-off-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
When creating new /dev/* for qemu, we do chown() and copy ACLs to
create the exact copy from the original /dev. I though that
copying SELinux labels is not necessary as SELinux will chose the
sane defaults. Surprisingly, it does not leaving namespace with
the following labels:
crw-rw-rw-. root root system_u:object_r:tmpfs_t:s0 random
crw-------. root root system_u:object_r:tmpfs_t:s0 rtc0
drwxrwxrwt. root root system_u:object_r:tmpfs_t:s0 shm
crw-rw-rw-. root root system_u:object_r:tmpfs_t:s0 urandom
As a result, domain is unable to start:
error: internal error: process exited while connecting to monitor: Error in GnuTLS initialization: Failed to acquire random data.
qemu-kvm: cannot initialize crypto: Unable to initialize GNUTLS library: Failed to acquire random data.
The solution is to copy the SELinux labels as well.
Reported-by: Andrea Bolognani <abologna@redhat.com>
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
The query-cpu-model-expansion is currently implemented for s390(x) only
and all CPU properties it returns are booleans. However, x86
implementation will report more types of properties. Without making the
code more tolerant older libvirt would fail to probe newer QEMU
versions.
Signed-off-by: Jiri Denemark <jdenemar@redhat.com>
The qemuMonitorJSONParseCPUModelProperty function is a callback for
virJSONValueObjectForeachKeyValue and is called for each key/value pair,
thus it doesn't really make sense to check whether key is NULL.
Signed-off-by: Jiri Denemark <jdenemar@redhat.com>
So far the decision whether /dev/* entry is created in the qemu
namespace is really simple: does the path starts with "/dev/"?
This can be easily fooled by providing path like the following
(for any considered device like disk, rng, chardev, ..):
/dev/../var/lib/libvirt/images/disk.qcow2
Therefore, before making the decision the path should be
canonicalized.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
So far the namespaces were turned on by default unconditionally.
For all non-Linux platforms we provided stub functions that just
ignored whatever namespaces setting there was in qemu.conf and
returned 0 to indicate success. Moreover, we didn't really check
if namespaces are available on the host kernel.
This is suboptimal as we might have ignored user setting.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
This is a simple wrapper over mount(). However, not every system
out there is capable of moving a mount point. Therefore, instead
of having to deal with this fact in all the places of our code we
can have a simple wrapper and deal with this fact at just one
place.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Due to a copy-paste error, the debug message reads:
Setting up disks
It should have been:
Setting up inputs.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Surprisingly there was a virDomainPCIAddressReleaseAddr() function
already, but it was completely unused. Since we don't reserve entire
slots at once any more, there is no need to release entire slots
either, so we just replace the single call to
virDomainPCIAddressReleaseSlot() with a call to
virDomainPCIAddressReleaseAddr() and remove the now unused function.
The keen observer may be concerned that ...Addr() doesn't call
virDomainPCIAddressValidate(), as ...Slot() did. But really the
validation was pointless anyway - if the device hadn't been suitable
to be connected at that address, it would have failed validation
before every being reserved in the first place, so by definition it
will pass validation when it is being unplugged. (And anyway, even if
something "bad" happened and we managed to have a device incorrectly
at the given address, we would still want to be able to free it up for
use by a device that *did* validate properly).
This function is only called in two places, and the function itself is
just adding a single argument and calling
virDomainPCIAddressReserveNextAddr(), so we can remove it and instead
call virDomainPCIAddressReserveNextAddr() directly. (The main
motivation for doing this is to free up the name so that
qemuDomainPCIAddressReserveNextSlot() can be renamed in the next
patch, as its current name is now inaccurate and misleading).
All occurences of the former use fromConfig=true, and that's exactly
how virDomainPCIAddressReserveSlot() calls
virDomainPCIaddressReserveAddr(), so just use *Slot() so that *Addr()
can be made static to conf/domain_addr.c (both functions will be
renamed in upcoming patches).
fromConfig should be true if the caller wants
virDomainPCIAddressValidate() to loosen restrictions on its
interpretation of the pciConnectFlags. In particular, either
PCI_DEVICE or PCIE_DEVICE will be counted as equivalent to both, and
HOTPLUG will be ignored. In a few cases where libvirt was manually
overriding automatic address assignment, it was setting fromConfig to
false when validating the hardcoded manual override. This patch
changes those to fromConfig=true as a preemptive strike against any
future bugs that might otherwise surface.
Although setting virDomainPCIAddressReserveAddr()'s fromConfig=true is
correct when a PCI addres is coming from a domain's config, the *true*
purpose of the fromConfig argument is to lower restrictions on what
kind of device can plug into what kind of controller - if fromConfig
is true, then a PCIE_DEVICE can plug into a slot that is marked as
only compatible with PCI_DEVICE (and vice versa), and the HOTPLUG flag
is ignored.
For a long time there have been several calls to
virDomainPCIAddressReserveAddr() that have fromConfig incorrectly set
to false - it's correct that the addresses aren't coming from user
config, but they are coming from hardcoded exceptions in libvirt that
should, if anything, pay *even less* attention to following the
pciConnectFlags (under the assumption that the libvirt programmer knew
what they were doing).
See commit b87703cf7 for an example of an actual bug caused by the
incorrect setting of the "fromConfig" argument to
virDomainPCIAddressReserveAddr(). Although they haven't resulted in
any reported bugs, this patch corrects all the other incorrect
settings of fromConfig in calls to virDomainPCIAddressReserveAddr().
If a PCI device has VIR_PCI_CONNECT_AGGREGATE_SLOT set in its
pciConnectFlags, then during address assignment we allow multiple
instances of this type of device to be auto-assigned to multiple
functions on the same device. A slot is used for aggregating multiple
devices only if the first device assigned to that slot had
VIR_PCI_CONNECT_AGGREGATE_SLOT set. but any device types that have
AGGREGATE_SLOT set might be mix/matched on the same slot.
(NB: libvirt should never set the AGGREGATE_SLOT flag for a device
type that might need to be hotplugged. Currently it is only planned
for pcie-root-port and possibly other PCI controller types, and none
of those are hotpluggable anyway)
There aren't yet any devices that use this flag. That will be in a
later patch.
If there are multiple devices assigned to the different functions of a
single PCI slot, they will not work properly if the device at function
0 doesn't have its "multi" attribute turned on, so it makes sense for
libvirt to turn it on during PCI address assignment. Setting multi
then assures that the new setting is stored in the config (so it will
be used next time the domain is started), preventing any potential
problems in the case that a future change in the configuration
eliminates the devices on all non-0 functions (multi will still be set
for function 0 even though it is the only function in use on the slot,
which has no useful purpose, but also doesn't cause any problems).
(NB: If we were to instead just decide on the setting for
multifunction at runtime, a later removal of the non-0 functions of a
slot would result in a silent change in the guest ABI for the
remaining device on function 0 (although it may seem like an
inconsequential guest ABI change, it *is* a guest ABI change to turn
off the multi bit).)
setting reserveEntireSlot really accomplishes nothing - instead of
going to the trouble of computing the value for reserveEntireSlot and
then possibly setting *all* functions of the slot as in-use, we can
just set the in-use bit only for the specific function being used by a
device. Later we will know from the context (the PCI connect flags,
and whether we are reserving a specific address or asking for "the
next available") whether or not it is okay to allocate other functions
on the same slot.
Although it's not used yet, we allow specifying "-1" for the function
number when looking for the "next available slot" - this is going to
end up meaning "return the lowest available function in the slot, but
since we currently only provide a function from an otherwise unused
slot, "-1" ends up meaning "0".
When keeping track of which functions of which slots are allocated, we
will need to have more information than just the current bitmap with a
bit for each function that is currently stored for each slot in a
virDomainPCIAddressBus. To prepare for adding more per-slot info, this
patch changes "uint8_t slots" into "virDomainPCIAddressSlot slot", which
currently has a single member named "functions" that serves the same
purpose previously served directly by "slots".
This function is used only from code compiled on Linux. Therefore
on non-Linux platforms it triggers compilation error:
../../src/qemu/qemu_domain.c:209:1: error: unused function 'qemuDomainGetPreservedMounts' [-Werror,-Wunused-function]
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
For the blockjobs, where libvirt is able to track the state internally
we can fix locking of images we can remove the appropriate locks.
Also when doing a pivoting operation we should not acquire the lock on
any of those images since both are actually locked already.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1302168
Images that became the backing chain of the current image due to the
snapshot need to be unlocked in the lock manager. Also if qemu was
paused during the snapshot the current top level images need to be
released until qemu is resumed so that they can be acquired properly.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1191901
The code at first changed the definition and then rolled it back in case
of failure. This was ridiculous. Refactor the code so that the image in
the definition is changed only when the snapshot is successful.
The refactor will also simplify further fix of image locking when doing
snapshots.
Libvirt is able to properly model what happens to the backing chain
after a snapshot so there's no real need to redetect the data.
Additionally with the _REUSE_EXT flag this might end up in redetecting
wrong data if the user puts wrong backing chain reference into the
snapshot image.
Again, there is no need to create /var/lib/libvirt/$domain.*
directories in CreateNamespace(). It is sufficient to create them
as soon as we need them which is in BuildNamespace. This way we
don't leave them around for the whole lifetime of domain.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
The c1140eb9e got me thinking. We don't want to special case /dev
in qemuDomainGetPreservedMounts(), but in all other places in the
code we special case it anyway. I mean,
/var/run/libvirt/$domain.dev path is constructed separately just
so that it is not constructed here. It makes only a little sense
(if any at all).
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
If something goes wrong in this function we try a rollback. That
is unlink all the directories we created earlier. For some weird
reason unlink() was called instead of rmdir().
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
So far if qemu is spawned under separate mount namespace in order
to relabel everything it needs an access to the security driver
to run in that namespace too. This has a very nasty down side -
it is being run in a separate process, so any internal state
transition is NOT reflected in the daemon. This can lead to many
sleepless nights. Therefore, use the transaction APIs so that
libvirt developers can sleep tight again.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
The code at the very bottom of the DAC secdriver that calls
chown() should be fine with read-only data. If something needs to
be prepared it should have been done beforehand.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
virtio-pci is the way forward for aarch64 guests: it's faster
and less alien to people coming from other architectures.
Now that guest support is finally getting there (Fedora 24,
CentOS 7.3, Ubuntu 16.04 and Debian testing all support
virtio-pci out of the box), we'd like to start using it by
default instead of virtio-mmio.
Users and applications can already opt-in by explicitly using
<address type='pci'/>
inside the relevant elements, but that's kind of cumbersome and
requires all users and management applications to adapt, which
we'd really like to avoid.
What we can do instead is use virtio-mmio only if the guest
already has at least one virtio-mmio device, and use virtio-pci
in all other situations.
That means existing virtio-mmio guests will keep using the old
addressing scheme, and new guests will automatically be created
using virtio-pci instead. Users can still override the default
in either direction.
Existing tests such as aarch64-aavmf-virtio-mmio and
aarch64-virtio-pci-default already cover all possible
scenarios, so no additions to the test suites are necessary.
When coldplugging vcpus to a VM that already has a few hotpluggable
vcpus the code might generate invalid configuration as
non-hotpluggable cpus need to be clustered starting from vcpu 0.
This fix forces the added vcpus to be hotpluggable in such case.
Fixes a corner case described in:
https://bugzilla.redhat.com/show_bug.cgi?id=1370357
This patch adds support and documentation for
a generalized hardware cache event called cache_l1d
perf event.
Signed-off-by: Nitesh Konkar <nitkon12@linux.vnet.ibm.com>
When changing the metadata via virDomainSetMetadata, we now
emit an event to notify the app of changes. This is useful
when co-ordinating different applications read/write of
custom metadata.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Separate out the "policy=discard" into it's own specific
qemu command line.
We'll rename "kvm-pit-device" test case to be "kvm-pit-discard"
since it has the syntax we'd be using.
Signed-off-by: Maxim Nestratov <mnestratov@virtuozzo.com>
By a mistake, for the VIR_DOMAIN_TIMER_TICKPOLICY_DELAY qemu
command line creation, 'discard' was used instead of 'delay'
in commit id '1569fa14'.
Test "kvm-pit-delay" is fixed accordingly to show the correct
option being generated.
Remove the (now) redundant kvm-pit-device tests. As it turns
out there is no need to specify both QEMU_CAPS_NO_KVM_PIT and
QEMU_CAPS_KVM_PIT_TICK_POLICY since they are mutually exclusive
and "kvm-pit-device" becomes just the same as "kvm-pit-delay".
Signed-off-by: Maxim Nestratov <mnestratov@virtuozzo.com>
Qemu has abandoned the +/-feature syntax in favor of key=value. Some
architectures (s390) do not support +/-feature. So we update libvirt to handle
both formats.
If we detect a sufficiently new Qemu (indicated by support for qmp
query-cpu-model-expansion) we use key=value else we fall back to +/-feature.
Signed-off-by: Collin L. Walling <walling@linux.vnet.ibm.com>
Signed-off-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
When qmp query-cpu-model-expansion is available probe Qemu for its view of the
host model. In kvm environments this can provide a more complete view of the
host model because features supported by Qemu and Kvm can be considered.
Signed-off-by: Collin L. Walling <walling@linux.vnet.ibm.com>
Signed-off-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
query-cpu-model-expansion is used to get a list of features for a given cpu
model name or to get the model and features of the host hardware/environment
as seen by Qemu/kvm.
Signed-off-by: Collin L. Walling <walling@linux.vnet.ibm.com>
Signed-off-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
Just so it doesn't bite us in the future, even though it's unlikely.
And fix the comment above it as well. Commit e08ee7cd34 took the
info from the function it's calling, but that was lie itself in the
first place.
Signed-off-by: Martin Kletzander <mkletzan@redhat.com>
With my namespace patches, we are spawning qemu in its own
namespace so that we can manage /dev entries ourselves. However,
some filesystems mounted under /dev needs to be preserved in
order to be shared with the parent namespace (e.g. /dev/pts).
Currently, the list of mount points to preserve is hardcoded
which ain't right - on some systems there might be less or more
items under real /dev that on our list. The solution is to parse
/proc/mounts and fetch the list from there.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
If we restart libvirtd while VM was doing external memory snapshot, VM's
state be updated to paused as a result of running a migration-to-file
operation, and then VM will be left as paused state. In this case we must
restart the VM's CPUs to resume it.
Signed-off-by: Wang King <king.wang@huawei.com>
Commit 4b951d1e38 missed the fact that the
VM needs to be resumed after a live external checkpoint (memory
snapshot) where the cpus would be paused by the migration rather than
libvirt.
Again, not something that I'd hit, but there is a chance in
theory that this might bite us. Currently the way we decide
whether or not to create /dev entry for a device is by marching
first four characters of path with "/dev". This might be not
enough. Just imagine somebody has a disk image stored under
"/devil/path/to/disk". We ought to be matching against "/dev/".
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Not that I'd encounter any bug here, but the code doesn't look
100% correct. Imagine, somebody is trying to attach a device to a
domain, and the device's /dev entry already exists in the qemu
namespace. This is handled gracefully and the control continues
with setting up ACLs and calling security manager to set up
labels. Now, if any of these steps fail, control jump on the
'cleanup' label and unlink() the file straight away. Even when it
was not us who created the file in the first place. This can be
possibly dangerous.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
https://bugzilla.redhat.com/show_bug.cgi?id=1406837
Imagine you have a domain configured in such way that you are
assigning two PCI devices that fall into the same IOMMU group.
With mount namespace enabled what happens is that for the first
PCI device corresponding /dev/vfio/X entry is created and when
the code tries to do the same for the second mknod() fails as
/dev/vfio/X already exists:
2016-12-21 14:40:45.648+0000: 24681: error :
qemuProcessReportLogError:1792 : internal error: Process exited
prior to exec: libvirt: QEMU Driver error : Failed to make device
/var/run/libvirt/qemu/windoze.dev//vfio/22: File exists
Worse, by default there are some devices that are created in the
namespace regardless of domain configuration (e.g. /dev/null,
/dev/urandom, etc.). If one of them is set as backend for some
guest device (e.g. rng, chardev, etc.) it's the same story as
described above.
Weirdly, in attach code this is already handled.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
https://bugzilla.redhat.com/show_bug.cgi?id=1405269
If a secret was not provided for what was determined to be a LUKS
encrypted disk (during virStorageFileGetMetadata processing when
called from qemuDomainDetermineDiskChain as a result of hotplug
attach qemuDomainAttachDeviceDiskLive), then do not attempt to
look it up (avoiding a libvirtd crash) and do not alter the format
to "luks" when adding the disk; otherwise, the device_add would
fail with a message such as:
"unable to execute QEMU command 'device_add': Property 'scsi-hd.drive'
can't find value 'drive-scsi0-0-0-0'"
because of assumptions that when the format=luks that libvirt would have
provided the secret to decrypt the volume.
Access to unlock the volume will thus be left to the application.
virQEMUCapsSupportsChardev existing checks returns true
for spapr-vty alone. Instead verify spapr-vty validity
and let the logic to return true for other device types
so that virtio-console passes.
The non-pseries machines dont have spapr-vio-bus. So, the
function always returned false for them before.
Fixes - https://bugzilla.redhat.com/show_bug.cgi?id=1257813
Signed-off-by: Shivaprasad G Bhat <sbhat@linux.vnet.ibm.com>
According to commit id '0282ca45a' the 'physical' value should
essentially be the last offset of the image or the host physical
size in bytes of the image container. However, commit id '15fa84ac'
refactored the GetBlockInfo to use the same returned data as the
GetStatsBlock API for an active domain. For the 'entry->physical'
that would end up being the "actual-size" as set through the
qemuMonitorJSONBlockStatsUpdateCapacityOne (commit '7b11f5e5').
Digging deeper into QEMU code one finds that actual_size is
filled in using the same algorithm as GetBlockInfo has used for
setting the 'allocation' field when the domain is inactive.
The difference in values is seen primarily in sparse raw files
and other container type files (such as qcow2), which will return
a smaller value via the stat API for 'st_blocks'. Additionally
for container files, the 'capacity' field (populated via the
QEMU "virtual-size" value) may be slightly different (smaller)
in order to accomodate the overhead for the container. For
sparse files, the state 'st_size' field is returned.
This patch thus alters the allocation and physical values for
sparse backed storage files to be more appropriate to the API
contract. The result for GetBlockInfo is the following:
capacity: logical size in bytes of the image (how much storage
the guest will see)
allocation: host storage in bytes occupied by the image (such
as highest allocated extent if there are no holes,
similar to 'du')
physical: host physical size in bytes of the image container
(last offset, similar to 'ls')
NB: The GetStatsBlock API allows a different contract for the
values:
"block.<num>.allocation" - offset of the highest written sector
as unsigned long long.
"block.<num>.capacity" - logical size in bytes of the block device
backing image as unsigned long long.
"block.<num>.physical" - physical size in bytes of the container
of the backing image as unsigned long long.
Disk->info is not live updatable so add a check for this. Otherwise
libvirt reports success even though no data was updated.
Signed-off-by: Marc Hartmayer <mhartmay@linux.vnet.ibm.com>
Reviewed-by: Bjoern Walk <bwalk@linux.vnet.ibm.com>
Reviewed-by: Boris Fiuczynski <fiuczy@linux.vnet.ibm.com>
External disk-only snapshots with recent enough qemu don't require
libvirt to pause the VM. The logic determining when to resume cpus was
slightly flawed and attempted to resume them even if they were not
paused by the snapshot code. This normally was not a problem, but with
locking enabled the code would attempt to acquire the lock twice.
The fallout of this bug would be a error from the API, but the actual
snapshot being created. The bug was introduced with when adding support
for external snapshots with memory (checkpoints) in commit f569b87.
Resolves problems described by:
https://bugzilla.redhat.com/show_bug.cgi?id=1403691
After qemu delivers the resume event it's already running and thus it's
too late to enter lockspaces since it may already have modified the
disk. The code only creates false log entries in the case when locking
is enabled. The lockspace needs to be acquired prior to starting cpus.
Given how intrusive previous patches are, it might happen that
there's a bug or imperfection. Lets give users a way out: if they
set 'namespaces' to an empty array in qemu.conf the feature is
suppressed.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
When attaching a device to a domain that's using separate mount
namespace we must maintain /dev entries in order for qemu process
to see them.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
When attaching a device to a domain that's using separate mount
namespace we must maintain /dev entries in order for qemu process
to see them.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
When attaching a device to a domain that's using separate mount
namespace we must maintain /dev entries in order for qemu process
to see them.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
When attaching a device to a domain that's using separate mount
namespace we must maintain /dev entries in order for qemu process
to see them.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Instead of trying to fix our security drivers, we can use a
simple trick to relabel paths in both namespace and the host.
I mean, if we enter the namespace some paths are still shared
with the host so any change done to them is visible from the host
too.
Therefore, we can just enter the namespace and call
SetAllLabel()/RestoreAllLabel() from there. Yes, it has slight
overhead because we have to fork in order to enter the namespace.
But on the other hand, no complexity is added to our code.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
When starting a domain and separate mount namespace is used, we
have to create all the /dev entries that are configured for the
domain.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
When starting a domain and separate mount namespace is used, we
have to create all the /dev entries that are configured for the
domain.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
When starting a domain and separate mount namespace is used, we
have to create all the /dev entries that are configured for the
domain.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
When starting a domain and separate mount namespace is used, we
have to create all the /dev entries that are configured for the
domain.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
When starting a domain and separate mount namespace is used, we
have to create all the /dev entries that are configured for the
domain.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
When starting a domain and separate mount namespace is used, we
have to create all the /dev entries that are configured for the
domain.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Prime time. When it comes to spawning qemu process and
relabelling all the devices it's going to touch, there's inherent
race with other applications in the system (e.g. udev). Instead
of trying convincing udev to not touch libvirt managed devices,
we can create a separate mount namespace for the qemu, and mount
our own /dev there. Of course this puts more work onto us as we
have to maintain /dev files on each domain start and device
hot(un-)plug. On the other hand, this enhances security also.
From technical POV, on domain startup process the parent
(libvirtd) creates:
/var/lib/libvirt/qemu/$domain.dev
/var/lib/libvirt/qemu/$domain.devpts
The child (which is going to be qemu eventually) calls unshare()
to create new mount namespace. From now on anything that child
does is invisible to the parent. Child then mounts tmpfs on
$domain.dev (so that it still sees original /dev from the host)
and creates some devices (as explained in one of the previous
patches). The devices have to be created exactly as they are in
the host (including perms, seclabels, ACLs, ...). After that it
moves $domain.dev mount to /dev.
What's the $domain.devpts mount there for then you ask? QEMU can
create PTYs for some chardevs. And historically we exposed the
host ends in our domain XML allowing users to connect to them.
Therefore we must preserve devpts mount to be shared with the
host's one.
To make this patch as small as possible, creating of devices
configured for domain in question is implemented in next patches.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
This is a list of devices that qemu needs for its run (apart from
what's configured for domain). The devices on the list are
enabled in the CGroups by default so they will be good candidates
for initial /dev for new qemu.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Using a variable named 'stat' clashes with the system function
'stat()' causing compiler warnings on some platforms
cc1: warnings being treated as errors
../../src/qemu/qemu_monitor_text.c: In function 'parseMemoryStat':
../../src/qemu/qemu_monitor_text.c:604: error: declaration of 'stat' shadows a global declaration [-Wshadow]
/usr/include/sys/stat.h:455: error: shadowed declaration is here [-Wshadow]
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
If the cpuset cgroup controller is disabled in /etc/libvirt/qemu.conf
QEMU virtual machines can in principle use all host CPUs, even if they
are hot plugged, if they have no explicit CPU affinity defined.
However, there's libvirt code supposed to handle the situation where
the libvirt daemon itself is not using all host CPUs. The code in
qemuProcessInitCpuAffinity attempts to set an affinity mask including
all defined host CPUs. Unfortunately, the resulting affinity mask for
the process will not contain the offline CPUs. See also the
sched_setaffinity(2) man page.
That means that even if the host CPUs come online again, they won't be
used by the QEMU process anymore. The same is true for newly hot
plugged CPUs. So we are effectively preventing that QEMU uses all
processors instead of enabling it to use them.
It only makes sense to set the QEMU process affinity if we're able
to actually grow the set of usable CPUs, i.e. if the process affinity
is a subset of the online host CPUs.
There's still the chance that for some reason the deliberately chosen
libvirtd affinity matches the online host CPU mask by accident. In this
case the behavior remains as it was before (CPUs offline while setting
the affinity will not be used if they show up later on).
Signed-off-by: Viktor Mihajlovski <mihajlov@linux.vnet.ibm.com>
Tested-by: Matthew Rosato <mjrosato@linux.vnet.ibm.com>
virQEMUCapsFindTarget is supposed to find an alternative QEMU binary if
qemu-system-$GUEST_ARCH doesn't exist. The alternative is using host
architecture when it is compatible with $GUEST_ARCH. But a special
treatment has to be applied for ppc64le since the QEMU binary is always
called qemu-system-ppc64.
Broken by me in v2.2.0-171-gf2e71550d.
https://bugzilla.redhat.com/show_bug.cgi?id=1403745
Signed-off-by: Jiri Denemark <jdenemar@redhat.com>
qemuAgentNotifyEvent accesses monitor structure and is called on qemu
reset/shutdown/suspend events under domain lock. Other monitor
functions on the other hand take monitor lock and don't hold domain lock.
Thus it is possible to have risky simultaneous access to the structure
from 2 threads. Let's take monitor lock here to make access exclusive.
In case of 0 filesystems *info is not set while according
to virDomainGetFSInfo contract user should call free on it even
in case of 0 filesystems. Thus we need to properly set
it. NULL will be enough as free eats NULLs ok.
The libvirt-domain.h documentation indicates that for a qcow2 file
in a filesystem being used for a backing store should report the disk
space occupied by a file; however, commit id '15fa84ac' altered the
code to trust that the wr_highest_offset should be used whenever
wr_highest_offset_valid was set.
As it turns out this will lead to indeterminite results. For an active
domain when qemu hasn't yet had the need to find the wr_highest_offset
value, qemu will report 0 even though qemu-img will report the proper
disk size. This causes reporting of the following XML:
<disk type='file' device='disk'>
<driver name='qemu' type='qcow2'/>
<source file='/path/to/test-1g.qcow2'/>
to be as follows:
Capacity: 1073741824
Allocation: 0
Physical: 1074139136
with qemu-img indicating:
image: /path/to/test-1g.qcow2
file format: qcow2
virtual size: 1.0G (1073741824 bytes)
disk size: 1.0G
Once the backing source file is opened on the guest, then wr_highest_offset
is updated, but only to the high water mark and not the size of the file.
This patch will adjust the logic to check for the file backed qcow2 image
and enforce setting the allocation to the returned 'physical' value, which
is the 'actual-size' value from a 'query-block' operation.
NB: The other consumer of the wr_highest_offset output (GetAllDomainStats)
has a contract that indicates 'allocation' is the offset of the highest
written sector, so it doesn't need adjustment.
Signed-off-by: John Ferlan <jferlan@redhat.com>
Instead of having duplicated code in qemuStorageLimitsRefresh and
virStorageBackendUpdateVolTargetInfo to get capacity specific data
about the storage backing source or volume -- create a common API
to handle the details for both.
As a side effect, virStorageFileProbeFormatFromBuf returns to being
a local/static helper to virstoragefile.c
For the QEMU code - if the probe is done, then the format is saved so
as to avoid future such probes.
For the storage backend code, there is no need to deal with the probe
since we cannot call the new API if target->format == NONE.
Signed-off-by: John Ferlan <jferlan@redhat.com>
Instead of having duplicated code in qemuStorageLimitsRefresh and
virStorageBackendUpdateVolTargetInfoFD to fill in the storage backing
source or volume allocation, capacity, and physical values - create a
common API that will handle the details for both.
The common API will fill in "default" capacity values as well - although
those more than likely will be overridden by subsequent code. Having just
one place to make the determination of what the values should be will
make things be more consistent.
For the QEMU code - the data filled in will be for inactive domains
for the GetBlockInfo and DomainGetStatsOneBlock API's. For the storage
backend code - the data will be filled in during the volume updates.
Signed-off-by: John Ferlan <jferlan@redhat.com>
Commit id '8dc27259' introduced virStorageSourceUpdateBlockPhysicalSize
in order to retrieve the physical size for a block backed source device
for an active domain since commit id '15fa84ac' changed to use the
qemuMonitorGetAllBlockStatsInfo and qemuMonitorBlockStatsUpdateCapacity
API's to (essentially) retrieve the "actual-size" from a 'query-block'
operation for the source device.
However, the code only was made functional for a BLOCK backing type
and it neglected to use qemuOpenFile, instead using just open. After
the open the block lseek would find the end of the block and set the
physical value, close the fd and return.
Since the code would return 0 immediately if the source device wasn't
a BLOCK backed device, the physical would be displayed incorrectly,
such as follows in domblkinfo for a file backed source device:
Capacity: 1073741824
Allocation: 0
Physical: 0
This patch will modify the algorithm to get the physical size for other
backing types and it will make use of the qemuDomainStorageOpenStat
helper in order to open/stat the source file depending on its type.
The qemuDomainGetStatsOneBlock will no longer inhibit printing errors,
but it will still ignore them leaving the physical value set to 0.
Signed-off-by: John Ferlan <jferlan@redhat.com>
Split out the opening of the file and fetch of the stat buffer into a
helper qemuDomainStorageOpenStat. This will handle either opening the
local or remote storage.
Additionally split out the cleanup of that into a separate helper
qemuDomainStorageCloseStat which will either close the file or
call the virStorageFileDeinit function.
Signed-off-by: John Ferlan <jferlan@redhat.com>
Originally added by commit id '89646e69' prior to commit id '15fa84ac'
and '71d2c172' which ensured that qemuStorageLimitsRefresh was only called
for inactive domains.
Adjust the comment describing the need for FIXME and move all the text
to the function description.
Signed-off-by: John Ferlan <jferlan@redhat.com>
In preparation to the code move to virnetdevtap.c, this change:
* renames virNetInterfaceStats to virNetDevTapInterfaceStats
* changes 'path' to 'ifname', to use the same vocable as other
method in virnetdevtap.c.
* Add the attributes checker
When vhostuser interfaces are used, the interface statistics
are not available in /proc/net/dev.
This change looks at the openvswitch interfaces statistics
tables to provide this information for vhostuser interface.
Note that in openvswitch world drop/error doesn't always make sense
for some interface type. When these informations are not available we
set them to 0 on the virDomainInterfaceStats.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
There's nothing to compress if the requested snapshot memory format is
set to 'raw' explicitly. After commit 9e14689ea libvirt would try to
run /sbin/raw to process the memory stream if the qemu.conf option
snapshot_image_format is set.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1402726
Since its introduction in 2012 this internal API did nothing.
Moreover we have the same API that does exactly the same:
virSecurityManagerDomainSetPathLabel.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
If you've ever tried running a huge page backed guest under
different user than in qemu.conf, you probably failed. Problem is
even though we have corresponding APIs in the security drivers,
there's no implementation and thus we don't relabel the huge page
path. But even if we did, so far all of the domains share the
same path:
/hugepageMount/libvirt/qemu
Our only option there would be to set 0777 mode on the qemu dir
which is totally unsafe. Therefore, we can create dir on
per-domain basis, i.e.:
/hugepageMount/libvirt/qemu/domainName
and chown domainName dir to the user that domain is configured to
run under.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
So far this function takes virDomainObjPtr which:
1) is an overkill,
2) might be not available in all the places we will use it.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Qemu 2.8.0+ changes arguments structure for blockdev-add in the effort
to make it finally stable. Since libvirt recently added the detection of
gluster debug support relying on the old syntax we need to add the new
as well.
With current perf framework, this patch adds support and documentation
for the branch_instructions perf event.
Signed-off-by: Nitesh Konkar <nitkon12@linux.vnet.ibm.com>
Add in the block I/O throttling group parameter to the command line
if supported. If not supported, fail command creation.
Add the xml2argvtest for testing.
Signed-off-by: John Ferlan <jferlan@redhat.com>
Rather than have multiple bool values, create a single enum with bits
representing what fields are set. Fields are generally set in groups
of 3 (read, write, total).
Currently we build the JSON object for the "block_set_io_throttle"
command using the knowledge that a NULL for a support*Options boolean
would essentially ignore the rest of the arguments.
This may not work properly if some capability was backported, plus it just
looks rather ugly. So instead, build the "base" arguments and then if
the support*Option bool capability is set, add in the arguments on the fly.
Then append those arguments to the basic command and send to qemu.
Rather than using negative logic and setting the maxparams to a lesser
value based on which capabilities exist, alter the logic to modify the
maxparams based on a base value plus the found capabilities. Reduces the
chance that some backported feature produces an incorrect value.
Two reasons:
1.in none hotplug, we will pass it. We can see from libvirt function
qemuBuildVhostuserCommandLine
2.qemu will use this vetcor num to init msix table. If we don't pass, qemu
will use default value, this will cause VM can only use default value
interrupts at most.
Signed-off-by: gaohaifeng <gaohaifeng.gao@huawei.com>
Consider the following XML snippets:
$ cat scsicontroller.xml
<controller type='scsi' model='virtio-scsi' index='0'/>
$ cat scsihostdev.xml
<hostdev mode='subsystem' type='scsi'>
<source>
<adapter name='scsi_host0'/>
<address bus='0' target='8' unit='1074151456'/>
</source>
</hostdev>
If we create a guest that includes the contents of scsihostdev.xml,
but forget the virtio-scsi controller described in scsicontroller.xml,
one is silently created for us. The same holds true when attaching
a hostdev before the matching virtio-scsi controller.
(See qemuDomainFindOrCreateSCSIDiskController for context.)
Detaching the hostdev, followed by the controller, works well and the
guest behaves appropriately.
If we detach the virtio-scsi controller device first, any associated
hostdevs are detached for us by the underlying virtio-scsi code (this
is fine, since the connection is broken). But all is not well, as the
guest is unable to receive new virtio-scsi devices (the attach commands
succeed, but devices never appear within the guest), nor even be
shutdown, after this point.
While this is not libvirt's problem, we can prevent falling into this
scenario by checking if a controller is being used by any hostdev
devices. The same is already done for disk elements today.
Applying this patch and then using the XML snippets from earlier:
$ virsh detach-device guest_01 scsicontroller.xml
error: Failed to detach device from scsicontroller.xml
error: operation failed: device cannot be detached: device is busy
$ virsh detach-device guest_01 scsihostdev.xml
Device detached successfully
$ virsh detach-device guest_01 scsicontroller.xml
Device detached successfully
Signed-off-by: Eric Farman <farman@linux.vnet.ibm.com>
Reviewed-by: Bjoern Walk <bwalk@linux.vnet.ibm.com>
Reviewed-by: Boris Fiuczynski <fiuczy@linux.vnet.ibm.com>
Although nearly all host devices that are assigned to guests using
VFIO ("<hostdev>" devices in libvirt) are physically PCI Express
devices, until now libvirt's PCI address assignment has always
assigned them addresses on legacy PCI controllers in the guest, even
if the guest's machinetype has a PCIe root bus (e.g. q35 and
aarch64/virt).
This patch tries to assign them to an address on a PCIe controller
instead, when appropriate. First we do some preliminary checks that
might allow setting the flags without doing any extra work, and if
those conditions aren't met (and if libvirt is running privileged so
that it has proper permissions), we perform the (relatively) time
consuming task of reading the device's PCI config to see if it is an
Express device. If this is successful, the connect flags are set based
on the result, but if we aren't able to read the PCI config (most
likely due to the device not being present on the system at the time
of the check) we assume it is (or will be) an Express device, since
that is almost always the case anyway.
If libvirtd is running unprivileged, it can open a device's PCI config
data in sysfs, but can only read the first 64 bytes. But as part of
determining whether a device is Express or legacy PCI,
qemuDomainDeviceCalculatePCIConnectFlags() will be updated in a future
patch to call virPCIDeviceIsPCIExpress(), which tries to read beyond
the first 64 bytes of the PCI config data and fails with an error log
if the read is unsuccessful.
In order to avoid creating a parallel "quiet" version of
virPCIDeviceIsPCIExpress(), this patch passes a virQEMUDriverPtr down
through all the call chains that initialize the
qemuDomainFillDevicePCIConnectFlagsIterData, and saves the driver
pointer with the rest of the iterdata so that it can be used by
qemuDomainDeviceCalculatePCIConnectFlags(). This pointer isn't used
yet, but will be used in an upcoming patch (that detects Express vs
legacy PCI for VFIO assigned devices) to examine driver->privileged.
Restarting libvirtd on the source host at the end of migration when a
domain is already running on the destination would cause image labels to
be reset effectively killing the domain. Commit e8d0166e1d fixed similar
issue on the destination host, but kept the source always resetting the
labels, which was mostly correct except for the specific case handled by
this patch.
https://bugzilla.redhat.com/show_bug.cgi?id=1343858
Signed-off-by: Jiri Denemark <jdenemar@redhat.com>
Post-copy migration needs bi-directional communication between the
source and the destination QEMU processes, which is not supported by
tunnelled migration.
https://bugzilla.redhat.com/show_bug.cgi?id=1371358
Signed-off-by: Jiri Denemark <jdenemar@redhat.com>
Thanks to the complex capability caching code virQEMUCapsProbeQMP was
never called when we were starting a new qemu VM. On the other hand,
when we are reconnecting to the qemu process we reload the capability
list from the status XML file. This means that the flag preventing the
function being called was not set and thus we partially reprobed some of
the capabilities.
The recent addition of CPU hotplug clears the
QEMU_CAPS_QUERY_HOTPLUGGABLE_CPUS if the machine does not support it.
The partial re-probe on reconnect results into attempting to call the
unsupported command and then killing the VM.
Remove the partial reprobe and depend on the stored capabilities. If it
will be necessary to reprobe the capabilities in the future, we should
do a full reprobe rather than this partial one.
QEMU 2.8.0 adds support for unavailable-features in
query-cpu-definitions reply. The unavailable-features array lists CPU
features which prevent a corresponding CPU model from being usable on
current host. It can only be used when all the unavailable features are
disabled. Empty array means the CPU model can be used without
modifications.
We can use unavailable-features for providing CPU model usability info
in domain capabilities XML:
<domainCapabilities>
...
<cpu>
<mode name='host-passthrough' supported='yes'/>
<mode name='host-model' supported='yes'>
<model fallback='allow'>Skylake-Client</model>
...
</mode>
<mode name='custom' supported='yes'>
<model usable='yes'>qemu64</model>
<model usable='yes'>qemu32</model>
<model usable='no'>phenom</model>
<model usable='yes'>pentium3</model>
<model usable='yes'>pentium2</model>
<model usable='yes'>pentium</model>
<model usable='yes'>n270</model>
<model usable='yes'>kvm64</model>
<model usable='yes'>kvm32</model>
<model usable='yes'>coreduo</model>
<model usable='yes'>core2duo</model>
<model usable='no'>athlon</model>
<model usable='yes'>Westmere</model>
<model usable='yes'>Skylake-Client</model>
...
</mode>
</cpu>
...
</domainCapabilities>
Signed-off-by: Jiri Denemark <jdenemar@redhat.com>
"host" CPU model is supported by a special host-passthrough CPU mode and
users is not allowed to specify this model directly with custom mode.
Thus we should not advertise "host" CPU model in domain capabilities.
This worked well on architectures for which libvirt provides a list of
supported CPU models in cpu_map.xml (since "host" is not in the list).
But we need to explicitly filter "host" model out for all other
architectures.
Signed-off-by: Jiri Denemark <jdenemar@redhat.com>
CPU models (and especially some additional details which we will start
probing for later) differ depending on the accelerator. Thus we need to
call query-cpu-definitions in both KVM and TCG mode to get all data we
want.
Tests in tests/domaincapstest.c are temporarily switched to TCG to avoid
having to squash even more stuff into this single patch. They will all
be switched back later in separate commits.
Signed-off-by: Jiri Denemark <jdenemar@redhat.com>
This patch moves the CPU models formatting code from
virQEMUCapsFormatCache into a separate function.
Signed-off-by: Jiri Denemark <jdenemar@redhat.com>