Some secdrivers (typically SELinux driver) generate unique
dynamic seclabel for each domain (unless a static one is
requested in domain XML). This is achieved by calling
qemuSecurityGenLabel() from qemuProcessPrepareDomain() which
allocates unique seclabel and stores it in domain def->seclabels.
The counterpart is qemuSecurityReleaseLabel() which releases the
label and removes it from def->seclabels. Problem is, that with
current code the qemuProcessStop() may still want to use the
seclabel after it was released, e.g. when it wants to restore the
label of a disk mirror.
What is happening now, is that in qemuProcessStop() the
qemuSecurityReleaseLabel() is called, which removes the SELinux
seclabel from def->seclabels, yada yada yada and eventually
qemuSecurityRestoreImageLabel() is called. This bubbles down to
virSecuritySELinuxRestoreImageLabelSingle() which find no SELinux
seclabel (using virDomainDefGetSecurityLabelDef()) and this
returns early doing nothing.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1751664
Fixes: 8fa0374c5b
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
The kernel refuses to set guest TSC frequency less than a minimum
frequency or greater than maximum frequency (both computed based on the
host TSC frequency). When writing the libvirt code with a reversed logic
(return success when the requested frequency falls within the tolerance
interval) I forgot to include the boundaries.
Fixes: d8e5b45600https://bugzilla.redhat.com/show_bug.cgi?id=1839095
Signed-off-by: Jiri Denemark <jdenemar@redhat.com>
Reviewed-by: Peter Krempa <pkrempa@redhat.com>
When starting a VM with an empty cdrom which has <iotune> configured the
startup fails as qemu is not happy about setting tuning for an empty
drive:
error: internal error: unable to execute 'block_set_io_throttle', unexpected error: 'Device has no medium'
Resolve this by skipping the setting of throttling for empty drives and
updating the throttling when new medium is inserted into the drive.
Resolves: https://gitlab.com/libvirt/libvirt/-/issues/111
Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Simplify ReserveName/GenerateName for macvlan and macvtap by using
common functions.
Signed-off-by: Shi Lei <shi_lei@massclouds.com>
Reviewed-by: Laine Stump <laine@redhat.com>
Simplify GenerateName/ReserveName for netdevtap by using common
functions.
Signed-off-by: Shi Lei <shi_lei@massclouds.com>
Reviewed-by: Laine Stump <laine@redhat.com>
Move virDomainDeviceDefValidate() and all its helper functions to
domain_validate.c.
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
virDomainDefPostParse infrastructure has apart from the global opaque
data also per-run data, but this was not duplicated into the validation
callbacks.
This is important when drivers want to use correct run-state for the
validation.
Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
Nodename may be asociated to a disk backup job, add support to looking
up in that chain too. This is specifically useful for the
BLOCK_WRITE_THRESHOLD event which can be registered for any nodename.
Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
In v6.0.0-rc1~439 (and friends) we tried to cache NUMA
capabilities because we assumed they are immutable. And to some
extent they are (NUMA hotplug is not a thing, is it). However,
our capabilities contain also some runtime info that can change,
e.g. hugepages pool allocation sizes or total amount of memory
per node (host side memory hotplug might change the value).
Because of the caching we might not be reporting the correct
runtime info in 'virsh capabilities'.
The NUMA caps are used in three places:
1) 'virsh capabilities'
2) domain startup, when parsing numad reply
3) parsing domain private data XML
In cases 2) and 3) we need NUMA caps to construct list of
physical CPUs that belong to NUMA nodes from numad reply. And
while this may seem static, it's not really because of possible
CPU hotplug on physical host.
There are two possible approaches:
1) build a validation mechanism that would invalidate the
cached NUMA caps, or
2) drop the caching and construct NUMA caps from scratch on
each use.
In this commit, the latter approach is implemented, because it's
easier.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1819058
Fixes: 1a1d848694
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
qemuBuildCommandLine() is calling qemuDomainAlignMemorySizes(),
which is an operation that changes live XML and domain and has
little to do with the command line build process.
Move it to qemuProcessPrepareDomain() where we're supposed to
make live XML and domain changes before launch. qemuProcessStart()
is setting VIR_QEMU_PROCESS_START_NEW if !migrate && !snapshot,
same conditions used in qemuBuildCommandLine() to call
qemuDomainAlignMemorySizes(), making this change seamless.
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
qemuProcessCreatePretendCmdPrepare() is setting the
VIR_QEMU_PROCESS_START_NEW regardless of whether this is
a migration case or not. This behavior differs from what we're
doing in qemuProcessStart(), where the flag is set only
if !migrate && !snapshot.
Fix it by making the flag setting consistent with what we're
doing in qemuProcessStart().
Reviewed-by: Andrea Bolognani <abologna@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Let's pass along / fill @niothreads rather than trying to make dual
use as a return value and thread count.
This resolves a Coverity issue detected in qemuDomainGetIOThreadsMon
where if qemuDomainObjExitMonitor failed, then a -1 was returned and
overwrite @niothreads causing a memory leak.
Signed-off-by: John Ferlan <jferlan@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Glib provides g_auto(GStrv) which is in-place replacement of our
VIR_AUTOSTRINGLIST.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
Some CPUs provide a way to read exact TSC frequency, while measuring it
is required on other CPUs. However, measuring is never exact and the
result may slightly differ across reboots. For this reason both Linux
kernel and QEMU recently started allowing for guests TSC frequency to
fall into +/- 250 ppm tolerance interval around the host TSC frequency.
Let's do the same to avoid unnecessary failures (esp. during migration)
in case the host frequency does not exactly match the frequency
configured in a domain XML.
https://bugzilla.redhat.com/show_bug.cgi?id=1839095
Signed-off-by: Jiri Denemark <jdenemar@redhat.com>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
A qemu guest which has virtiofs config fails to start if the previous
starting failed because of invalid option or something.
That's because the virtiofsd isn't killed by virPidFileForceCleanupPath()
on the former failure because the pidfile was already removed by
virFileDeleteTree(priv->libDir) in qemuProcessStop(), so
virPidFileForceCleanupPath() just returned.
Move qemuExtDevicesStop() before virFileDeleteTree(priv->libDir) so that
virPidFileForceCleanupPath() can kill virtiofsd correctly.
For example of the reproduction:
# virsh start guest
error: Failed to start domain guest
error: internal error: process exited while connecting to monitor: qemu-system-x86_64: -foo: invalid option
... fix the option ...
# virsh start guest
error: Failed to start domain guest
error: Cannot open log file: '/var/log/libvirt/qemu/guest-fs0-virtiofsd.log': Device or resource busy
#
Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Don't hide our use of GHashTable behind our typedef. This will also
promote the use of glibs hash function directly.
Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Reviewed-by: Matt Coleman <matt@datto.com>
Currently all errors from qemuInterfacePrepareSlirp() are completely
ignored by the callers. The intention is that missing qemu-slirp binary
should cause the caller to fallback to the built-in slirp impl.
Many of the possible errors though should indeed be considered fatal.
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Since QEMU 5.2 (commit-77b285f7f6), QEMU supports 'memory failure'
event, posts event to monitor if hitting a hardware memory error.
Fully support this feature for QEMU.
Test with commit 'libvirt: support memory failure event', build a
little complex environment(nested KVM):
1, install newly built libvirt in L1, and start a L2 vm. run command
in L1:
~# virsh event l2 --event memory-failure
2, run command in L0 to inject MCE to L1:
~# virsh qemu-monitor-command l1 --hmp mce 0 9 0xbd000000000000c0 0xd 0x62000000 0x8c
Test result in l1(recipient hypervisor case):
event 'memory-failure' for domain l2:
recipient: hypervisor
action: ignore
flags:
action required: 0
recursive: 0
Test result in l1(recipient guest case):
event 'memory-failure' for domain l2:
recipient: guest
action: inject
flags:
action required: 0
recursive: 0
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
All users of virHashTable pass strings as the name/key of the entry.
Make this an official requirement by turning the variables to 'const
char *'.
For any other case it's better to use glib's GHashTable.
Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Pavel Hrdina <phrdina@redhat.com>
Use of the -enable-fips option is being deprecated in QEMU >= 5.2.0. If
FIPS compliance is required, QEMU must be built with libcrypt which will
unconditionally enforce it.
Thus there is no need for libvirt to pass -enable-fips to modern QEMU.
Unfortunately there was never any way to probe for -enable-fips in the
first instance, it was enabled by libvirt based on version number
originally, and then later unconditionally enabled when libvirt dropped
support for older QEMU. Similarly we now use a version number check to
decide when to stop passing -enable-fips.
Note that the qemu-5.2 capabilities are currently from the pre-release
version and will be updated once qemu-5.2 is released.
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
This patch adds new schema and adds support for parsing and formatting
domain configurations that include vdpa devices.
vDPA network devices allow high-performance networking in a virtual
machine by providing a wire-speed data path. These devices require a
vendor-specific host driver but the data path follows the virtio
specification.
When a device on the host is bound to an appropriate vendor-specific
driver, it will create a chardev on the host at e.g. /dev/vhost-vdpa-0.
That chardev path can then be used to define a new interface with
type='vdpa'.
Signed-off-by: Jonathon Jongsma <jjongsma@redhat.com>
Reviewed-by: Laine Stump <laine@redhat.com>
SCSI hostdev setup requires querying the host os for the actual path of
the configured hostdev. This was historically done in the command line
formatter. Our new approach is to split out this part into
'qemuProcessPrepareHost' which is designed to be skipped in tests.
Refactor the hostdev code to use this new semantics, and add appropriate
handlers filling in the data for tests and the qemuConnectDomainXMLToNative
users.
Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Host preparation steps which are deliberately skipped when
pretend-creating a commandline are normally executed after VM object
preparation. In the test code we are faking some of the host
preparation steps, but we were doing that prior to the call to
qemuProcessPrepareDomain embedded in qemuProcessCreatePretendCmd.
By splitting up qemuProcessCreatePretendCmd into two functions we can
ensure that the ordering of the prepare steps stays consistent.
Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
These XML attributes have been mandatory since the introduction of SEV
support to libvirt. This design decision was based on QEMU's
requirement for these to be mandatory for migration purposes, as
differences in these values across platforms must result in the
pre-migration checks failing (not that migration with SEV works at the
time of this patch).
This patch enables autofill of these attributes right before launching
QEMU and thus updating the live XML.
Signed-off-by: Erik Skultety <eskultet@redhat.com>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Checks such as this one should be done at domain def validation time,
not before starting the QEMU process.
As for this change, existing domains will see some QEMU error when
starting as opposed to a libvirt error that this QEMU binary doesn't
support SEV, but that's okay, we never guaranteed error messages to
remain the same.
Signed-off-by: Erik Skultety <eskultet@redhat.com>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
As preparation for g_autoptr() we need to change the function to take
only virCgroupPtr.
Signed-off-by: Pavel Hrdina <phrdina@redhat.com>
Reviewed-by: Jonathon Jongsma <jjongsma@redhat.com>
Move them to separate conditions to reduce churn
in following patches.
Signed-off-by: Ján Tomko <jtomko@redhat.com>
Reviewed-by: Erik Skultety <eskultet@redhat.com>
Add overlays after the VM starts before we start executing guest code.
Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
Tested-by: Ján Tomko <jtomko@redhat.com>
Later patches will implement support for <transient/> disks in libvirt
by installing an overlay on top of the configured image. This will
require cleanup after the VM will be stopped so that the state is
correctly discarded.
Since the overlay will be installed only during the startup phase of the
VM we need to ensure that qemuProcessStop doesn't delete the original
file on some previous failure. This is solved by adding
'inhibitDiskTransientDelete' VM private data member which is set prior
to any startup step and will be cleared once transient disk overlays are
established.
Based on that we can then delete the overlays for any <transient/> disk.
Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
Tested-by: Ján Tomko <jtomko@redhat.com>
Historically we've prepared secrets for all objects in one place. This
doesn't make much sense and it's semantically more appealing to prepare
everything for a single device type in one place.
Move the setup of the (iSCSI|SCSI) hostdev secrets into a new function
which will be used to setup other things as well in the future.
This is a similar approach we do for disks.
Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
After we started copying the privateData pointer in
qemuDomainObjRestoreJob, we should also free them
once we're done with them.
Register the clear function and use g_auto.
Also add a check for job->cb to qemuDomainObjClearJob,
to prevent freeing an uninitialized job.
https://bugzilla.redhat.com/show_bug.cgi?id=1878450
Signed-off-by: Ján Tomko <jtomko@redhat.com>
Fixes: aca37c3fb2
When starting a domain, qemuProcessLaunch() iterates over all
VIR_PERF_EVENT_* values and (possibly) enables them. While there
is nothing wrong with the code, the for loop where it's done makes
it harder to jump onto next block of code.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
This is just a clean-up of commit 3791f29b08 using the new parameter of
virProcessSetAffinity() introduced in commit 9514e24984 so that there is
no error reported in the logs.
Signed-off-by: Martin Kletzander <mkletzan@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
This hunk was introduced in [1] in order to avoid loosing
events from monitor on stopping qemu process. But as explained
in [2] on destroy we won't get neither EOF nor any other
events as monitor is just closed. In case of crash/shutdown
we won't get any more events as well and qemuDomainObjStopWorker
will be called by qemuProcessStop eventually. Thus let's
remove qemuDomainObjStopWorker from qemuProcessHandleMonitorEOF
as it is not useful anymore.
[1] e6afacb0f: qemu: start/stop an event loop thread for domains
[2] d2954c072: qemu: ensure domain event thread is always stopped
Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Consider a host with 8 CPUs. There are the following possible scenarios
1. Bare metal; libvirtd has affinity of 8 CPUs; QEMU should get 8 CPUs
2. Bare metal; libvirtd has affinity of 2 CPUs; QEMU should get 8 CPUs
3. Container has affinity of 8 CPUs; libvirtd has affinity of 8 CPus;
QEMU should get 8 CPUs
4. Container has affinity of 8 CPUs; libvirtd has affinity of 2 CPus;
QEMU should get 8 CPUs
5. Container has affinity of 4 CPUs; libvirtd has affinity of 4 CPus;
QEMU should get 4 CPUs
6. Container has affinity of 4 CPUs; libvirtd has affinity of 2 CPus;
QEMU should get 4 CPUs
Scenarios 1 & 2 always work unless systemd restricted libvirtd privs.
Scenario 3 works because libvirt checks current affinity first and
skips the sched_setaffinity call, avoiding the SYS_NICE issue
Scenario 4 works only if CAP_SYS_NICE is availalbe
Scenarios 5 & 6 works only if CAP_SYS_NICE is present *AND* the cgroups
cpuset is not set on the container.
If libvirt blindly ignores the sched_setaffinity failure, then scenarios
4, 5 and 6 should all work, but with caveat in case 4 and 6, that
QEMU will only get 2 CPUs instead of the possible 8 and 4 respectively.
This is still better than failing.
Therefore libvirt can blindly ignore the setaffinity failure, but *ONLY*
ignore it when there was no affinity specified in the XML config.
If user specified affinity explicitly, libvirt must report an error if
it can't be honoured.
Resolves: https://bugzilla.redhat.com/1819801
Suggested-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Martin Kletzander <mkletzan@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Currently, we are mixing: #if HAVE_BLAH with #if WITH_BLAH.
Things got way better with Pavel's work on meson, but apparently,
mixing these two lead to confusing and easy to miss bugs (see
31fb929eca for instance). While we were forced to use HAVE_
prefix with autotools, we are free to chose our own prefix with
meson and since WITH_ prefix appears to be more popular let's use
it everywhere.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
When creating a standard tap device, if provided with an ifname that
contains "%d", rather than taking that literally as the name to use
for the new device, the kernel will instead use that string as a
template, and search for the lowest number that could be put in place
of %d and produce an otherwise unused and unique name for the new
device. For example, if there is no tap device name given in the XML,
libvirt will always send "vnet%d" as the device name, and the kernel
will create new devices named "vnet0", "vnet1", etc. If one of those
devices is deleted, creating a "hole" in the name list, the kernel
will always attempt to reuse the name in the hole first before using a
name with a higher number (i.e. it finds the lowest possible unused
number).
The problem with this, as described in the previous patch dealing with
macvtap device naming, is that it makes "immediate reuse" of a newly
freed tap device name *much* more common, and in the aftermath of
deleting a tap device, there is some other necessary cleanup of things
which are named based on the device name (nwfilter rules, bandwidth
rules, OVS switch ports, to name a few) that could end up stomping
over the top of the setup of a new device of the same name for a
different guest.
Since the kernel "create a name based on a template" functionality for
tap devices doesn't exist for macvtap, this patch for standard tap
devices is a bit different from the previous patch for macvtap - in
particular there was no previous "bitmap ID reservation system" or
overly-complex retry loop that needed to be removed. We simply find
and unused name, and pass that name on to the kernel instead of
"vnet%d".
This counter is also wrapped when either it gets to INT_MAX or if the
full name would overflow IFNAMSIZ-1 characters. In the case of
"vnet%d" and a 32 bit int, we would reach INT_MAX first, but possibly
someday someone will change the name from vnet to something else.
(NB: It is still possible for a user to provide their own
parameterized template name (e.g. "mytap%d") in the XML, and libvirt
will just pass that through to the kernel as it always has.)
Signed-off-by: Laine Stump <laine@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
There have been some reports that, due to libvirt always trying to
assign the lowest numbered macvtap / tap device name possible, a new
guest would sometimes be started using the same tap device name as
previously used by another guest that is in the process of being
destroyed *as the new guest is starting.
In some cases this has led to, for example, the old guest's
qemuProcessStop() code deleting a port from an OVS switch that had
just been re-added by the new guest (because the port name is based on
only the device name using the port). Similar problems can happen (and
I believe have) with nwfilter rules and bandwidth rules (which are
both instantiated based on the name of the tap device).
A couple patches have been previously proposed to change the ordering
of startup and shutdown processing, or to put a mutex around
everything related to the tap/macvtap device name usage, but in the
end no matter what you do there will still be possible holes, because
the device could be deleted outside libvirt's control (for example,
regular tap devices are automatically deleted when the qemu process
terminates, and that isn't always initiated by libvirt but could
instead happen completely asynchronously - libvirt then has no control
over the ordering of shutdown operations, and no opportunity to
protect it with a mutex.)
But this only happens if a new device is created at the same time as
one is being deleted. We can effectively eliminate the chance of this
happening if we end the practice of always looking for the lowest
numbered available device name, and instead just keep an integer that
is incremented each time we need a new device name. At some point it
will need to wrap back around to 0 (in order to avoid the IFNAMSIZ 15
character limit if nothing else), and we can't guarantee that the new
name really will be the *least* recently used name, but "math"
suggests that it will be *much* less common that we'll try to re-use
the *most* recently used name.
This patch implements such a counter for macvtap/macvlan, replacing
the existing, and much more complicated, "ID reservation" system. The
counter is set according to whatever macvtap/macvlan devices are
already in use by guests when libvirtd is started, incremented each
time a new device name is needed, and wraps back to 0 when either
INT_MAX is reached, or when the resulting device name would be longer
than IFNAMSIZ-1 characters (which actually is what happens when the
template for the device name is "maccvtap%d"). The result is that no
macvtap name will be re-used until the host has created (and possibly
destroyed) 99,999,999 devices.
Signed-off-by: Laine Stump <laine@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Split those initializations that depend on a statement
above them.
Signed-off-by: Ján Tomko <jtomko@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>