In commit c43718ef67944 I've added a disclaimer that the new stats which
are fetched from qemu and passed directly to the user are not guaranteed
by libvirt. I didn't notice that per-vcpu hypervisor specific stats are
also snuck into the VIR_DOMAIN_STATS_VCPU group along with other
pre-existing stats we do guarantee.
Extend the disclaimer for VIR_DOMAIN_STATS_VCPU too.
Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Jiri Denemark <jdenemar@redhat.com>
Guests are allowed to change their MAC addresses. Subsequently,
we may respond to that with tweaking that part of host side
configuration that depends on it. In this particular case: QoS.
Some parts of QoS are in fact set on corresponding bridge, where
overall view on traffic can be seen. Here, TC filters are used to
place incoming packets into qdiscs. These filters match source
MAC address. Therefore, upon guest changing its MAC address, the
corresponding TC filter needs to be updated too. This is done by
simply removing the old one and instantiating a new one, with new
MAC address.
Now, u32 filters (which we use) use a hash table for matching,
internally. And when deleting the old filter, we used to remove
the hash table (ID = 800::) and let the new filter instantiate
new hash table. This used to work, until kernel release 4.20
(specifically commit v4.20-rc1~27^2~131^2~11 and its friends)
where this practice was turned into error.
But that's okay - we can delete the specific filter we are after
and not touch the hash table at all.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Jiri Denemark <jdenemar@redhat.com>
When setting up QoS for a domain <interface/>, or when reporting
its statistics we may need to swap TX/RX values. This is all
explained in comment to virDomainNetTypeSharesHostView().
However, this function claims that VIR_DOMAIN_NET_TYPE_ETHERNET
also shares the 'host view', meaning the TX/RX values must be
swapped. But that's not true.
An easy reproducer is to start a domain with two <interface/>-s:
one type of network, the other of type ethernet and configure the
same <bandwidth/> for both. Reversed setting can then be observed
(e.g. via tc).
Reported-by: Oleg Vasilev <oleg.vasilev@virtuozzo.com>
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Jiri Denemark <jdenemar@redhat.com>
Our RPC calls can be divided into two groups: regular and high
priority. The latter can be then processed by so called high
priority worker threads. This is our way of defeating a
'deadlock' and allowing some RPCs to be processed even when all
(regular) worker threads are stuck. For instance: if all regular
worker threads get stuck when talking to QEMU on monitor, the
virDomainDestroy() can be processed by a high priority worker
thread(s) and thus unstuck those threads.
Now, this is all fine, except if users want to use virsh
non interactively:
virsh destroy $dom
This does a bit more - it needs to open a connection. And that
consists of multiple RPC calls: AUTH_LIST,
CONNECT_SUPPORTS_FEATURE, CONNECT_OPEN, and finally
CONNECT_REGISTER_CLOSE_CALLBACK. All of them are marked as high
priority except the last one. Therefore, virsh just sits there
with a partially open connection.
There's one requirement for high priority calls though: they can
not get stuck. Hopefully, the reason is obvious by now. And
looking into the server side implementation the
CONNECT_REGISTER_CLOSE_CALLBACK processing can't ever get stuck.
The only driver that implements the callback for public API is
Parallels (vz). And that can't block really.
And for virConnectUnregisterCloseCallback() it's the same story.
Therefore, both can be marked as high priority.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2143840
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Use same style in the 'struct option' as:
struct option opt[] = {
{ a, b },
{ a, b },
...
{ a, b },
};
Signed-off-by: Jiang Jiacheng <jiangjiacheng@huawei.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
For the handling of usb we already allow plenty of read access,
but so far /sys/bus/usb/devices only needed read access to the directory
to enumerate the symlinks in there that point to the actual entries via
relative links to ../../../devices/.
But in more recent systemd with updated libraries a program might do
getattr calls on those symlinks. And while symlinks in apparmor usually
do not matter, as it is the effective target of an access that has to be
allowed, here the getattr calls are on the links themselves.
On USB hostdev usage that causes a set of denials like:
apparmor="DENIED" operation="getattr" class="file"
name="/sys/bus/usb/devices/usb1" comm="qemu-system-x86"
requested_mask="r" denied_mask="r" ...
It is safe to read the links, therefore add a rule to allow it to
the block of rules that covers the usb related access.
Signed-off-by: Christian Ehrhardt <christian.ehrhardt@canonical.com>
Reviewed-by: Michal Privoznik <mprivozn at redhat.com>
When there is no vIOMMU, vfio devices don't need to lock the entire guest
memory per-device, but they still need to lock the entire guest memory to
share between all vfio devices. This memory accounting is not shared
with vDPA devices, so it should be added to the memlock limit separately.
Commit 8d5704e2 added support for multiple vfio/vdpa devices but
calculated the limits incorrectly when there were both vdpa and vfio
devices and no vIOMMU. In this case, the memory lock limit was not
increased separately for the vfio devices.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2143838
Signed-off-by: Jonathon Jongsma <jjongsma@redhat.com>
Reviewed-by: Laine Stump <laine@redhat.com>
When post-copy migration is running in Finish phase we already did
everything needed and we're just waiting for all the memory to transfer
to the destination. The domain is already running on there at this
point. Once all data is transferred (QEMU sends a MIGRATION completed
event) we're done. So in this specific post-copy case the source does
not need to care about the result of the Finish call as long as QEMU
says migration completed. The Finish call to the destination daemon may
fail for reasons that do not affect QEMU, e.g., libvirt daemon was
restarted there or the libvirt connection broke.
Currently we just mark the post-copy migration as failed on the source
and keep the domain paused there. But when libvirt daemon is restarted
at this point, it will detect migration finished successfully and kill
the domain as migrated. It make sense to do this even without having to
restart the daemon.
Closes: https://gitlab.com/libvirt/libvirt/-/issues/338
Signed-off-by: Jiri Denemark <jdenemar@redhat.com>
Reviewed-by: Peter Krempa <pkrempa@redhat.com>
We need the restored job even in case the migration already finished
even though we will stop it just a few lines below as the functions we
call in between require an existing migration job.
This fixes a crash on reconnect when post-copy migration finished while
the daemon was not running.
Signed-off-by: Jiri Denemark <jdenemar@redhat.com>
Reviewed-by: Peter Krempa <pkrempa@redhat.com>
These format are left unchanged when convert 'unsigned long' to
'unsigned long long', which caused compile warning.
Signed-off-by: Jiang Jiacheng <jiangjiacheng@huawei.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
Signed-off-by: Ján Tomko <jtomko@redhat.com>
If the running firewalld doesn't support getPolicies() then we fallback
to the "libvirt" zone. Throwing an error log is excessive since we
gracefully fallback.
Avoids these logs:
error : virGDBusCallMethod:242 : error from service: \
GDBus.Error:org.freedesktop.DBus.Error.UnknownMethod
Fixes: ab56f84976e0 ("util: add virFirewallDGetPolicies()")
Signed-off-by: Eric Garver <eric@garver.life>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Register virDomainMemoryDefFree() to do the cleanup.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Tim Wiederhake <twiederh@redhat.com>
This is new allocator for virDomainMemoryDef struct which also
sets some default values: @model and @targetNode.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Tim Wiederhake <twiederh@redhat.com>
The idea here is that virVMXConfigScanResultsCollector() sets the
networks_max_index to the highest ethernet index seen. Well, the
struct member is signed int, we parse just seen index into uint
and then typecast to compare the two. This is not necessary,
because the maximum number of NICs a vSphere domain can have is
(<drumrolll/>): ten [1]. This will fit into signed int easily
anywhere.
1: https://configmax.esp.vmware.com/guest?vmwareproduct=vSphere&release=vSphere%208.0&categories=1-0
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Tim Wiederhake <twiederh@redhat.com>
Now that we have STRCASESKIP() there's no need to open code it.
Convert virVMXConfigScanResultsCollector() so that it uses this
new macro.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Tim Wiederhake <twiederh@redhat.com>
There is so far one case where STRCASEPREFIX(a, b) && a +
strlen(b) combo is used (in virVMXConfigScanResultsCollector()),
but there will be more. Do what we do usually: introduce a macro.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Tim Wiederhake <twiederh@redhat.com>
When generating memory for main guest memory memory-backend-*
might be used. This means, we may need to generate thread-context
objects too.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Martin Kletzander <mkletzan@redhat.com>
When generating memory for memory devices memory-backend-* might
be used. This means, we may need to generate thread-context
objects too.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Martin Kletzander <mkletzan@redhat.com>
When generating memory for guest NUMA memory-backend-* might be
used. This means, we may need to generate thread-context objects
too.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Martin Kletzander <mkletzan@redhat.com>
While technically thread-context objects can be reused, we only
use them (well, will use them) to pin memory allocation threads.
Therefore, once we connect to QEMU monitor, all memory (with
prealloc=yes) was allocated and thus these objects are no longer
needed and can be removed. For on demand allocation the TC object
is left behind.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Martin Kletzander <mkletzan@redhat.com>
The aim of thread-context object is to set affinity on threads
that allocate memory for a memory-backend-* object. For instance:
-object '{"qom-type":"thread-context","id":"tc-ram-node0","node-affinity":[3]}' \
-object '{"qom-type":"memory-backend-memfd","id":"ram-node0","hugetlb":true,\
"hugetlbsize":2097152,"share":true,"prealloc":true,"prealloc-threads":8,\
"size":15032385536,"host-nodes":[3],"policy":"preferred",\
"prealloc-context":"tc-ram-node0"}' \
allocates 14GiB worth of memory, backed by 2MiB hugepages from
host NUMA node 3, using 8 threads. If it weren't for
thread-context these threads wouldn't have any affinity and thus
theoretically could be scheduled to run on CPUs of different NUMA
node (which is what I saw occasionally).
Therefore, whenever we are pinning memory (IOW setting host-nodes
attribute), we can generate thread-context object with the same
affinity.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Martin Kletzander <mkletzan@redhat.com>
In its commit v7.1.0-1429-g7208429223 QEMU gained new object
thread-context, which allows running specialized tasks with
affinity set to a given subset of host CPUs/NUMA nodes. Even
though only memory allocation task accepts this new object, it's
exactly what we aim to implement in libvirt. Therefore, introduce
a new capability to track whether QEMU is capable of this object.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Martin Kletzander <mkletzan@redhat.com>
On aarch64 the 'id' file is not present for CPU cache information in
sysfs. This causes the local stateful hypervisor drivers to fail to
initialize capabilities:
virStateInitialize:657 : Initialisation of cloud-hypervisor state driver failed: no error
The 'no error' is because the 'virFileReadValueNNN' methods return
ret==-2, with no error raised, when the requeted file does not exist.
None of the callers were checking for this scenario when populating
capabilities. The most graceful way to handle this is to skip the
cache bank in question. This fixes failure to launch libvirt drivers
on certain aarch64 hardware.
Fixes: https://gitlab.com/libvirt/libvirt/-/issues/389
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
This lets us simplify the cleanup paths when populating the host cache
bank information in capabilities XML.
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Inside virt-qemu-run, just like in virsh for example, there is no
identity set in the current thread, so we should not try to set it,
otherwise things like connecting to other drivers might fail and on
top of that there is no error set so the user can't even see what's
wrong.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2000075
Signed-off-by: Martin Kletzander <mkletzan@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
It is already nonfallible, so just change the return type to void.
Signed-off-by: Martin Kletzander <mkletzan@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
In one of recent commits an error message was introduced. In this
message a variable of type ssize_t is being printed out, but the
corresponding format directive is %ld instead of %zd which breaks
on 32bits systems. Switch to proper format.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
According to the result parsing from xml, add the argument of
SGX EPC memory backend into QEMU command line.
$ qemu-system-x86_64 \
...... \
-object '{"qom-type":"memory-backend-epc","id":"memepc0","prealloc":true,"size":67108864,"host-nodes":[0,1],"policy":"bind"}' \
-object '{"qom-type":"memory-backend-epc","id":"memepc1","prealloc":true,"size":16777216,"host-nodes":[2,3],"policy":"bind"}' \
-machine sgx-epc.0.memdev=memepc0,sgx-epc.0.node=0,sgx-epc.1.memdev=memepc1,sgx-epc.1.node=1
Signed-off-by: Lin Yang <lin.a.yang@intel.com>
Signed-off-by: Haibin Huang <haibin.huang@intel.com>
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
As advertised in previous commits, QEMU needs to access
/dev/sgx_vepc and /dev/sgx_provision files when SGX memory
backend is configured. And if it weren't for QEMU's namespaces,
we wouldn't dare to relabel them, because they are system wide
files. But if namespaces are used, then we can set label on
domain's private copies, just like we do for /dev/sev.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Signed-off-by: Haibin Huang <haibin.huang@intel.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
This is similar to the previous commit. SGX memory backend needs
to access /dev/sgx_vepc and /dev/sgx_provision. Create these
nodes in domain's private /dev when required by domain's config.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Signed-off-by: Haibin Huang <haibin.huang@intel.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
SGX memory backend needs to access /dev/sgx_vepc (which allows
userspace to allocate "raw" EPC without an associated enclave)
and /dev/sgx_provision (which allows creating provisioning
enclaves). Allow these two devices in CGroups if a domain is
configured so.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Signed-off-by: Haibin Huang <haibin.huang@intel.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Extend hypervisor capabilities to include sgx feature. When available,
the hypervisor supports launching an VM with SGX on Intel platfrom.
The SGX feature tag privides additional details like section size and
sgx1 or sgx2.
Signed-off-by: Haibin Huang <haibin.huang@intel.com>
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Generate the QMP command for query-sgx-capabilities and the command
return SGX capabilities from QMP.
{"execute":"query-sgx-capabilities"}
the right reply:
{"return":
{
"sgx": true,
"section-size": 197132288,
"flc": true
}
}
the error reply:
{"error":
{"class": "GenericError", "desc": "SGX is not enabled in KVM"}
}
Signed-off-by: Haibin Huang <haibin.huang@intel.com>
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
JSON args for -netdev were added as precursor for adding the 'dgram'
network backend type. Enable the detection and update test cases using
DO_TEST_CAPS_LATEST.
Enabling the capability also ensures that the -netdev argument is
validated against the QAPI schema of 'netdev_add' which was already
implemented but not enabled.
The parser supporting JSON was added by qemu commit f3eedcddba3 and
enabled when adding stream/dgram netdevs in commit 5166fe0ae46.
Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Certain udev entries might be of a size that makes libudev emit EINVAL
which right now leads to udevEventHandleThread exiting. Due to no more
handling events other elements of libvirt will start pushing for events
to be consumed which never happens causing a busy loop burning a cpu
without any gain.
After evaluation of the example case discussed in in #245 and a test
run ignoring EINVAL it was considered safe to add EINVAL to the ignored
errnos to not exit udevEventHandleThread giving it more resilience.
The root cause is in systemd and by now was discussed and fixed via
https://github.com/systemd/systemd/issues/24987, but hardening libvirt
to be able to better deal with EINVAL returned still is the right thing
to avoid the reported busy loops on systemd with older systemd versions.
Fixes: https://gitlab.com/libvirt/libvirt/-/issues/245
Signed-off-by: Christian Ehrhardt <christian.ehrhardt@canonical.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
clang 14.0.5 complains:
../src/bhyve/bhyve_device.c:42:29: error: mixing declarations and code
is incompatible with standards before C99
[-Werror,-Wdeclaration-after-statement]
virDomainPCIAddressSet *addrs = opaque;
^
1 error generated.
And a few similar errors in some other places, mainly bhyve related.
Apply a trivial fix to resolve that.
Signed-off-by: Roman Bogorodskiy <bogorodskiy@gmail.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
All callers pass the equivalent of looking up whether qemu supports
QEMU_CAPS_QMP_QUERY_NAMED_BLOCK_NODES_FLAT. Use
'mon->queryNamedBlockNodesFlat' directly and refactor all callers.
Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
'query-named-block-nodes' in non-flat mode returns redundantly nested
data under the 'backing-image' field. Fortunately we don't need it when
updating the capacity stats.
This function was unfortunately not fixed originally when the support
for flat mode was added. Use the flat cached in the monitor object to
force flat mode if available.
Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
Rather than having callers always pass this flag store it in the
qemuMonitor object. Following patches will convert the code to use this
internal flag.
In the future this will also simplify removal when all supported qemu
versions will support the new mode.
Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
We don't need automatic freeing for 'blockNamedNodeData' and we can
directly return it rather than checking it for NULL-ness first.
Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
qemu-6.2 introduced support for the hv-avic enlightenment which allows
to use Hyper-V SynIC with hardware APICv/AVIC enabled.
Implement the libvirt support for it.
Closes: https://gitlab.com/libvirt/libvirt/-/issues/402
Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
Based on qemu commit e1f9a8e8c90ae54387922e33e5ac4fd759747d01 introduce
the hv-avic feature in leaf 0x40000004, EAX 0x00000200 (1 << 9).
Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
'VIR_CPU_x86_HV_STIMER_DIRECT' is reported under leaf 0x40000003,
but the data is in the EDX register. Create a new group for such
features and move them after the 0x40000003 EAX group.
Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
In recent commits migration of TPM on shared storage was
introduced. However, I've only complied it with gcc and thus did
not notice that clang build fails due to missing break; at the
end of some (empty) cases in switch() statements.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Never remove the TPM state on outgoing migration if the storage setup
has shared storage for the TPM state files. Also, do not do the security
cleanup on outgoing migration if shared storage is detected.
Signed-off-by: Stefan Berger <stefanb@linux.ibm.com>
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>