1723 Commits

Author SHA1 Message Date
Jiri Denemark
66643931e7 qemu: Add support for /dev/userfaultfd
/dev/userfaultfd device is preferred over userfaultfd syscall for
post-copy migrations. Unless qemu driver is configured to disable mount
namespace or to forbid access to /dev/userfaultfd in cgroup_device_acl,
we will copy it to the limited /dev filesystem QEMU will have access to
and label it appropriately. So in the default configuration post-copy
migration will be allowed even without enabling
vm.unprivileged_userfaultfd sysctl.

Signed-off-by: Jiri Denemark <jdenemar@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
2024-02-13 17:44:26 +01:00
Praveen K Paladugu
4bfd513d92 hypervisor: Move domain interface mgmt methods
Move domain interface management methods from qemu to hypervisor. This
refactoring allows the domain management methods to be shared between CH and
qemu drivers.

This commit does not introduce any functional changes.

Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
2024-02-02 10:58:26 +01:00
Praveen K Paladugu
a22d7fde17 conf: Drop unused parameter
Drop unused parameter from virDomainNetReleaseActualDevice method.

Signed-off-by: Praveen K Paladugu <prapal@linux.microsoft.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
2024-02-02 10:58:21 +01:00
Michal Privoznik
bee5301afa qemu_process: Skip over non-virtio non-TAP NIC models when refreshing rx-filter
After guest is started, or we are reconnecting to already running
one (after daemon restart), qemuProcessRefreshRxFilters() is
called to refresh rx-filters (basically MAC addresses of guest
NICs) as they might have changed while we were not running (for
the case when reconnecting to an already running guest), or we
need to enable them by running a command (for freshly started
guest - see processNicRxFilterChangedEvent()).

Now, our XML parser allowed trustGuestRxFilters attribute for all
types and models of <interface/> while in reality, only virtio
model AND TUN/TAP based types can see MAC address changes. For
other combinations, QEMU reports an error.

This all means that when the daemon is restarted and it
reconnects to a guest with, well, invalid configuration, or when
such a guest is restored from a saved image or migrated, we
issue the monitor command, to which QEMU replies with an error
which is then propagated to users:

  error: internal error: unable to execute QEMU command 'query-rx-filter': invalid net client name: hostdev0

While on one hand users should fix their configuration (and after
v10.0.0-rc1~123 they can do that even on live domains), libvirt
can also have some logic built in that prevents issuing the command
in the first place (for obviously wrong cases).

Fixes: 060d4c83ef
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Peter Krempa <pkrempa@redhat.com>
2024-01-25 15:55:33 +01:00
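
A minimal sketch of the kind of guard bee5301afa describes, under stated assumptions: the enums, the `nic_def` struct and `nic_wants_rx_filter_refresh()` are hypothetical stand-ins, not libvirt's types or functions. It also assumes the refresh is only attempted for interfaces that set trustGuestRxFilters. The point is only that 'query-rx-filter' is worth issuing for a virtio model on a TUN/TAP based type and nothing else.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the interface model and type. */
typedef enum { NIC_MODEL_VIRTIO, NIC_MODEL_E1000, NIC_MODEL_OTHER } nic_model;
typedef enum { NIC_BACKEND_TAP, NIC_BACKEND_HOSTDEV, NIC_BACKEND_USER } nic_backend;

typedef struct {
    nic_model model;
    nic_backend backend;
    bool trust_guest_rx_filters;
} nic_def;

/* Only a virtio NIC on a TUN/TAP backend can see MAC address changes,
 * so only that combination is worth querying. */
bool nic_wants_rx_filter_refresh(const nic_def *nic)
{
    assert(nic != NULL);
    return nic->trust_guest_rx_filters &&
           nic->model == NIC_MODEL_VIRTIO &&
           nic->backend == NIC_BACKEND_TAP;
}
```

A caller would skip the monitor command for every interface where this returns false, instead of letting QEMU reject it.
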
Peter Krempa
2da71d8e43 qemu: process: Separate setup of network device objects
Separate the SLIRP bits from 'qemuProcessNetworkPrepareDevices' and do
the setup of the internal data when setting up domain data.

This will allow tests to use the same code path to lookup data for a
network.

Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
2024-01-04 22:26:10 +01:00
Artem Chernyshev
d05cdd1879 virprocess: virProcessGetNamespaces() to void
virProcessGetNamespaces() return value is invariant, so change its
return type to void and remove all dependent checks.

Signed-off-by: Artem Chernyshev <artem.chernyshev@red-soft.ru>
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
2024-01-04 17:06:14 +01:00
Guoyi Tu
dd2f36d66e qemu_driver: Don't handle the EOF event if vm get restarted
Currently, libvirt creates a thread pool with only one thread to handle all
qemu monitor events for virtual machines. If that thread gets stuck while
handling a monitor EOF event, for example because it is unable to kill the
virtual machine process or release resources, the events of other virtual
machines are blocked as well, which leads to abnormal behavior of those
virtual machines.

For instance, suppose another virtual machine completes a shutdown operation
and its monitor EOF event has been queued but remains unprocessed. If we
immediately destroy and start that virtual machine again, then when the EOF
event is finally processed, processMonitorEOFEvent() will kill the virtual
machine that was just started.

To address this issue, processMonitorEOFEvent() checks whether the current
virtual machine's id is equal to the one at the time the event was
generated. If they do not match, it returns immediately.

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Guoyi Tu <tugy@chinatelecom.cn>
Signed-off-by: dengpengcheng <dengpc12@chinatelecom.cn>
2024-01-03 17:13:23 +00:00
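
The stale-event check from dd2f36d66e can be sketched roughly as follows. This is an illustration only: `vm_obj`, `eof_event` and `handle_monitor_eof()` are hypothetical names, and the real processMonitorEOFEvent() does far more than flip a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified domain object; the id changes on every start. */
typedef struct {
    int id;
    bool running;
} vm_obj;

/* Hypothetical queued event: remembers the domain id at generation time. */
typedef struct {
    int vm_id;
} eof_event;

/* Returns true when the event was acted on, false when it was stale. */
bool handle_monitor_eof(vm_obj *vm, const eof_event *ev)
{
    assert(vm != NULL && ev != NULL);

    /* The domain was restarted after the event was queued: ignore it. */
    if (vm->id != ev->vm_id)
        return false;

    vm->running = false;    /* stand-in for the real teardown */
    return true;
}
```
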
Peter Krempa
69880584e6 qemuProcessStartWithMemoryState: Don't start qemu with '-loadvm SNAP' and '-incoming defer' together
A bug in qemuProcessStartWithMemoryState caused us to start qemu with
'-loadvm SNAP' and '-incoming defer' together.  qemu doesn't expect
that and crashes on an assertion failure [1].

[1]: https://issues.redhat.com/browse/RHEL-16782

Fixes: 8a88d3e586
Resolves: https://issues.redhat.com/browse/RHEL-17841
Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Pavel Hrdina <phrdina@redhat.com>
2023-12-01 11:35:14 +01:00
Michal Privoznik
cfcbba4c2b lib: Replace qsort() with g_qsort_with_data()
glibc provides qsort(), which is usually just a mergesort unless
the array being sorted is so huge that the temporary array used
by mergesort would not fit into physical memory (which in our
case never happens). Still, we are not guaranteed that it will
use mergesort. The advantage of mergesort is clear: it's stable.
IOW, if we have an array of values parsed from XML, qsort() it
and produce some output based on those values, we can then
compare the output with some expected output, line by line.

But with newer glibc this is all history. After [1], qsort() is
no longer mergesort but introsort instead, which is not stable.
This is suboptimal, because in some cases we want to preserve
order of equal items. For instance, in ebiptablesApplyNewRules(),
nwfilter rules are sorted by their priority. But if two rules
have the same priority, we want to keep them in the order they
appear in the XML. Since it's hard/needless work to identify
places where stable or unstable sorting is needed, let's just
play it safe and use stable sorting everywhere.

Fortunately, glib provides g_qsort_with_data(), which indeed
implements mergesort and is almost a drop-in replacement for
qsort(). It accepts a fifth argument (a pointer to opaque data)
that is passed to the comparator function, which then accepts
three arguments.

We have to keep one occurrence of qsort() though: in the NSS
module, which deliberately does not link with glib.

1: https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=03bf8357e8291857a435afcc3048e0b697b6cc04
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Martin Kletzander <mkletzan@redhat.com>
2023-11-24 09:53:14 +01:00
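
To illustrate why cfcbba4c2b cares about stability, here is a self-contained sketch. It does not use glib: `stable_sort_with_data()` is a hypothetical insertion sort that merely takes the same five arguments and the same three-argument comparator shape as g_qsort_with_data(), and the `rule` struct with its `seq` field is made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Three-argument comparator, the shape g_qsort_with_data() expects. */
typedef int (*cmp_with_data)(const void *a, const void *b, void *opaque);

/* Illustrative stable insertion sort, NOT glib's implementation.
 * An element only moves past strictly greater ones, so equal
 * elements keep their input order. Elements are limited to 64 bytes. */
void stable_sort_with_data(void *base, size_t n, size_t size,
                           cmp_with_data cmp, void *opaque)
{
    unsigned char *arr = base;
    unsigned char tmp[64];

    assert(size <= sizeof(tmp));
    for (size_t i = 1; i < n; i++) {
        size_t j = i;

        memcpy(tmp, arr + i * size, size);
        while (j > 0 && cmp(arr + (j - 1) * size, tmp, opaque) > 0) {
            memcpy(arr + j * size, arr + (j - 1) * size, size);
            j--;
        }
        memcpy(arr + j * size, tmp, size);
    }
}

/* Hypothetical rule; 'seq' records the order of appearance in the XML. */
typedef struct {
    int priority;
    int seq;
} rule;

/* Compares by priority only; the opaque pointer is unused here. */
int rule_cmp(const void *a, const void *b, void *opaque)
{
    const rule *ra = a;
    const rule *rb = b;

    (void)opaque;
    return (ra->priority > rb->priority) - (ra->priority < rb->priority);
}
```

Sorting rules with priorities 500, 100, 500, 100 this way leaves the two rules of each priority in their original relative order, which an unstable sort does not promise.
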
Pavel Hrdina
4f4a8dce94 qemu_process: fix crash in qemuSaveImageDecompressionStart
Commit changing the code to allow passing NULL as @data into
qemuSaveImageDecompressionStart() was not correct as it left the
original call into the function as well.

Introduced-by: 2f3e582a1a
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2247754
Signed-off-by: Pavel Hrdina <phrdina@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
2023-11-03 14:17:06 +01:00
Peter Krempa
61baeb1152 qemu: process: Extract host setup of disk device into helpers
Currently the code sets up only VDPA backends but will be used later in
hotplug code too.

This patch also uses normal forward iteration in the loop in
qemuProcessPrepareHostStorage as we don't need to remove disks from the
disk list at that point.

Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
2023-10-27 15:04:20 +02:00
Peter Krempa
3781988107 qemu: Refactor storage backend 'storage' layer helepr object setup
Use the new nodename accessors for any storage layer helper object.

Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
2023-10-17 14:16:16 +02:00
Pavel Hrdina
2f3e582a1a qemuProcessStartWithMemoryState: make it possible to use without data
When used with internal snapshots there is no memory state file so we
have no data to load and decompression is not needed.

Signed-off-by: Pavel Hrdina <phrdina@redhat.com>
Reviewed-by: Peter Krempa <pkrempa@redhat.com>
2023-10-09 13:56:50 +02:00
Pavel Hrdina
8a88d3e586 qemuProcessStartWithMemoryState: add snapshot argument
When called from snapshot code we will need to pass the snapshot object
in order to make internal snapshots work correctly.

Signed-off-by: Pavel Hrdina <phrdina@redhat.com>
Reviewed-by: Peter Krempa <pkrempa@redhat.com>
2023-10-09 13:56:49 +02:00
Pavel Hrdina
6a88060d32 qemuProcessStartWithMemoryState: allow setting reason for audit log
When called by snapshot code we will need to use a different reason.

Signed-off-by: Pavel Hrdina <phrdina@redhat.com>
Reviewed-by: Peter Krempa <pkrempa@redhat.com>
2023-10-09 13:56:49 +02:00
Pavel Hrdina
6c0f30b37e qemu_saveimage: move qemuSaveImageStartProcess to qemu_process
The function will no longer be used only when restoring VM as it will
be used when reverting snapshot as well so move it to qemu_process
and rename it accordingly.

Signed-off-by: Pavel Hrdina <phrdina@redhat.com>
Reviewed-by: Peter Krempa <pkrempa@redhat.com>
2023-10-09 13:56:49 +02:00
Jonathon Jongsma
447e09dfdb qemu: Monitor nbdkit process for exit
Adds the ability to monitor the nbdkit process so that we can take
action in case the child exits unexpectedly.

When the nbdkit process exits, we pause the vm, restart nbdkit, and then
resume the vm. This allows the vm to continue working in the event of a
nbdkit failure.

Eventually we may want to generalize this functionality since we may
need something similar for e.g. qemu-storage-daemon, etc.

The process is monitored with the pidfd_open() syscall if it exists
(since linux 5.3). Otherwise it resorts to checking whether the process
is alive once a second. The one-second time period was chosen somewhat
arbitrarily.

Signed-off-by: Jonathon Jongsma <jjongsma@redhat.com>
Reviewed-by: Peter Krempa <pkrempa@redhat.com>
2023-09-19 14:28:50 -05:00
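
The two monitoring strategies mentioned in 447e09dfdb can be sketched as below. This is an assumption-laden illustration, not the libvirt code: `try_pidfd_open()` and `process_is_alive()` are hypothetical helpers, the raw syscall is used because a libc wrapper may be missing, and the polling fallback is shown as a plain signal-0 probe.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Try to obtain a pidfd for 'pid'. Returns the fd, or -1 when
 * pidfd_open() is unavailable or fails; the caller then falls back
 * to periodic polling. A pidfd becomes readable once the process exits. */
int try_pidfd_open(pid_t pid)
{
#ifdef SYS_pidfd_open
    return (int)syscall(SYS_pidfd_open, pid, 0);
#else
    (void)pid;
    errno = ENOSYS;
    return -1;
#endif
}

/* Fallback liveness probe: signal 0 performs only the existence and
 * permission checks without delivering anything. */
bool process_is_alive(pid_t pid)
{
    assert(pid > 0);
    if (kill(pid, 0) == 0)
        return true;
    return errno == EPERM;  /* exists, but we may not signal it */
}
```

With a valid pidfd the descriptor can be handed to an event loop; without one, `process_is_alive()` would be called from a timer, once a second in the commit's case.
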
Jonathon Jongsma
dfa657aa27 qemu: include nbdkit state in private xml
Add xml to the private data for a disk source to represent the nbdkit
process so that the state can be re-created if the libvirt daemon is
restarted. Format:

   <nbdkit>
     <pidfile>/path/to/nbdkit.pid</pidfile>
     <socketfile>/path/to/nbdkit.socket</socketfile>
   </nbdkit>

Signed-off-by: Jonathon Jongsma <jjongsma@redhat.com>
Reviewed-by: Peter Krempa <pkrempa@redhat.com>
2023-09-19 14:28:50 -05:00
Jonathon Jongsma
e498941476 qemu: move qemuProcessReadLog() to qemuLogContext
This code can be used by the nbdkit implementation for reading back
filtered log data for error reporting. Move it to qemuLogContext so that
it can be shared. Renamed to qemuLogContextReadFiltered().

Signed-off-by: Jonathon Jongsma <jjongsma@redhat.com>
Reviewed-by: Peter Krempa <pkrempa@redhat.com>
2023-09-19 14:28:50 -05:00
Jonathon Jongsma
b658b1a27e qemu: Extract qemuDomainLogContext into a new file
This will allow us to use it for nbdkit logging in upcoming commits.

Signed-off-by: Jonathon Jongsma <jjongsma@redhat.com>
Reviewed-by: Peter Krempa <pkrempa@redhat.com>
2023-09-19 14:28:50 -05:00
Jonathon Jongsma
abdc4f2092 Generalize qemuDomainLogContextNew()
Allow specifying a basename for the log file so that
qemuDomainLogContextNew() can be used to create log contexts for
secondary loggers.

Signed-off-by: Jonathon Jongsma <jjongsma@redhat.com>
Reviewed-by: Peter Krempa <pkrempa@redhat.com>
2023-09-19 14:28:50 -05:00
Jonathon Jongsma
4ef2bcfd3f qemu: Implement support for vDPA block devices
Requires recent qemu with support for the virtio-blk-vhost-vdpa device
and the ability to pass a /dev/fdset/N path for the vdpa path (8.1.0)

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1900770
Signed-off-by: Jonathon Jongsma <jjongsma@redhat.com>
Reviewed-by: Peter Krempa <pkrempa@redhat.com>
2023-09-12 11:06:41 -05:00
Peter Krempa
24b769a25b qemu: capabilities: Remove unused 'virQEMUCapsFilterByMachineType'
The filtering of qemu capabilities by machine type doesn't seem to be
ever used, remove it and adjust callers.

Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
2023-09-04 10:31:52 +02:00
Peter Krempa
a9e71cb737 qemu: process: Probe machine type data on reconnect to qemu
When reconnecting we populate only the capability flags from the XML as
we need to know the exact flags that were present when starting the VM.

On the other hand the machine type data is not stored as it wasn't
really used after startup. While storing all of the data into the status
XML would be theoretically possible, with machine-type specific data it
makes no sense to do so, and thus the data can be re-probed from the
current instance.

Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
2023-09-04 10:31:52 +02:00
Michal Privoznik
895525db81 qemu: Move error messages onto a single line
Error messages are exempt from the 80 columns rule. Move them
onto one line.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Pavel Hrdina <phrdina@redhat.com>
2023-09-04 09:35:36 +02:00
Michal Privoznik
7d01b67323 src: Move _virDomainMemoryDef target nodes into an union
The _virDomainMemoryDef struct is getting a bit messy. It has
various members and only some of them are valid for given model.
Worse, some are re-used for different models. We tried to make
this more bearable by putting a comment next to each member
describing what models the member is valid for, but that gets
messy too.

Therefore, do what we do elsewhere: introduce a union of structs
and move individual members into their respective groups.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
2023-08-24 12:39:26 +02:00
Michal Privoznik
f23a991bea src: Move _virDomainMemoryDef source nodes into an union
The _virDomainMemoryDef struct is getting a bit messy. It has
various members and only some of them are valid for given model.
Worse, some are re-used for different models. We tried to make
this more bearable by putting a comment next to each member
describing what models the member is valid for, but that gets
messy too.

Therefore, do what we do elsewhere: introduce a union of structs
and move individual members into their respective groups.

This allows us to shorten some names (e.g. nvdimmPath or
sourceNodes) as their purpose is obvious due to their placement.
But to make this commit as small as possible, that'll be
addressed later.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
2023-08-24 12:39:23 +02:00
Ján Tomko
2bad705ebb qemu: remove pointless qemuDomainLogContextMode
Since its introduction in 4d1b771fbb
it has only been used to differentiate between START and non-START.

Last use of QEMU_DOMAIN_LOG_CONTEXT_MODE_ATTACH was removed by:

  commit f709377301
    qemu: Fix qemuDomainObjTaint with virtlogd

QEMU_DOMAIN_LOG_CONTEXT_MODE_STOP is unused since:

  commit cf3ea0769c
    qemu: process: Append the "shutting down" message using the new APIs

Now, the only caller passes QEMU_DOMAIN_LOG_CONTEXT_MODE_START.
Assume that's always the case and remove the 'mode' argument.

Signed-off-by: Ján Tomko <jtomko@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
2023-08-23 15:25:29 +02:00
Andrea Bolognani
b845e376a4 qemu: Match NVRAM template extension for new domains
Keep things consistent by using the same file extension for the
generated NVRAM path as the NVRAM template.

Signed-off-by: Andrea Bolognani <abologna@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
2023-08-21 13:51:32 +02:00
Michal Privoznik
1ca3c339a1 lib: Prefer sizeof(variable) instead of sizeof(type) in memset
If one of the previous commits taught us something, it's that
sizeof(variable) and sizeof(type) are not the same, especially
because for code that lives long enough the type might change
(e.g. as we use autoptr more). And since we don't get any
warnings when an incorrect length is passed to memset(), it is
easy to mess up. With sizeof(variable) it's not as easy.
Therefore, switch to using memset(variable, 0, sizeof(*variable)),
or its alternatives, depending on the level of pointers.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Claudio Fontana <cfontana@suse.de>
2023-08-03 16:41:19 +02:00
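
A small sketch of the pattern 1ca3c339a1 prefers. The `conn_info` struct and `conn_info_clear()` are hypothetical and exist only to show the idiom.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical struct used only for illustration. */
typedef struct {
    int fd;
    char name[16];
} conn_info;

/* Clears the pointed-to object. sizeof(*info) follows the type of the
 * variable, so the length stays correct even if that type changes later. */
void conn_info_clear(conn_info *info)
{
    assert(info != NULL);
    memset(info, 0, sizeof(*info));
}
```

Had the length been written as sizeof(conn_info), a later change of the parameter's type would compile silently and clear the wrong number of bytes.
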
Michal Privoznik
b20a5e9a4d lib: use struct zero initializer instead of memset
This is a more concise approach and guarantees there is
no time window where the struct is uninitialized.

Generated using the following semantic patch:

  @@
  type T;
  identifier X;
  @@
  -  T X;
  +  T X = { 0 };
     ... when exists
  (
  -  memset(&X, 0, sizeof(X));
  |
  -  memset(&X, 0, sizeof(T));
  )

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Claudio Fontana <cfontana@suse.de>
2023-08-03 16:41:19 +02:00
Nikolai Barybin
2d6659e778 qemu: prevent SIGSEGV in qemuProcessHandleDumpCompleted
If VIR_ASYNC_JOB_NONE flag is present, job.current is equal
to NULL, which leads to SIGSEGV. Thus, this check should be
moved up.

Fixes: v8.0.0-427-gf304de0df6
Signed-off-by: Nikolai Barybin <nikolai.barybin@virtuozzo.com>
Reviewed-by: Jiri Denemark <jdenemar@redhat.com>
2023-06-27 12:39:50 +02:00
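
The ordering problem fixed by 2d6659e778 can be illustrated with a sketch. All names here (`job_state`, `job_info`, `handle_dump_completed()`) are hypothetical simplifications; only the idea that the "no async job" check must come before any dereference of the current job is taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { ASYNC_JOB_NONE, ASYNC_JOB_DUMP } async_job;

typedef struct {
    bool completed;
} job_info;

typedef struct {
    async_job active;
    job_info *current;  /* NULL whenever active == ASYNC_JOB_NONE */
} job_state;

/* Returns true if the completion was recorded, false if no job is active. */
bool handle_dump_completed(job_state *job)
{
    assert(job != NULL);

    /* Check first: with no async job, 'current' is NULL and must not
     * be dereferenced. */
    if (job->active == ASYNC_JOB_NONE)
        return false;

    job->current->completed = true;
    return true;
}
```
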
Michal Privoznik
d09b73b560 qemu: Drop @unionMems argument from qemuProcessSetupPid()
The @unionMems argument of qemuProcessSetupPid() function is not
really necessary as all callers pass 'true'. Drop it.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Martin Kletzander <mkletzan@redhat.com>
2023-06-08 09:39:20 +02:00
Michal Privoznik
83adba541a qemu: Allow more generous cpuset.mems for vCPUs and IOThreads
The unit that cpuset CGroups controller works with is a
thread/process, not individual memory allocations. Therefore,
after we've set cpuset.mems for emulator (after previous commit
it's set to union of all host NUMA nodes allowed for given
domain), and as we try to set up cpuset.mems for vCPUs/IOThreads,
memory is migrated to selected NUMA node(s). We are effectively
saying: "this thread (vCPU thread) can have memory only from
these NUMA node(s)".

That's not really what we want though. The cpuset controller
doesn't differentiate memory "belonging" to the emulator thread
and vCPU thread or IOThread even.

Therefore, set union of all allowed host NUMA nodes, just like
we're doing for the emulator thread.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2138150
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Martin Kletzander <mkletzan@redhat.com>
2023-06-08 09:39:20 +02:00
Michal Privoznik
fddbb2f12f qemu: Don't try to 'fix up' cpuset.mems after QEMU's memory allocation
In an ideal world, my plan was perfect. We allow union of all host
nodes in cpuset.mems and once QEMU has allocated its memory, we
'fix up' restriction of its emulator thread by writing the
original value we wanted to set all along. But in fact, we can't
do it because that triggers memory movement. For instance,
consider the following <numatune/>:

  <numatune>
    <memory mode="strict" nodeset="0"/>
    <memnode cellid="1" mode="strict" nodeset="1"/>
  </numatune>

  <numa>
    <cell id="0" cpus="0-1" memory="1024000" unit="KiB" />
    <cell id="1" cpus="2-3" memory="1048576" unit="KiB"/>
  </numa>

This is meant to create a 1:1 mapping between guest and host NUMA
nodes. So we start QEMU with cpuset.mems set to "0-1" (so that it
can allocate memory even for guest node #1 and have the memory
come from host node #1) and then set cpuset.mems to "0" (because
that's where we wanted the emulator thread to live).

But this in turn triggers movement of all memory (even the
allocated one) to host NUMA node #0. Therefore, we have to just
keep cpuset.mems untouched and rely on .host-nodes passed on the
QEMU cmd line.

The placement still suffers because of cpuset.mems set for vcpus
or iothreads, but that's fixed in next commit.

Fixes: 3ec6d586bc
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Martin Kletzander <mkletzan@redhat.com>
2023-06-08 09:39:20 +02:00
Peter Krempa
9d6867198d qemuMonitorSetBlockIoThrottle: Drop 'diskalias' argument
Every caller will pass 'qdevid' as it's populated in the data
mandatorily with qemu-4.2 and onwards due to mandatory -blockdev use.

Thus we can drop compatibility with the old way of matching the disk via
alias.

Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
2023-06-05 13:20:13 +02:00
Michal Privoznik
8b9d2bda8a qemu: Set proper PCI backend for <interface/>-s that are actually hostdevs
When starting a domain, it's done so in two steps (actually more,
but let's focus on just the following two):

  1) qemuProcessPrepareDomain(), followed by

  2) qemuProcessPrepareHost().

Now, in the first step (PrepareDomain()), PCI backends for all
hostdevs are set (qemuProcessPrepareDomain() ->
qemuProcessPrepareDomainHostdevs() -> qemuDomainPrepareHostdev()
-> qemuDomainPrepareHostdevPCI()). Perfect.

But then, additional hostdevs may appear, because in the host
prepare phase we may insert some hostdevs into domain definition
(qemuProcessPrepareHost() -> qemuProcessNetworkPrepareDevices()).

Now, these additional hostdevs don't undergo the same prepare as
hostdevs that were already present in the domain definition (i.e.
in qemuProcessPrepareDomain() phase). Therefore, we have to call
corresponding prepare function explicitly.

NB, the interface hotplug code (qemuDomainAttachNetDevice()) does
not suffer from this problem, because it calls top level
qemuDomainAttachHostDevice() which is used to hotplug regular
hostdevs too and as such calls qemuDomainPrepareHostdev().

Fixes: 3b87709c76
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2209853
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Martin Kletzander <mkletzan@redhat.com>
2023-06-05 12:18:53 +02:00
Michal Privoznik
e53291514c qemu_hotplug: Temporarily allow emulator thread to access other NUMA nodes during mem hotplug
Again, this fixes the same problem as one of the previous commits,
but this time for memory hotplug. Long story short, if there's a
domain running and the emulator thread is restricted to a subset
of host NUMA nodes, but the memory that's about to be hotplugged
requires memory from a host NUMA node that's not in the set, we
need to allow the emulator thread to access the node, temporarily.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Martin Kletzander <mkletzan@redhat.com>
2023-05-23 17:21:16 +02:00
Michal Privoznik
3ec6d586bc qemu: Start emulator thread with more generous cpuset.mems
Consider a domain with two guest NUMA nodes and the following
<numatune/> setting:

  <numatune>
    <memory mode="strict" nodeset="0"/>
    <memnode cellid="0" mode="strict" nodeset="1"/>
  </numatune>

What this means is the emulator thread is pinned onto host NUMA
node #0 (by setting corresponding cpuset.mems to "0"), and two
memory-backend-* objects are created:

  -object '{"qom-type":"memory-backend-ram","id":"ram-node0", .., "host-nodes":[1],"policy":"bind"}' \
  -numa node,nodeid=0,cpus=0-1,memdev=ram-node0 \
  -object '{"qom-type":"memory-backend-ram","id":"ram-node1", .., "host-nodes":[0],"policy":"bind"}' \
  -numa node,nodeid=1,cpus=2-3,memdev=ram-node1 \

Note, the emulator thread is pinned well before QEMU is even
exec()-ed.

Now, the way memory allocation works in QEMU is: the emulator
thread calls mmap() followed by mbind() (which is sane, that's
how everybody should do it). BUT, because the thread is already
restricted by CGroups to just NUMA node #0, calling:

  mbind(host-nodes:[1]); /* made up syntax (TM) */

fails. This is expected though. Kernel was instructed to place
the memory at NUMA node "0" and yet, process is trying to place
it elsewhere.

We used to solve this by not restricting emulator thread at all
initially, and only after it's done initializing (i.e. we got the
QMP greeting) we placed it onto desired nodes. But this had its
own problems (e.g. QEMU might have locked pieces of its memory
which were then unable to migrate onto different NUMA nodes).

Therefore, in v5.1.0-rc1~282 we've changed this and set cgroups
upfront (even before exec()-ing QEMU). And this used to work, but
something has changed (I can't really put my finger on it).

Therefore, for the initialization start the thread with union of
all configured host NUMA nodes ("0-1" in our example) and fix the
placement only after QEMU is started.

NB, the memory hotplug suffers the same problem, but that will
be fixed in the next commit.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2138150
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Martin Kletzander <mkletzan@redhat.com>
2023-05-23 17:21:16 +02:00
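
The "union of all configured host NUMA nodes" from 3ec6d586bc can be sketched with a bitmask. This representation is an assumption made for brevity (libvirt does not store node sets as a single 64-bit integer), and `nodeset` and `nodeset_union()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host NUMA node set as a bitmask: bit N set means node N is allowed.
 * Hypothetical representation, limited to 64 nodes. */
typedef uint64_t nodeset;

/* Returns the union of all given node sets. */
nodeset nodeset_union(const nodeset *sets, size_t nsets)
{
    nodeset all = 0;

    assert(sets != NULL || nsets == 0);
    for (size_t i = 0; i < nsets; i++)
        all |= sets[i];
    return all;
}
```

For the example in the commit message, the <memory/> nodeset "0" and the <memnode/> nodeset "1" combine into nodes 0 and 1, which is the "0-1" value used for cpuset.mems during startup.
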
Michal Privoznik
c4a7f8007c qemuProcessSetupPid: Use @numatune variable more
Inside of qemuProcessSetupPid() there's @numatune variable which
is set to vm->def->numa, but it lives only in one block. In the
rest of places the expanded form (vm->def->numa) is used instead.
Move the variable declaration at the beginning of the function
and use it instead of the expanded form.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Martin Kletzander <mkletzan@redhat.com>
2023-05-23 17:21:16 +02:00
Michal Privoznik
37e41b7f16 qemu: Drop @forceVFIO argument of qemuDomainGetMemLockLimitBytes()
After previous cleanup, there's not a single caller that would
call qemuDomainGetMemLockLimitBytes() with @forceVFIO set. All
callers pass false.

Drop the unneeded argument from the function.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Martin Kletzander <mkletzan@redhat.com>
2023-05-16 14:43:43 +02:00
Andrea Bolognani
934113d376 qemu: Find helpers at runtime
Use the recently introduced virFindFileInPathFull() function to
discover the path for qemu-bridge-helper and qemu-pr-helper at
runtime.

Note that it's still possible for the administrator to prevent
this lookup and use arbitrary binaries by setting the
appropriate keys in qemu.conf: this simply removes the need to
perform the lookup at build time, and thus to have the helpers
installed in the build environment.

Signed-off-by: Andrea Bolognani <abologna@redhat.com>
Reviewed-by: Martin Kletzander <mkletzan@redhat.com>
2023-05-10 18:54:09 +02:00
Michal Privoznik
4644aba0b0 qemu: Stop virQEMUCaps propagation into qemuHostdevPreparePCIDevices()
After previous cleanups, qemuHostdevPreparePCIDevices() no longer
needs virQEMUCaps. Drop its passing from callers.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Martin Kletzander <mkletzan@redhat.com>
2023-04-25 12:36:31 +02:00
Michal Privoznik
430fc2ec26 qemu: Remove empty functions
After previous cleanup, there are some functions that do nothing:

  qemuConnectDomainXMLToNativePrepareHostHostdev()
  qemuConnectDomainXMLToNativePrepareHost()
  qemuProcessPrepareHostHostdev()
  qemuProcessPrepareHostHostdevs()

Remove them.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Martin Kletzander <mkletzan@redhat.com>
2023-04-25 12:36:31 +02:00
Michal Privoznik
fea0d8c40d qemu: Move <hostdev> SCSI path generation into qemuDomainPrepareHostdev()
When preparing a SCSI <hostdev/> with passthrough of a host SCSI
adapter (i.e. no protocol), a virStorageSource structure is
initialized and stored inside virDomainHostdevDef. But the source
structure is filled in many places, with almost the same code.

Firstly, qemuProcessPrepareHostHostdev() and
qemuConnectDomainXMLToNativePrepareHostHostdev() are the same.

Secondly, qemuDomainPrepareHostdev() allocates the src structure,
only to let qemuProcessPrepareHostHostdev() fill src->path later.

Well, src->path can be filled at the same place where the src
structure is allocated (qemuDomainPrepareHostdev()) which renders
the other two functions needless.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Martin Kletzander <mkletzan@redhat.com>
2023-04-25 12:36:30 +02:00
Peter Krempa
b60efa9a39 qemuProcessRefreshDisks: Extract update of a single disk
Extract the logic to update one single disk (without emitting any
events) so that it can be reused when updating the state after a disk
hotplug.

Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
2023-04-24 12:57:56 +02:00
Peter Krempa
c8e7ed7f7b qemuProcessRefreshDisks: Properly compare tray status
The code compares the 'tray_open' boolean from 'struct
qemuDomainDiskInfo' directly against 'disk->tray_status' which is
declared as virDomainDiskTray (enum). Now the logic works correctly
because the _OPEN enum has value '1'.

Separate the event emission code from the update code and remember the
old tray state in a separate variable rather than having the sneaky
logic we have today.

Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
2023-04-24 12:57:56 +02:00
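
A sketch of the cleanup c8e7ed7f7b describes, with hypothetical names throughout. Only the fact that the _OPEN enum has value '1' comes from the commit message; the value 0 for the closed state, the `tray_status` enum and both helpers are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of a tray-status enum. */
typedef enum {
    TRAY_CLOSED = 0,    /* assumed value */
    TRAY_OPEN = 1
} tray_status;

/* Explicit conversion instead of comparing a bool against the enum. */
tray_status tray_status_from_bool(bool tray_open)
{
    return tray_open ? TRAY_OPEN : TRAY_CLOSED;
}

/* Updates *status and reports whether it changed. The old state is kept
 * in a separate variable so that the caller can decide about emitting
 * an event independently of the update itself. */
bool tray_status_update(tray_status *status, bool tray_open)
{
    tray_status old;

    assert(status != NULL);
    old = *status;
    *status = tray_status_from_bool(tray_open);
    return old != *status;
}
```
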
Martin Kletzander
383caddea1 qemu, ch: Move threads to cgroup dir before changing parameters
With cgroupv2 this has a better effect on the resource allocation.  An
excerpt from Documentation/admin-guide/cgroup-v2.rst explains it this
way:

  Migrating a process across cgroups is a relatively expensive operation
  and stateful resources such as memory are not moved together with the
  process.  This is an explicit design decision as there often exist
  inherent trade-offs between migration and various hot paths in terms
  of synchronization cost.

  [...]

  Setting a non-empty value to "cpuset.mems" causes memory of
  tasks within the cgroup to be migrated to the designated nodes if
  they are currently using memory outside of the designated nodes.

Signed-off-by: Martin Kletzander <mkletzan@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
2023-04-20 12:39:49 +02:00
Jiri Denemark
49f2835ee3 qemu/qemu_process: Update format strings in translated messages
Signed-off-by: Jiri Denemark <jdenemar@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
2023-04-01 11:40:34 +02:00
Michal Privoznik
b4ccb0dc41 qemu: Move cpuset preference evaluation into a separate function
The set of if()-s that determines the preference in cpumask used
for setting things like emulatorpin, vcpupin, etc. is going to be
re-used. Separate it out into a function.

You may think that this changes behaviour, but
qemuProcessPrepareDomainNUMAPlacement() ensures that
priv->autoCpuset is set for VIR_DOMAIN_CPU_PLACEMENT_MODE_AUTO.

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Kristina Hanicova <khanicov@redhat.com>
Reviewed-by: Andrea Bolognani <abologna@redhat.com>
2023-03-15 12:46:40 +01:00