Links between NUMA nodes can have different latencies and
bandwidths. This info is newly defined in ACPI 6.2 under
Heterogeneous Memory Attribute Table (HMAT) table. Linux kernel
learned how to report these values under sysfs and thus we can
expose them in our capabilities XML. The sysfs interface is
documented in kernel's Documentation/admin-guide/mm/numaperf.rst.
Long story short, two nodes can be in initiator-target
relationship. A node can be initiator if it has a CPU or a device
that's capable of initiating memory transfer. Therefore a node
that has just memory can only be target. An initiator-target link
can then have any combination of {bandwidth, latency} - {access,
read, write} attribute (6 in total). However, the standard says
access is applicable iff read and write values are the same.
Therefore, we really have just four combinations of attributes:
bandwidth-read, bandwidth-write, latency-read, latency-write.
This is the combination that kernel reports anyway.
Then, under /sys/system/devices/node/nodeX/acccessN/initiators we
find values for those 4 attributes and also symlinks named
"nodeN" which then represent initiators to nodeX. For instance:
/sys/system/node/node1/access1/initiators/node0 -> ../../node0
/sys/system/node/node1/access1/initiators/read_bandwidth
/sys/system/node/node1/access1/initiators/read_latency
/sys/system/node/node1/access1/initiators/write_bandwidth
/sys/system/node/node1/access1/initiators/write_latency
This means that node0 is initiator and node1 is target and values
of the interconnect can be read.
In theory, there can be separate links to memory side caches too
(e.g. one link from node X to node Y's main memory, another from
node X to node Y's L1 cache, another one to L2 cache and so on).
But sysfs does not express this relationship just yet.
The "accessN" means either "access0" or "access1". The difference
is that while the former expresses the best interconnect between
two nodes including CPUS and I/O devices (such as GPUs and NICs),
the latter includes only CPUs and thus is what we need.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1786309
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Martin Kletzander <mkletzan@redhat.com>
Memory on a NUMA node can have a side caches. Configuring these
for a domain was implemented in v6.6.0-rc1~249 and friends.
However, up until now mgmt applications did not really know what
values to pass because we were not exposing caches of the host.
With recent enough kernel these are exposed under sysfs and with
a bit of parsing we can extend our capabilities XML. The sysfs
structure is documented in kernel's
Documentation/admin-guide/mm/numaperf.rst and basically maps in
1:1 fashion to our virNumaCache structure.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Martin Kletzander <mkletzan@redhat.com>
It may happen that a NUMA node has no CPUs associated with it. We
allow this for domains since v6.6.0-rc1~250. Let's update our
capabilities schema to match that.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Martin Kletzander <mkletzan@redhat.com>
We supported autostart of node devices via an xml element, but this
is not consistent with other libvirt objects which use an explicit API
for setting autostart status. So revert this and implement it as an
official API in a future commit.
The initial support was refactored after merging, so this commit reverts
both of those previous commits.
Revert "virNodeDevCapMdevParseXML: Use virXMLPropEnum() for ./start/@type"
This reverts commit 9d4cd1d1cd.
Revert "nodedev: support auto-start property for mdevs"
This reverts commit 42a5585499.
Signed-off-by: Jonathon Jongsma <jjongsma@redhat.com>
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
In case the user wants to share the disk image between multiple VMs the
qemu driver needs to hotplug such disks to instantiate the backends.
Since that doesn't work for all disk configs add a switch to force this
behaviour.
Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
Reviewed-by: Pavel Hrdina <phrdina@redhat.com>
Using slice to cut off the end of the image is a perfectly vaid
configuration. Use 'unsignedInt' instead of 'positiveInteger' for the
'offset' attribute in the XML schema and modify one test case to cover
this use case.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1960993
Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Pavel Hrdina <phrdina@redhat.com>
After previous patches we have two structures:
virCapsHostNUMACellDistance and virNumaDistance which express the
same thing. And have the exact same members (modulo their names).
Drop the former in favor of the latter.
This change means that distances with value of 0 are no longer
printed out into capabilities XML, because domain XML code allows
partial distance specification and thus threats value of 0 as
unspecified by user (see virDomainNumaGetNodeDistance() which
returns the default LOCAL/REMOTE distance for value of 0).
Also, from ACPI 6.1 specification, section 5.2.17 System Locality
Distance Information Table (SLIT):
Distance values of 0-9 are reserved and have no meaning.
Thus we shouldn't be ever reporting 0 in neither domain nor
capabilities XML.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Peter Krempa <pkrempa@redhat.com>
The default value hard-coded in QEMU (64KiB) is not always the ideal.
Having a possibility to set the cluster_size by user may in specific
use-cases improve performance for QCOW2 images.
QEMU internally has some limits, the value has to be between 512B and
2048KiB and must by power of two, except when the image has Extended L2
Entries the minimal value has to be 16KiB.
Since qemu-img ensures the value is correct and the limit is not always
the same libvirt will not duplicate any of these checks as the error
message from qemu-img is good enough:
Cluster size must be a power of two between 512 and 2048k
Resolves: https://gitlab.com/libvirt/libvirt/-/issues/154
Signed-off-by: Pavel Hrdina <phrdina@redhat.com>
Reviewed-by: Peter Krempa <pkrempa@redhat.com>
This adds a new element to the mdev capabilities xml schema that
represents the start policy for a defined mediated device. The actual
auto-start functionality is handled behind the scenes by mdevctl, but it
wasn't yet hooked up in libvirt.
Signed-off-by: Boris Fiuczynski <fiuczy@linux.ibm.com>
Signed-off-by: Jonathon Jongsma <jjongsma@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
This adds a new XML element
<filesystem>
<binary>
<sandbox mode='chroot|namespace'/>
</binary>
</filesystem>
This will be used by qemu virtiofs
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Signed-off-by: Cole Robinson <crobinso@redhat.com>
Allow passing a socket of an externally launched virtiofsd
to the vhost-user-fs device.
<filesystem type='mount'>
<driver type='virtiofs' queue='1024'/>
<source socket='/tmp/sock/'/>
</filesystem>
https://bugzilla.redhat.com/show_bug.cgi?id=1855789
Signed-off-by: Ján Tomko <jtomko@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
This allows users to restrict memory nodes without setting any specific
memory policy, then 'restrictive' mode is useful.
Signed-off-by: Luyao Zhong <luyao.zhong@intel.com>
Signed-off-by: Martin Kletzander <mkletzan@redhat.com>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
It will be useful to be able to specify a particular UUID for a mediated
device when defining the node device. To accomodate that, allow this to
be specified in the xml schema. This patch also parses and formats that
value to the xml, but does not yet use it.
Signed-off-by: Jonathon Jongsma <jjongsma@redhat.com>
Reviewed-by: Erik Skultety <eskultet@redhat.com>
PCI devices can be associated with a unique integer index that is
exposed via ACPI. In Linux OS with systemd, this value is used for
provide a NIC device naming scheme that is stable across changes
in PCI slot configuration.
Reviewed-by: Peter Krempa <pkrempa@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Similar to the qemu.conf knob 'deprecation_behavior' add a per-VM knob
in the QEMU namespace:
<qemu:deprecation behavior='...'/>
Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Martin Kletzander <mkletzan@redhat.com>
This lets the app expose the virtual SCSI or IDE disks as solid state
devices by setting a rate of '1', or rotational media by setting a
rate between 1025 and 65534.
https://bugzilla.redhat.com/show_bug.cgi?id=1498955
Reviewed-by: Ján Tomko <jtomko@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
The
<os firmware='efi'>
<firmware type='efi'>
<feature enabled='no' name='enrolled-keys'/>
</firmware>
</os>
repeats the firmware attribute twice. This has no functional benefit, as
evidenced by fact that we use a single struct field to store both
attributes, while needlessly introducing an error scenario. The XML can
just be simplified to:
<os firmware='efi'>
<firmware>
<feature enabled='no' name='enrolled-keys'/>
</firmware>
</os>
which also means that we don't need to emit the empty element
<firmware type='efi'/> for all existing configs too.
Reviewed-by: Pavel Hrdina <phrdina@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
When the firmware auto-selection was introduced it always picked first
usable firmware based on the JSON descriptions on the host. It is
possible to add/remove/change the JSON files but it will always be for
the whole host.
This patch introduces support for configuring the auto-selection per VM
by adding users an option to limit what features they would like to have
available in the firmware.
Signed-off-by: Pavel Hrdina <phrdina@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Previously, validation of XML failed if sub-elements of video
device were in different order.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1825769
Signed-off-by: Kristina Hanicova <khanicov@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
Signed-off-by: Ján Tomko <jtomko@redhat.com>
This introduces support for the QEMU audio settings that are common to
all audio backends. These are expressed in the QAPI schema as settings
common to all backends, but in reality some backends ignore some of
them. For example, some backends are output only. The parser isn't
attempting to apply restrictions that QEMU itself doesn't apply.
<audio id='1' type='pulseaudio'>
<input mixingEngine='yes' fixedSettings='yes' voices='1' bufferLength='100'>
<settings frequency='44100' channels='2' format='s16'/>
</input>
<output mixingEngine='yes' fixedSettings='yes' voices='2' bufferLength='100'>
<settings frequency='22050' channels='4' format='f32'/>
</output>
</audio>
The <settings> child is only valid if fixedSettings='yes'
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
When there are multiple <audio> backends specified, it is possible to
assign a specific one to the VNC server using
<graphics type='vnc'...>
<audio id='1'/>
</graphics>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
The current <audio> element only allows an "OSS" audio backend, as this
is all that BHyve needed. This is now extended to cover most QEMU audio
backends. These backends all have a variety of attributes they support,
but this initial impl does the bare minimum, relying on built-in
defaults for everything. The only QEMU backend omitted is "dsound" since
the libvirt QEMU driver is not built on Windows platforms.
The SDL audio driver names are based on the SDL 2.0 drivers. It is not
intended to support SDL 1.2 drivers.
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
To prepare for the introduction for more backend specific audio options,
move the OSS options into a dedicated struct and introduce separate
helper methods for parse/format/free.
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
The <graphics type="vnc" .... powerControl="yes"/> option instructs the
VNC server to enable an extension that lets the client perform a
graceful shutdown, reboot and hard reset.
This is enabled by default since it cannot be assumed that the VNC
client user has administrator rights over the guest OS. In the case
where the VNC user is a guest administrator though, it is reasonable
to allow direct power control host side too.
Reviewed-by: Peter Krempa <pkrempa@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Commit <d505b8af58912ae1e1a211fabc9995b19bd40828> changed the cpu quota
value that reflects what kernel allows but did not update our
documentation.
Signed-off-by: Pavel Hrdina <phrdina@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
The <teaming> element in <interface> allows pairing two interfaces
together as a simple "failover bond" network device in a guest. One of
the devices is the "transient" interface - it will be preferred for
all network traffic when it is present, but may be removed when
necessary, in particular during migration, when traffic will instead
go through the other interface of the pair - the "persistent"
interface. As it happens, in the QEMU implementation of this teaming
pair (called "virtio failover" in QEMU) the transient interface is
always a host network device assigned to the guest using VFIO (aka
"hostdev"); the persistent interface is always an emulated virtio NIC.
When support was initially added for <teaming>, it was written to
require that the transient/hostdev device be defined using <interface
type='hostdev'>; this was done because the virtio failover
implementation in QEMU and the virtio guest driver demands that the
two interfaces in the pair have matching MAC addresses, and the only
way libvirt can guarantee the MAC address of a hostdev network device
is to use <interface type='hostdev'>, whose main purpose is to
configure the device's MAC address before handing the device to
QEMU. (note that <interface type='hostdev'> in turn requires that the
network device be an SRIOV VF (Virtual Function), as that is the only
type of network device whose MAC address we can set in a way that will
survive the device's driver init in the guest).
It has recently come up that some users are unable to use <teaming>
because they are running in a container environment where libvirt
doesn't have the necessary privileges or resources to set the VF's MAC
address (because setting the VF MAC is done via the same device's PF
(Physical Function), and the PF is not exposed to libvirt's container).
At the same time, these users *are* able to set the VF's MAC address
themselves in advance of staring up libvirt in the container. So they
could theoretically use the <teaming> feature if libvirt just skipped
the "setting the MAC address" part.
Fortunately, that is *exactly* the difference between <interface
type='hostdev'> (which must be a "hostdev VF") and <hostdev> (a "plain
hostdev" - it could be *any* PCI device; libvirt doesn't know what type
of PCI device it is, and doesn't care).
But what is still needed is for libvirt to provide a small bit of
information on the QEMU commandline argument for the hostdev, telling
QEMU that this device will be part of a team ("failover pair"), and
the id of the other device in the pair.
To make both of those goals simultaneously possible, this patch adds
support for the <teaming> element to plain <hostdev> - libvirt doesn't
try to set any MAC addresses, and QEMU gets the extra commandline
argument it needs)
(actually, this patch adds only the parsing/formatting of the
<teaming> element in <hostdev>. The next patch will actually wire that
into the qemu driver.)
Signed-off-by: Laine Stump <laine@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Signed-off-by: Jiri Denemark <jdenemar@redhat.com>
Reviewed-by: Pavel Hrdina <phrdina@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
The data reported is the same as for "host-passthrough"
Reviewed-by: Pavel Hrdina <phrdina@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
QEMU has the ability to mark machine types as deprecated. This should be
exposed to management applications in the capabilities.
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
QEMU has the ability to mark CPUs as deprecated. This should be exposed
to management applications in the domain capabilities.
This attribute is only set when the model is actually deprecated.
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Pass the parameter clock rt to qemu to ensure that the
virtual machine is not synchronized with the host time
Signed-off-by: gongwei <gongwei@smartx.com>
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Add virtio related options iommu, ats and packed as driver element attributes
to vsock devices. Ex:
<vsock model='virtio'>
<cid auto='no' address='3'/>
<driver iommu='on'/>
</vsock>
Signed-off-by: Boris Fiuczynski <fiuczy@linux.ibm.com>
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
The virtio-pmem is a virtio variant of NVDIMM and just like
NVDIMM virtio-pmem also allows accessing host pages bypassing
guest page cache. The difference is that if a regular file is
used to back guest's NVDIMM (model='nvdimm') the persistence of
guest writes might not be guaranteed while with virtio-pmem it
is.
To express this new model at domain XML level, I've chosen the
following:
<memory model='virtio-pmem' access='shared'>
<source>
<path>/tmp/virtio_pmem</path>
</source>
<target>
<size unit='KiB'>524288</size>
</target>
<address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
</memory>
Another difference between NVDIMM and virtio-pmem is that while
the former supports NUMA node locality the latter doesn't. And
also, the latter goes onto PCI bus and not into a DIMM module.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Similarly to the domain config code it may be beneficial to control the
cache size of images introduced as snapshots into the backing chain.
Wire up handling of the 'metadata_cache' element.
Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
In certain specific cases it might be beneficial to be able to control
the metadata caching of storage image format drivers of a hypervisor.
Introduce XML machinery to set the maximum size of the metadata cache
which will be used by qemu's qcow2 driver.
Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
Add documentation and schema for the new disk transport protocol.
Signed-off-by: Ryan Gahagan <rgahagan@cs.utexas.edu>
Reviewed-by: Peter Krempa <pkrempa@redhat.com>
Objects such as domain, pool, etc re-define the regex for the format.
Add more generic types for objects with/without a slash which we'll be
able to reuse also for other objects.
Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
New libxml2 handles '\n' properly so the literal newline is not
necessary, because 2.9.1 is the minimum version we support.
Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
'genericName' allows arbitrary numeric strings so using an explicit
'unsignedInt' choice is pointless. The elements take an username or a
uid which is prefixed by '+', both of which are covered by
'genericName'.
Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
Now that individual child elements allow their children to be
interleaved, let's allow direct children of <filesystem/> to be
interleaved too.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
The <binary/> element of <filesystem/> can have children elements
(<cache/> and <lock/>). Allow them to be interleaved.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>
Our <filesystem/> element can have <driver/> child element. But
with the way our schema is written it can't be interleaved and
has to go first.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Ján Tomko <jtomko@redhat.com>