If the host OS doesn't have NUMA present, we fall back to
populating fake NUMA info and the code thus assumes only a
single NUMA node.
Unfortunately we also fall back to fake NUMA if numactl-devel
was not present at build time, and in that case we can still
have multiple NUMA nodes. We then create all CPUs, but only
the CPUs in the first node have any data filled in, resulting
in capabilities like:
  <topology>
    <cells num='1'>
      <cell id='0'>
        <memory unit='KiB'>15977572</memory>
        <cpus num='48'>
          <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
          <cpu id='1' socket_id='0' core_id='0' siblings='1'/>
          <cpu id='2' socket_id='0' core_id='1' siblings='2'/>
          <cpu id='3' socket_id='0' core_id='1' siblings='3'/>
          <cpu id='4' socket_id='0' core_id='2' siblings='4'/>
          <cpu id='5' socket_id='0' core_id='2' siblings='5'/>
          <cpu id='6' socket_id='0' core_id='3' siblings='6'/>
          <cpu id='7' socket_id='0' core_id='3' siblings='7'/>
          <cpu id='8' socket_id='0' core_id='4' siblings='8'/>
          <cpu id='9' socket_id='0' core_id='4' siblings='9'/>
          <cpu id='10' socket_id='0' core_id='5' siblings='10'/>
          <cpu id='11' socket_id='0' core_id='5' siblings='11'/>
          <cpu id='0'/>
          <cpu id='0'/>
          <cpu id='0'/>
          <cpu id='0'/>
          <cpu id='0'/>
          <cpu id='0'/>
          <cpu id='0'/>
          <cpu id='0'/>
          <cpu id='0'/>
          <cpu id='0'/>
          <cpu id='0'/>
        </cpus>
      </cell>
    </cells>
  </topology>
With this new code we get something slightly less broken:
  <topology>
    <cells num='4'>
      <cell id='0'>
        <memory unit='KiB'>15977572</memory>
        <cpus num='12'>
          <cpu id='0' socket_id='0' core_id='0' siblings='0-1'/>
          <cpu id='1' socket_id='0' core_id='0' siblings='0-1'/>
          <cpu id='2' socket_id='0' core_id='1' siblings='2-3'/>
          <cpu id='3' socket_id='0' core_id='1' siblings='2-3'/>
          <cpu id='4' socket_id='0' core_id='2' siblings='4-5'/>
          <cpu id='5' socket_id='0' core_id='2' siblings='4-5'/>
          <cpu id='6' socket_id='0' core_id='3' siblings='6-7'/>
          <cpu id='7' socket_id='0' core_id='3' siblings='6-7'/>
          <cpu id='8' socket_id='0' core_id='4' siblings='8-9'/>
          <cpu id='9' socket_id='0' core_id='4' siblings='8-9'/>
          <cpu id='10' socket_id='0' core_id='5' siblings='10-11'/>
          <cpu id='11' socket_id='0' core_id='5' siblings='10-11'/>
        </cpus>
      </cell>
      <cell id='1'>
        <memory unit='KiB'>15977572</memory>
        <cpus num='12'>
          <cpu id='12' socket_id='0' core_id='0' siblings='12-13'/>
          <cpu id='13' socket_id='0' core_id='0' siblings='12-13'/>
          <cpu id='14' socket_id='0' core_id='1' siblings='14-15'/>
          <cpu id='15' socket_id='0' core_id='1' siblings='14-15'/>
          <cpu id='16' socket_id='0' core_id='2' siblings='16-17'/>
          <cpu id='17' socket_id='0' core_id='2' siblings='16-17'/>
          <cpu id='18' socket_id='0' core_id='3' siblings='18-19'/>
          <cpu id='19' socket_id='0' core_id='3' siblings='18-19'/>
          <cpu id='20' socket_id='0' core_id='4' siblings='20-21'/>
          <cpu id='21' socket_id='0' core_id='4' siblings='20-21'/>
          <cpu id='22' socket_id='0' core_id='5' siblings='22-23'/>
          <cpu id='23' socket_id='0' core_id='5' siblings='22-23'/>
        </cpus>
      </cell>
    </cells>
  </topology>
The topology at least now reflects what 'virsh nodeinfo' reports.
The main bug is that the CPU "id" values won't match what the Linux
host actually uses.
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
The 'caps' object is already allocated when the fake NUMA
initialization takes place.
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
The current 'for' loop with 5 consecutive 'ifs' inside
qemuBuildHostdevCommandLine can be a bit smarter:
- all 5 'ifs' fail if hostdev->mode is not equal to
VIR_DOMAIN_HOSTDEV_MODE_SUBSYS. This check can be moved to the
start of the loop, skipping to the next element immediately
if the check fails;
- all 5 'ifs' check for a specific subsys->type to build the proper
command line argument (virHostdevIsSCSIDevice and virHostdevIsMdevDevice
do that, but within a helper). The problem is that the code keeps
checking for matches even after one was found, although there is
no way a hostdev can match more than one 'if' (i.e. a hostdev can't
have 2+ different types). This means that a SUBSYS_TYPE_USB hostdev
will create its command line argument in the first 'if', and then
all the other conditionals are guaranteed to fail but end up being
checked anyway.
All of this can be avoided by moving the hostdev->mode comparison
to the start of the loop and using a switch statement on
subsys->type to execute the proper code for a given hostdev
type, as the sketch below shows.
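A minimal sketch of the reshaped loop (structure only; the enum and
field names follow libvirt conventions, but the surrounding code is
elided and should be treated as an assumption):

    for (i = 0; i < def->nhostdevs; i++) {
        virDomainHostdevDefPtr hostdev = def->hostdevs[i];
        virDomainHostdevSubsysPtr subsys = &hostdev->source.subsys;

        /* the common check, done once per element */
        if (hostdev->mode != VIR_DOMAIN_HOSTDEV_MODE_SUBSYS)
            continue;

        /* exactly one branch can match a given hostdev */
        switch ((virDomainHostdevSubsysType) subsys->type) {
        case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_USB:
            /* build the USB command line argument */
            break;
        case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI:
            /* build the PCI command line argument */
            break;
        case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_SCSI:
        case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_SCSI_HOST:
        case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV:
            /* likewise for SCSI, SCSI host and mdev devices */
            break;
        case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_LAST:
            break;
        }
    }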
Suggested-by: Ján Tomko <jtomko@redhat.com>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
The code calling this method expects it to have reported an error on
failure.
Reviewed-by: Cole Robinson <crobinso@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
When freeing qemu driver struct members, we forgot to free the
@hostcpu and @hostnuma members.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
This function is supposed to clean up the virQEMUDriver structure
and free individual members. However, it does so in a seemingly
random order, which makes it hard to track which members are being
freed and which are not. Free the members in the reverse order of
the structure definition - assuming that the most important members
(like the mutex) are declared first and freed last.
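As a hypothetical illustration of the rule (the member names here
are invented, not the real virQEMUDriver layout):

    struct _virQEMUDriver {
        virMutex lock;              /* declared first ... */
        char *stateDir;
        virHashTablePtr domains;    /* ... declared last */
    };

    static void
    qemuDriverFree(virQEMUDriverPtr driver)
    {
        /* free in reverse declaration order */
        virHashFree(driver->domains);
        g_free(driver->stateDir);
        virMutexDestroy(&driver->lock);  /* the mutex goes last */
    }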
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Fortunately, this is not causing any problems now, because glib
does this check for us when the function is called via the cleanup
attribute. But in a future commit we will explicitly call this
function on a struct member that might be NULL.
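The pattern being introduced is the usual NULL-tolerant free
function; a sketch with a made-up type:

    void
    virFooFree(virFooPtr foo)
    {
        if (!foo)
            return;  /* safe to call with NULL, like g_free() */

        g_free(foo->name);
        g_free(foo);
    }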
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
In testXLInitDriver() a dummy driver structure is filled in and
later freed in testXLFreeDriver(). However, it is sufficient
to unref just driver->config, because that results in
libxlDriverConfigDispose() being called, which unrefs
driver->config->caps. There is no need to unref it again in
testXLFreeDriver() - in fact it's undesired.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
When generating domain capabilities, we need to fake the host CPU
to get reproducible results. We do this by copying a pre-existing
CPU config and setting the VIR_TEST_MOCK_FAKE_HOST_CPU env
variable, which is then consumed by qemucpumock. However, we
forget to free the CPU copy afterwards:
  2,196 (2,016 direct, 180 indirect) bytes in 18 blocks are definitely lost in loss record 291 of 297
     at 0x4838B86: calloc (vg_replace_malloc.c:762)
     by 0x57CB6A0: g_malloc0 (in /usr/lib64/libglib-2.0.so.0.6000.7)
     by 0x4A0F72D: virCPUDefNew (cpu_conf.c:87)
     by 0x4A0FAC7: virCPUDefCopyWithoutModel (cpu_conf.c:235)
     by 0x4A0FBBE: virCPUDefCopy (cpu_conf.c:273)
     by 0x10E3C0: testUtilsHostCpusGetDefForArch (testutilshostcpus.h:157)
     by 0x10E3C0: fakeHostCPU (domaincapstest.c:61)
     by 0x10E3C0: fillQemuCaps (domaincapstest.c:86)
     by 0x10E3C0: test_virDomainCapsFormat (domaincapstest.c:234)
     by 0x10F4BC: virTestRun (testutils.c:146)
     by 0x10DE93: doTestQemuInternal (domaincapstest.c:301)
     by 0x10E13D: doTestQemu (domaincapstest.c:332)
     by 0x1124CF: testQemuCapsIterate (testutilsqemu.c:635)
     by 0x10DCE3: mymain (domaincapstest.c:435)
     by 0x10FD8B: virTestMain (testutils.c:916)
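The fix boils down to releasing the copy once the test is done;
schematically (the local variable name is an assumption):

    virCPUDefPtr cpu = virCPUDefCopy(hostCpu);  /* handed to qemucpumock */

    /* ... run the test ... */

    virCPUDefFree(cpu);  /* previously missing */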
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Assuming that the backing image format is raw is wrong when doing image
detection:
1) In -drive mode qemu will still probe the image format of the backing
image. This means it will try to open the backing file of the image,
which will fail if a more advanced security model is in use.
2) In blockdev mode the image will actually be opened as raw, which is
wrong since it might be qcow. Not opening the backing images will
also result in the guest seeing corrupted data.
Rather than attempting to handle the various corner cases where our
assumption that the storage file is raw happens to be right, forbid
startup when the guest image doesn't have the format specified in its
metadata.
https://bugzilla.redhat.com/show_bug.cgi?id=1588373
Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
The EXP_WARN and ALLOW_PROBE flags for the testStorageChain cases are
no longer used, so we can remove them.
Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Pass in 'true' as '@report_broken' of virStorageFileGetMetadata to make
it fail in the tests. The most important code paths (when starting the
VM) expect this function to fail rather than silently return partial
data. Switch the test to exercise this more important code path.
Signed-off-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Prior to commit 55ce656463 (first in libvirt 4.6.0), the XML sent to
virDomainAttachDeviceFlags() was parsed only once, and the results of
that parse were inserted into both the live object of the running
domain and the persistent config. Thus, if the MAC address was
omitted from the XML for a network device (<interface>), both the
live and config objects would have the same MAC address.
Commit 55ce656463 changed the code to parse the incoming XML twice -
once for the live object and once for the config. This does
eliminate the problem of PCI (/scsi/sata) address conflicts caused
by allocating an address based on the existing devices in the live
object and then inserting the result into the config (which may
already have a device using that address), BUT it also means that
when the MAC address of a network device hasn't been specified in
the XML, each copy will get a different auto-generated MAC address.
As a result, the MAC address of the device changes the next time
the domain is shut down and restarted, which creates havoc with the
guest OS's network config.
There have been several discussions about this over the past year
and more, attempting to find the ideal solution that makes MAC
addresses consistent and accounts for all sorts of corner cases
with PCI/scsi/sata addresses. All of these discussions fizzled out
because every proposal was either too difficult to implement or
failed to fix some esoteric case someone thought up.
So, in the interest of solving the MAC address problem while not
making the "other address" situation any worse than before, this
patch simply adds a qemuDomainAttachDeviceLiveAndConfigHomogenize()
function that (for now) copies the MAC address from the config
object to the live object. If the original XML had
<mac address='blah'/> then this is effectively a NOP, since the
MACs already match.
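In outline, the helper just mirrors the config-side MAC onto the
live copy; a sketch, handling only the net case as described above
(the parameter names are assumptions):

    static int
    qemuDomainAttachDeviceLiveAndConfigHomogenize(virDomainDeviceDefPtr devConf,
                                                  virDomainDeviceDefPtr devLive)
    {
        /* for now, only an auto-generated MAC needs homogenizing */
        if (devConf->type == VIR_DOMAIN_DEVICE_NET &&
            devLive->type == VIR_DOMAIN_DEVICE_NET)
            virMacAddrSet(&devLive->data.net->mac, &devConf->data.net->mac);

        return 0;
    }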
Any downstream libvirt containing upstream commit
55ce656463 should have this patch as well.
https://bugzilla.redhat.com/1783411
Signed-off-by: Laine Stump <laine@redhat.com>
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
The intent of get_nonnull_domain() is not to validate virDomain
as sent by the client but just to construct the virDomain
structure. The validation is then done in each API when looking
up the domain in our internal hash tables.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Cole Robinson <crobinso@redhat.com>
There are some functions which pass virConnectPtr around for one
reason and one reason only: to obtain virLXCDriverPtr in the end.
We might just as well replace the argument and pass a pointer to
the driver right from the start.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Cole Robinson <crobinso@redhat.com>
If we use glib alloc functions, we can drop the 'cleanup' label
and @rv variable and also simplify the code a bit.
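For illustration, the kind of change involved (a hypothetical
snippet; VIR_ALLOC_N is the old-style allocator):

    /* before: OOM had to be handled, forcing a 'cleanup' label */
    if (VIR_ALLOC_N(names, ndomains) < 0)
        goto cleanup;

    /* after: g_new0() aborts on OOM, so no error path is needed */
    names = g_new0(char *, ndomains);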
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Cole Robinson <crobinso@redhat.com>
Some variables are not used outside of the for() loop. Move their
declarations into the loop to clean up the code a bit.
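That is, instead of declaring the variable before the loop
(made-up names):

    for (i = 0; i < ndevices; i++) {
        g_autofree char *path = NULL;  /* only lives in this iteration */

        /* ... */
    }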
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Cole Robinson <crobinso@redhat.com>
When using the monolithic daemon, dom->conn has all driver
tables filled in properly and thus it's safe to call an API other
than virDomain*(). However, when using split daemons,
dom->conn has only the hypervisor driver table set
(dom->conn->driver) and the rest is NULL. Therefore, if we want
to call a non-domain API (virNetworkLookupByName() in this case),
we have to obtain the cached connection object accessible via
virGetConnectNetwork().
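Sketched, the lookup then looks like this (error handling mostly
elided; variable names assumed):

    virConnectPtr netconn = NULL;
    virNetworkPtr net = NULL;

    if (!(netconn = virGetConnectNetwork()))
        return -1;

    /* use the network driver's own connection, not dom->conn */
    net = virNetworkLookupByName(netconn, netname);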
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Cole Robinson <crobinso@redhat.com>
If we use glib alloc functions, we can drop the 'cleanup' label
and @rv variable and also simplify the code a bit.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Cole Robinson <crobinso@redhat.com>
Some variables are not used outside of the for() loop. Move their
declarations into the loop to clean up the code a bit.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Cole Robinson <crobinso@redhat.com>
When using the monolithic daemon, dom->conn has all driver
tables filled in properly and thus it's safe to call an API other
than virDomain*(). However, when using split daemons,
dom->conn has only the hypervisor driver table set
(dom->conn->driver) and the rest is NULL. Therefore, if we want
to call a non-domain API (virNetworkLookupByName() in this case),
we have to obtain the cached connection object accessible via
virGetConnectNetwork().
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Cole Robinson <crobinso@redhat.com>
If we place qemuDomainInterfaceAddresses() a few lines below the
two functions it uses, we can drop the forward declarations of
those functions.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Cole Robinson <crobinso@redhat.com>
While fixing the bug, also convert the whole function to the
new-style memory allocation handling.
Reviewed-by: Cole Robinson <crobinso@redhat.com>
Signed-off-by: Pavel Mores <pmores@redhat.com>
Pick 256k as the limit.
While -Wno-frame-larger-than would make more sense for usage
in our test suite, the -Wno version seems to have no effect
if -Wframe-larger-than was already specified.
Use an (un)reasonably large value instead.
Fixes the build with clang:
../../tests/cputest.c:964:1: error: stack frame size of 33176 bytes
in function 'mymain' [-Werror,-Wframe-larger-than=]
mymain(void)
^
1 error generated.
Signed-off-by: Ján Tomko <jtomko@redhat.com>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
My commit e73889b631
split the -Wframe-larger-than warning setting into
two different variables - STRICT_FRAME_LIMIT_CFLAGS
for the library code and RELAXED_FRAME_LIMIT_CFLAGS
which was needed for tests.
Use the strict limit by default and specify the warning
flag twice for the parts that require a larger stack
frame, relying on the fact that the compiler will pick
up the latter value.
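In other words, compiling with something like
'-Wframe-larger-than=4096 ... -Wframe-larger-than=262144' leaves
the relaxed limit in effect, since the compiler honours the last
occurrence of the flag (the values here are illustrative).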
Signed-off-by: Ján Tomko <jtomko@redhat.com>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
This is slightly more complicated because the NVMe disk source is
not a simple attribute of the <source/> element. The PCI address
and namespace ID are printed in the same format that QEMU accepts
them in:
nvme://XXXX:XX:XX.X/X
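For instance, namespace 1 of a controller at PCI address
0000:01:00.0 would be written as nvme://0000:01:00.0/1 (the address
here is illustrative).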
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Cole Robinson <crobinso@redhat.com>
With NVMe disks, one can start a blockjob with an NVMe disk that
is not visible in the domain XML (at least not right away).
Usually, it's fairly easy to work around this limitation of
qemuDomainGetMemLockLimitBytes() - for instance, for hostdevs we
temporarily add the device to the domain def, let the function
calculate the limit, and then remove the device. But it's not so
easy with virStorageSourcePtr - in some cases they aren't
necessarily attached to a disk. And even if they are, that happens
later in the process and, frankly, I find it too complicated to
use the simple trick we use with hostdevs.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Cole Robinson <crobinso@redhat.com>
At the very beginning of the attach function,
qemuDomainStorageSourceChainAccessAllow() is called, which
modifies CGroups, locks and seclabels for the new disk and its
backing chain. This must be followed by a counterpart which
reverts all the changes if something goes wrong. That boils down
to calling qemuDomainStorageSourceChainAccessRevoke(), which is
done under the 'error' label. But not all failure branches jump
there; some jump to the 'cleanup' label where no revoke is done.
Such a mistake is easy to make because the 'cleanup' label does
exist. Therefore, dissolve the 'error' block into 'cleanup' and
have everything jump to the 'cleanup' label.
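Schematically, the tail of the function then reads (a sketch; the
exact arguments are assumptions):

     cleanup:
        /* the revoke that used to live under 'error:' */
        if (ret < 0)
            qemuDomainStorageSourceChainAccessRevoke(driver, vm, disk->src);
        return ret;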
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Cole Robinson <crobinso@redhat.com>
Because it is HMP we're dealing with here, there is nothing like a
class of reply message, so we have to do some string comparison to
guess whether the command failed. Well, with NVMe disks a whole
new class of errors comes into play, because qemu needs to
initialize the IOMMU and VFIO for them. You can see all the
messages it may produce in qemu_vfio_init_pci().
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Cole Robinson <crobinso@redhat.com>
Now that we have everything prepared, we can generate the command
line for NVMe disks.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Cole Robinson <crobinso@redhat.com>
This capability tracks if qemu is capable of:
-drive file.driver=nvme
The feature was added in QEMU's commit of v2.12.0-rc0~104^2~2.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Cole Robinson <crobinso@redhat.com>
This function is currently not called for any type of storage
source that is not considered 'local' (as defined by
virStorageSourceIsLocalStorage()). Well, NVMe disks are not
'local' from that point of view and therefore we will need to
call this function more frequently.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Cole Robinson <crobinso@redhat.com>
If a domain has an NVMe disk configured, then we need to allow it
in the devices CGroup so that qemu can access it. There is one
caveat though - even if an NVMe disk is read-only, we need the
CGroup to allow write access too. This is because when opening the
device, qemu does a couple of ioctl()s which are considered
writes.
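Sketched (the cgroup handle and path variable are assumptions):

    /* VIR_CGROUP_DEVICE_RW even for read-only disks: qemu's open
     * path issues ioctl()s that the kernel accounts as writes */
    if (virCgroupAllowDevicePath(priv->cgroup, nvmePath,
                                 VIR_CGROUP_DEVICE_RW, false) < 0)
        return -1;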
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Cole Robinson <crobinso@redhat.com>
There are a couple of places where a domain with a VFIO device
gets special treatment: in CGroups when enabling/disabling access
to /dev/vfio/vfio, and when creating/removing nodes in the domain
mount namespace. Well, an NVMe disk is a VFIO device too.
Fortunately, we have the qemuDomainNeedsVFIO() function, which is
the only place that needs adjustment.
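The adjustment is essentially one more predicate in that helper
(a sketch; the existing body is simplified here):

    bool
    qemuDomainNeedsVFIO(const virDomainDef *def)
    {
        return virDomainDefHasVFIOHostdev(def) ||
               virDomainDefHasNVMeDisk(def);
    }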
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Cole Robinson <crobinso@redhat.com>
If a domain has an NVMe disk configured, then we need to create
the /dev/vfio/* paths in the domain's namespace so that qemu can
open them.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Cole Robinson <crobinso@redhat.com>
We have this beautiful function that does crystal ball
divination. The function is named
qemuDomainGetMemLockLimitBytes() and it calculates the upper
limit of how much locked memory a given guest is going to need.
The function bases its guess on the devices defined for the
domain. For instance, if there is a VFIO hostdev defined, it adds
1GiB to the guessed maximum. Since NVMe disks are pretty much
VFIO hostdevs (but not quite), we have to do the same sorcery.
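So the calculation gains a clause along these lines (a sketch; the
1GiB figure mirrors what is already done for VFIO hostdevs):

    if (virDomainDefHasNVMeDisk(def))
        memKB = virDomainDefGetMemoryTotal(def) + 1024 * 1024;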
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
ACKed-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Cole Robinson <crobinso@redhat.com>
The qemu driver has its own wrappers around the virHostdev module
(so that some arguments are filled in automatically). Extend these
to include NVMe devices too.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
ACKed-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Cole Robinson <crobinso@redhat.com>
The device configs (which are actually one and the same config)
come from an NVMe disk of mine.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
ACKed-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Cole Robinson <crobinso@redhat.com>
Now that we have the virNVMeDevice module (introduced in the
previous commit), let's use it in virHostdev to track which NVMe
devices are free to be used by a domain and which are taken.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Cole Robinson <crobinso@redhat.com>
This module will be used by virHostdevManager and is inspired by
the virPCIDevice module. The two are very similar, except in what
identifies an NVMe device: a PCI address AND a namespace ID. This
means that an NVMe device can appear in a domain multiple times,
each time with a different namespace.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Cole Robinson <crobinso@redhat.com>
This function will return true if any of the disks (or their
backing chains) of the given domain contains an NVMe disk.
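A sketch of the helper (assuming the per-source checker described
in the next message is named virStorageSourceChainHasNVMe):

    bool
    virDomainDefHasNVMeDisk(const virDomainDef *def)
    {
        size_t i;

        for (i = 0; i < def->ndisks; i++) {
            if (virStorageSourceChainHasNVMe(def->disks[i]->src))
                return true;
        }

        return false;
    }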
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
ACKed-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Cole Robinson <crobinso@redhat.com>
This function will return true if there's a storage source of
type VIR_STORAGE_TYPE_NVME, or false otherwise.
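Sketched (virStorageSourceIsBacking() terminates the walk at the
end of the backing chain; the function name follows the libvirt
naming convention and is an assumption here):

    bool
    virStorageSourceChainHasNVMe(const virStorageSource *src)
    {
        const virStorageSource *n;

        for (n = src; virStorageSourceIsBacking(n); n = n->backingStore) {
            if (n->type == VIR_STORAGE_TYPE_NVME)
                return true;
        }

        return false;
    }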
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
ACKed-by: Peter Krempa <pkrempa@redhat.com>
Reviewed-by: Cole Robinson <crobinso@redhat.com>
To simplify the implementation, some restrictions are added. For
instance, an NVMe disk can't go on any bus but virtio, has to be
of type 'disk', and can't have startupPolicy set.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Cole Robinson <crobinso@redhat.com>
There is this class of PCI devices that act like disks: NVMe.
Therefore, they are both PCI devices and disks. While we already
have <hostdev/> (and can assign an NVMe device to a domain
successfully), we don't have a disk representation. There are
three problems with PCI assignment in the case of an NVMe device:
1) domains with <hostdev/> can't be migrated
2) an NVMe device is assigned as a whole; there's no way to assign
only a namespace
3) because hypervisors see a <hostdev/>, they don't put a block
layer on top of it - users don't get all the fancy features like
snapshots
NVMe namespaces are a way of splitting one continuous chunk of
non-volatile memory into smaller ones, effectively creating
smaller NVMe devices (which can then be partitioned, LVMed, etc.)
Because of all of this, the following XML was chosen to model an
NVMe device:
  <disk type='nvme' device='disk'>
    <driver name='qemu' type='raw'/>
    <source type='pci' managed='yes' namespace='1'>
      <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </source>
    <target dev='vda' bus='virtio'/>
  </disk>
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Cole Robinson <crobinso@redhat.com>