Applications can now insert custom nodes and hierarchies into domain
configuration XML. Although currently not enforced, applications are
required to use their own namespaces on every custom node they insert,
with only one top-level element per namespace.
POLLIN and POLLHUP are not mutually exclusive. Currently the following
seems possible: the child writes 3K to its stdout or stderr pipe, and
immediately closes it. We get POLLIN|POLLHUP (I'm not sure that's possible
on Linux, but SUSv4 seems to allow it). We read 1K and throw away the
rest.
When poll() returns and we're about to check the /revents/ member in a
given array element, let's map all the revents bits to two (independent)
ideas: "let's attempt to read()", and "let's attempt to write()". This
should cover all errors, EOFs, and normal conditions; the read()/write()
call should report any pending error.
Under this approach, both POLLHUP and POLLERR are mapped to "needs read()"
if we're otherwise prepared for POLLIN. POLLERR also maps to "needs
write()" if we're otherwise prepared for POLLOUT. The rest of the mappings
(POLLPRI etc.) would be easy, but probably useless for pipes.
Additionally, SUSv4 doesn't appear to forbid POLLIN|POLLERR (or
POLLOUT|POLLERR) set simultaneously. One could argue that the read() or
write() call would return without blocking in these cases (with an error),
so POLLIN / POLLOUT would be justified beside POLLERR.
The code now penalizes POLLIN|POLLERR differently from plain POLLERR. The
former (ie. read() returning -1) is terminal and we jump to cleanup, while
plain POLLERR masks only the affected file descriptor for the future.
Let's unify those.
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
This makes use of the QEMU guest agent to implement the
virDomainShutdownFlags and virDomainReboot APIs. With
no flags specified, it will prefer to use the agent, but
fallback to ACPI. Explicit choice can be made by using
a suitable flag
* src/qemu/qemu_driver.c: Wire up use of agent
Add a new API virDomainShutdownFlags and define:
VIR_DOMAIN_SHUTDOWN_DEFAULT = 0,
VIR_DOMAIN_SHUTDOWN_ACPI_POWER_BTN = (1 << 0),
VIR_DOMAIN_SHUTDOWN_GUEST_AGENT = (1 << 1),
Also define some flags for the reboot API
VIR_DOMAIN_REBOOT_DEFAULT = 0,
VIR_DOMAIN_REBOOT_ACPI_POWER_BTN = (1 << 0),
VIR_DOMAIN_REBOOT_GUEST_AGENT = (1 << 1),
Although these two APIs currently have the same flags, using
separate enums allows them to expand separately in the future.
Add stub impls of the new API for all existing drivers
There is now a standard QEMU guest agent that can be installed
and given a virtio serial channel
<channel type='unix'>
<source mode='bind' path='/var/lib/libvirt/qemu/f16x86_64.agent'/>
<target type='virtio' name='org.qemu.guest_agent.0'/>
</channel>
The protocol that runs over the guest agent is JSON based and
very similar to the JSON monitor. We can't use exactly the same
code because there are some odd differences in the way messages
and errors are structured. The qemu_agent.c file is based on
a combination and simplification of qemu_monitor.c and
qemu_monitor_json.c
* src/qemu/qemu_agent.c, src/qemu/qemu_agent.h: Support for
talking to the agent for shutdown
* src/qemu/qemu_domain.c, src/qemu/qemu_domain.h: Add thread
helpers for talking to the agent
* src/qemu/qemu_process.c: Connect to agent whenever starting
a guest
* src/qemu/qemu_monitor_json.c: Make variable static
When converting a linear enum to a string, we have checks in
place in the VIR_ENUM_IMPL macro to ensure that there is one
string for every value, which lets us quickly flag if a user
added a value but forgot to add a counterpart string. However,
this only works if we use the _LAST marker.
* cfg.mk (sc_require_enum_last_marker): New syntax check.
* src/conf/domain_conf.h (virDomainSnapshotState): Add new marker.
* src/conf/domain_conf.c (virDomainSnapshotState): Fix offender.
* src/qemu/qemu_monitor_json.c (qemuMonitorWatchdogAction)
(qemuMonitorIOErrorAction, qemuMonitorGraphicsAddressFamily):
Likewise.
* src/util/virtypedparam.c (virTypedParameter): Likewise.
Although this is a public API break, it only affects users that
were compiling against *_LAST values, and can be trivially
worked around without impacting compilation against older
headers, by the user defining VIR_ENUM_SENTINELS before using
libvirt.h. It is not an ABI break, since enum values do not
appear as .so entry points. Meanwhile, it prevents users from
using non-stable enum values without explicitly acknowledging
the risk of doing so.
See this list discussion:
https://www.redhat.com/archives/libvir-list/2012-January/msg00804.html
* include/libvirt/libvirt.h.in: Hide all sentinels behind
LIBVIRT_ENUM_SENTINELS, and add missing sentinels.
* src/internal.h (VIR_DEPRECATED): Allow inclusion after
libvirt.h.
(LIBVIRT_ENUM_SENTINELS): Expose sentinels internally.
* daemon/libvirtd.h: Use the sentinels.
* src/remote/remote_protocol.x (includes): Don't expose sentinels.
* python/generator.py (enum): Likewise.
* tests/cputest.c (cpuTestCompResStr): Silence compiler warning.
* tools/virsh.c (vshDomainStateReasonToString)
(vshDomainControlStateToString): Likewise.
While we still don't want to enable gcc's new -Wformat-literal
warning, I found a rather easy case where the warning could be
reduced, by getting rid of obsolete error-reporting practices.
This is the last place where we were passing the (unused) net
and conn arguments for constructing an error.
* src/util/virterror_internal.h (virErrorMsg): Delete prototype.
(virReportError): Delete macro.
* src/util/virterror.c (virErrorMsg): Make static.
* src/libvirt_private.syms (virterror_internal.h): Drop export.
* src/util/conf.c (virConfError): Convert to macro.
(virConfErrorHelper): New function, and adjust error calls.
* src/xen/xen_hypervisor.c (virXenErrorFunc): Delete.
(xenHypervisorGetSchedulerType)
(xenHypervisorGetSchedulerParameters)
(xenHypervisorSetSchedulerParameters)
(xenHypervisorDomainBlockStats)
(xenHypervisorDomainInterfaceStats)
(xenHypervisorDomainGetOSType)
(xenHypervisorNodeGetCellsFreeMemory, xenHypervisorGetVcpus):
Update callers.
Reusing common code makes things smaller; it also buys us some
additional safety, such as now rejecting duplicate parameters
during a set operation.
* src/qemu/qemu_driver.c (qemuDomainSetBlkioParameters)
(qemuDomainSetMemoryParameters, qemuDomainSetNumaParameters)
(qemuSetSchedulerParametersFlags)
(qemuDomainSetInterfaceParameters, qemuDomainSetBlockIoTune)
(qemuDomainGetBlkioParameters, qemuDomainGetMemoryParameters)
(qemuDomainGetNumaParameters, qemuGetSchedulerParametersFlags)
(qemuDomainBlockStatsFlags, qemuDomainGetInterfaceParameters)
(qemuDomainGetBlockIoTune): Use new helpers.
* src/esx/esx_driver.c (esxDomainSetSchedulerParametersFlags)
(esxDomainSetMemoryParameters)
(esxDomainGetSchedulerParametersFlags)
(esxDomainGetMemoryParameters): Likewise.
* src/libxl/libxl_driver.c
(libxlDomainSetSchedulerParametersFlags)
(libxlDomainGetSchedulerParametersFlags): Likewise.
* src/lxc/lxc_driver.c (lxcDomainSetMemoryParameters)
(lxcSetSchedulerParametersFlags, lxcDomainSetBlkioParameters)
(lxcDomainGetMemoryParameters, lxcGetSchedulerParametersFlags)
(lxcDomainGetBlkioParameters): Likewise.
* src/test/test_driver.c (testDomainSetSchedulerParamsFlags)
(testDomainGetSchedulerParamsFlags): Likewise.
* src/xen/xen_hypervisor.c (xenHypervisorSetSchedulerParameters)
(xenHypervisorGetSchedulerParameters): Likewise.
Preparation for another patch that refactors common patterns
into the new file for fewer lines of code overall.
* src/util/util.h (virTypedParameterArrayClear): Move...
* src/util/virtypedparam.h: ...to new file.
(virTypedParameterArrayValidate, virTypedParameterAssign): New
prototypes.
* src/util/util.c (virTypedParameterArrayClear): Likewise.
* src/util/virtypedparam.c: New file.
* po/POTFILES.in: Mark file for translation.
* src/Makefile.am (UTIL_SOURCES): Build it.
* src/libvirt_private.syms (util.h): Split...
(virtypedparam.h): to new section.
(virkeycode.h): Sort.
* daemon/remote.c: Adjust callers.
* tools/virsh.c: Likewise.
Based on qemu changes made in commits ae523427 and 659ded58.
* src/lxc/lxc_driver.c (lxcSetSchedulerParametersFlags)
(lxcGetSchedulerParametersFlags, lxcDomainSetBlkioParameters)
(lxcDomainGetBlkioParameters): Use helpers.
(lxcDomainSetBlkioParameters): Allow setting live and config at
once.
We had a memory leak on a very arcane OOM situation (unlikely to ever
hit in practice, but who knows if libvirt.so would ever be linked
into some other program that exhausts all thread-local storage keys?).
I found it by code inspection, while analyzing a valgrind report
generated by Alex Jia.
* src/util/threads.h (virThreadLocalSet): Alter signature.
* src/util/threads-pthread.c (virThreadHelper): Reduce allocation
lifetime.
(virThreadLocalSet): Detect failure.
* src/util/threads-win32.c (virThreadLocalSet): Likewise.
(virCondWait): Fix caller.
* src/util/virterror.c (virLastErrorObject): Likewise.
The RPC generator transforms methods matching certain
patterns like 'id' or 'uuid', etc but does not anchor
its matches to the end of the word. So if a method
contains 'id' in the middle (eg virIdentity) then the
RPC generator munges that.
* src/rpc/gendispatch.pl: Anchor matches
To avoid a namespace clash with forthcoming identity APIs,
rename the virNet*GetLocalIdentity() APIs to have the form
virNet*GetUNIXIdentity()
* daemon/remote.c, src/libvirt_private.syms: Update
for renamed APIs
* src/rpc/virnetserverclient.c, src/rpc/virnetserverclient.h,
src/rpc/virnetsocket.c, src/rpc/virnetsocket.h: s/LocalIdentity/UNIXIdentity/
There was missing capability for blkiotune and thus specifying these
settings caused libvirt to run qemu with invalid parameters and then
reporting qemu error instead of the standard libvirt one. The support
for blkiotune setting was added in upstream qemu repo under commit
0563e191516289c9d2f282a8c50f2eecef2fa773.
Given an LXC guest with a root filesystem path of
/export/lxc/roots/helloworld/root
During startup, we will pivot the root filesystem to end up
at
/.oldroot/export/lxc/roots/helloworld/root
We then try to open
/.oldroot/export/lxc/roots/helloworld/root/dev/pts
Now consider if '/export/lxc' is an absolute symlink pointing
to '/media/lxc'. The kernel will try to open
/media/lxc/roots/helloworld/root/dev/pts
whereas it should be trying to open
/.oldroot//media/lxc/roots/helloworld/root/dev/pts
To deal with the fact that the root filesystem can be moved,
we need to resolve symlinks in *any* part of the filesystem
source path.
* src/libvirt_private.syms, src/util/util.c,
src/util/util.h: Add virFileResolveAllLinks to resolve
all symlinks in a path
* src/lxc/lxc_container.c: Resolve all symlinks in filesystem
paths during startup
pciTrySecondaryBusReset checks if there is active device on the
same bus, however, qemu driver doesn't maintain an effective
list for the inactive devices, and it passes meaningless argument
for parameter "inactiveDevs". e.g. (qemuPrepareHostdevPCIDevices)
if (!(pcidevs = qemuGetPciHostDeviceList(hostdevs, nhostdevs)))
return -1;
..skipped...
if (pciResetDevice(dev, driver->activePciHostdevs, pcidevs) < 0)
goto reattachdevs;
NB, the "pcidevs" used above are extracted from domain def, and
thus one won't be able to attach a device of which bus has other
device even detached from host (nodedev-detach). To see more
details of the problem:
RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=773667
This patch is to resolve the problem by introducing an inactive
PCI device list (just like qemu_driver->activePciHostdevs), and
the whole logic is:
* Add the device to inactive list during nodedev-dettach
* Remove the device from inactive list during nodedev-reattach
* Remove the device from inactive list during attach-device
(for non-managed device)
* Add the device to inactive list after detach-device, only
if the device is not managed
With the above, we have a sufficient inactive PCI device list, and thus
we can use it for pciResetDevice. e.g.(qemuPrepareHostdevPCIDevices)
if (pciResetDevice(dev, driver->activePciHostdevs,
driver->inactivePciHostdevs) < 0)
goto reattachdevs;
This introduces new attribute wrpolicy with only supported
value as immediate. This will be an optional
attribute with no defaults. This helps specify whether
to skip the host page cache.
When wrpolicy is specified, meaning when wrpolicy=immediate
a writeback is explicitly initiated for the dirty pages in
the host page cache as part of the guest file write operation.
Usage:
<filesystem type='mount' accessmode='passthrough'>
<driver type='path' wrpolicy='immediate'/>
<source dir='/export/to/guest'/>
<target dir='mount_tag'/>
</filesystem>
Currently this only works with type='mount' for the QEMU/KVM driver.
Signed-off-by: Deepak C Shetty <deepakcs@linux.vnet.ibm.com>
In the past we didn't reserve 0:0:2.0 PCI address if there was no video
device assigned to a domain, which made it impossible to add a video
device later on. So we fixed it (commit v0.9.0-37-g7b2cac1) by always
reserving that address. However, that breaks existing domains without
video devices that already have another device assigned to the
problematic address.
This patch reserves address 0:0:2.0 only in case it was not explicitly
assigned to another device, which means libvirt will try to keep this
address free and will not automatically assign it new devices. But
existing domains for which older libvirt already assigned the address to
a non-video device will keep working as they used to work before 0.9.1.
Moreover, users who want to create a domain without a video device and
use its address for another device may do so by explicitly configuring
the PCI address in domain XML.
There are several reasons for doing this:
- the CPU specification is out of libvirt's control so we cannot
guarantee stable guest ABI
- not every feature of a CPU may actually work as expected when
advertised directly to a guest
- migration between two machines with exactly the same CPU may work but
no guarantees can be made
- this mode is not supported and its use is at one's own risk
VIR_DOMAIN_XML_UPDATE_CPU flag for virDomainGetXMLDesc may be used to
get updated custom mode guest CPU definition in case it depends on host
CPU. This patch implements the same behavior for host-model and
host-passthrough CPU modes.
The mode can be either of "custom" (default), "host-model",
"host-passthrough". The semantics of each mode is described in the
following examples:
- guest CPU is a default model with specified topology:
<cpu>
<topology sockets='1' cores='2' threads='1'/>
</cpu>
- guest CPU matches selected model:
<cpu mode='custom' match='exact'>
<model>core2duo</model>
</cpu>
- guest CPU should be a copy of host CPU as advertised by capabilities
XML (this is a short cut for manually copying host CPU specification
from capabilities to domain XML):
<cpu mode='host-model'/>
In case a hypervisor does not support the exact host model, libvirt
automatically falls back to a closest supported CPU model and
removes/adds features to match host. This behavior can be disabled by
<cpu mode='host-model'>
<model fallback='forbid'/>
</cpu>
- the same as previous returned by virDomainGetXMLDesc with
VIR_DOMAIN_XML_UPDATE_CPU flag:
<cpu mode='host-model' match='exact'>
<model fallback='allow'>Penryn</model> --+
<vendor>Intel</vendor> |
<topology sockets='2' cores='4' threads='1'/> + copied from
<feature policy='require' name='dca'/> | capabilities XML
<feature policy='require' name='xtpr'/> |
... --+
</cpu>
- guest CPU should be exactly the same as host CPU even in the aspects
libvirt doesn't model (such domain cannot be migrated unless both
hosts contain exactly the same CPUs):
<cpu mode='host-passthrough'/>
- the same as previous returned by virDomainGetXMLDesc with
VIR_DOMAIN_XML_UPDATE_CPU flag:
<cpu mode='host-passthrough' match='minimal'>
<model>Penryn</model> --+ copied from caps
<vendor>Intel</vendor> | XML but doesn't
<topology sockets='2' cores='4' threads='1'/> | describe all
<feature policy='require' name='dca'/> | aspects of the
<feature policy='require' name='xtpr'/> | actual guest CPU
... --+
</cpu>
In case a hypervisor doesn't support the exact CPU model requested by a
domain XML, we automatically fallback to a closest CPU model the
hypervisor supports (and make sure we add/remove any additional features
if needed). This patch adds 'fallback' attribute to model element, which
can be used to disable this automatic fallback.
Commit 5d784bd6d7 was a nice attempt to
clarify the semantics by requiring domain name from dxml to either match
original name or dname. However, setting dxml domain name to dname
doesn't really work since destination host needs to know the original
domain name to be able to use it in migration cookies. This patch
requires domain name in dxml to match the original domain name. The
change should be safe and backward compatible since migration would fail
just a bit later in the process.
There are three address validation routines that do nothing:
virDomainDeviceDriveAddressIsValid()
virDomainDeviceUSBAddressIsValid()
virDomainDeviceVirtioSerialAddressIsValid()
Remove them, and replace their call sites with "1" which is what they
currently return. In some cases this means we can remove an entire
if block.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
We can't call qemuCapsExtractVersionInfo() from test code, because it
expects to be able to call the emulator, and for testing we have fake
emulators that can't be executed. For that reason qemuxml2argvtest.c
doesn't call qemuDomainAssignPCIAddresses(), instead it open codes its
own version.
That means we can't call qemuDomainAssignAddresses() from the test code,
instead we need to manually call qemuDomainAssignSpaprVioAddresses().
Also add logic to cope with qemuDomainAssignSpaprVioAddresses() failing,
so that we can write a test that checks for a known failure in there.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
KVM will be able to use a PCI SCSI controller even on POWER. Let
the user specify the vSCSI controller by other means than a default.
After this patch, the QEMU driver will actually look at the model
and reject anything but auto, lsilogic and ibmvscsi.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit d09f6ba5fe introduced a regression in event
registration. virDomainEventCallbackListAddID() will only return a positive
integer if the type of event being registered is VIR_DOMAIN_EVENT_ID_LIFECYCLE.
For other event types, 0 is always returned on success. This has the
unfortunate side effect of not enabling remote event callbacks because
remoteDomainEventRegisterAny() uses the return value from the local call to
determine if an event callback needs to be registered on the remote end.
Make sure virDomainEventCallbackListAddID() returns the callback count for the
eventID being registered.
Signed-off-by: Adam Litke <agl@us.ibm.com>
The new introduced optional attribute "copy_on_read</code> controls
whether to copy read backing file into the image file. The value can
be either "on" or "off". Copy-on-read avoids accessing the same backing
file sectors repeatedly and is useful when the backing file is over a
slow network. By default copy-on-read is off.
Earlier, when the number of vcpus was greater than the topology allowed,
libvirt didn't raise an error and continued, resulting in running qemu
with parameters making no sense. Even though qemu did not report any
error itself, the number of vcpus was set to maximum allowed by the
topology.
Detected by Coverity. Although unlikely, if we are ever started
with stdin closed, we could reach a situation where we open a
uuid file but then fail to close it, making that file the new
stdin for the rest of the process.
* src/util/uuid.c (getDMISystemUUID): Allow for stdin.
Currently the LXC controller attempts to deal with EOF on a
tty by spawning a thread to do an edge triggered epoll_wait().
This avoids the normal event loop spinning on POLLHUP. There
is a subtle mistake though - even after seeing POLLHUP on a
master PTY, it is still perfectly possible & valid to write
data to the PTY. There is a buffer that can be filled with
data, even when no client is present.
The second mistake is that the epoll_wait() thread was not
looking for the EPOLLOUT condition, so when a new client
connects to the LXC console, it had to explicitly send a
character before any queued output would appear.
Finally, there was in fact no need to spawn a new thread to
deal with epoll_wait(). The epoll file descriptor itself
can be poll()'d on normally.
This patch attempts to deal with all these problems.
- The blocking epoll_wait() thread is replaced by a poll
on the epoll file descriptor which then does a non-blocking
epoll_wait() to handle events
- Even if POLLHUP is seen, we continue trying to write
any pending output until getting EAGAIN from write.
- Once write returns EAGAIN, we modify the epoll event
mask to also look for EPOLLOUT
* src/lxc/lxc_controller.c: Avoid stalled I/O upon
connected to an LXC console
If client stream does not have any data to sink and neither received
EOF, a dummy packet is sent to the daemon signalising client is ready to
sink some data. However, after we added event loop to client a race may
occur:
Thread 1 calls virNetClientStreamRecvPacket and since no data are cached
nor stream has EOF, it decides to send dummy packet to server which will
sent some data in turn. However, during this decision and actual message
exchange with server -
Thread 2 receives last stream data from server. Therefore an EOF is set
on stream and if there is a call waiting (which is not yet) it is woken
up. However, Thread 1 haven't sent anything so far, so there is no call
to be woken up. So this thread sent dummy packet to daemon, which
ignores that as no stream is associated with such packet and therefore
no reply will ever come.
This race causes client to hang indefinitely.
QEMU does not support security_model for anything but 'path' fs driver type.
Currently in libvirt, when security_model ( accessmode attribute) is not
specified it auto-generates it irrespective of the fs driver type, which
can result in a qemu error for drivers other than path. This patch ensures
that the qemu cmdline is correctly generated by taking into account the
fs driver type.
Signed-off-by: Deepak C Shetty <deepakcs@linux.vnet.ibm.com>
If a system has 64 or more VF's, it is quite tedious to mention each VF
in the interface pool.
The following modification will implicitly create an interface pool from
the SR-IOV PF.