Commit Graph

6932 Commits

Author SHA1 Message Date
Rob Bradford
064c1e2c8b virtio-devices: Avoid clashing names in imports
Don't import via glob to avoid (unused) objects colliding in the
namespace. This fixes a beta clippy issue.

Signed-off-by: Rob Bradford <rbradford@rivosinc.com>
2023-07-05 09:36:08 -07:00
Rob Bradford
15b9d14876 net_gen: Avoid clashing names in imports
Remove use of glob imports to fix an issue detected by clippy.

Signed-off-by: Rob Bradford <rbradford@rivosinc.com>
2023-07-05 09:36:08 -07:00
Alyssa Ross
9da363e79b vmm: ignore and warn TAP FDs sent in vm.create
This does the same thing as df2a7c17 ("vmm: Ignore and warn TAP FDs
sent via the HTTP request body"), but for the vm.create endpoint,
which also previously would accept file descriptors in the body, and
try to use whatever fd occupied that number as a TAP device.

Signed-off-by: Alyssa Ross <hi@alyssa.is>
2023-07-05 09:36:08 -07:00
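The "ignore and warn" behaviour described in this commit can be sketched roughly as follows. This is only an illustration: `NetConfigSketch` and `strip_body_fds` are hypothetical names, not the types or functions used in the vmm crate, and the warning is printed with `eprintln!` rather than through the project's logger.

```rust
/// Hypothetical stand-in for a network device config parsed from a request body.
pub struct NetConfigSketch {
    pub id: Option<String>,
    /// FD numbers found in the body. They are only integers here: nothing
    /// guarantees that they refer to TAP devices in this process.
    pub fds: Option<Vec<i32>>,
}

/// Drop any FD numbers that arrived in the request body and warn about them.
/// Returns how many device configs had FDs removed.
pub fn strip_body_fds(nets: &mut [NetConfigSketch]) -> usize {
    let mut ignored = 0;
    for net in nets.iter_mut() {
        // `take()` leaves `None` behind, so the numbers can never be used later.
        if net.fds.take().is_some() {
            eprintln!("warning: ignoring TAP FDs sent via the request body");
            ignored += 1;
        }
    }
    ignored
}
```

Clearing the field up front, instead of checking it at the point of use, keeps a stray FD number from ever reaching the code that opens TAP devices.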
Alyssa Ross
bbfd810c3b vmm: allow restart_syscall() in PTY process
This can be triggered by debugging cloud-hypervisor using gdb, or
probably if the process is suspended and restarted.

Fixes: https://github.com/cloud-hypervisor/cloud-hypervisor/issues/5489
Signed-off-by: Alyssa Ross <hi@alyssa.is>
2023-07-05 09:36:08 -07:00
Jianyong Wu
8fb86d2284 vfio: fix vfio device fail to initialize issue for 64k page size
Currently, the vfio device fails to initialize because the msix-cap
region in the BAR is mapped as an RW region.

To resolve the initialization issue, this commit avoids mapping the
msix-cap region in the BAR. However, this solution introduces another
problem: aligning the msix table offset in the BAR to the page size
may cause overlap with the MMIO RW region, leading to reduced
performance. Enlarging the entire region in the BAR and relocating the
msix table to achieve page size alignment overcomes this problem.

Fixes: #5292
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
2023-07-05 09:36:08 -07:00
Jianyong Wu
797110cca1 vm-allocator: Add page size related functions
To avoid code duplication, extract page-related functions to their
own module and add utility functions for manipulating addresses
related to page sizes.

Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
2023-07-05 09:36:08 -07:00
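Page-size helpers of the kind this commit describes usually reduce to a few bit operations. The sketch below is an assumption about their general shape, not the actual vm-allocator code: the function names are hypothetical, and the page size is passed in explicitly instead of being queried from the system.

```rust
/// Round `addr` down to a multiple of `page_size`.
/// Assumes `page_size` is a non-zero power of two.
pub fn align_down(addr: u64, page_size: u64) -> u64 {
    debug_assert!(page_size.is_power_of_two());
    addr & !(page_size - 1)
}

/// Round `addr` up to a multiple of `page_size`.
/// Assumes `addr + page_size - 1` does not overflow `u64`.
pub fn align_up(addr: u64, page_size: u64) -> u64 {
    debug_assert!(page_size.is_power_of_two());
    align_down(addr + page_size - 1, page_size)
}

/// Check whether `addr` is a multiple of `page_size`.
pub fn is_aligned(addr: u64, page_size: u64) -> bool {
    debug_assert!(page_size.is_power_of_two());
    addr & (page_size - 1) == 0
}
```

With a 64k page size, for example, `align_down(0x1_2345, 0x1_0000)` gives `0x1_0000` and `align_up(0x1_2345, 0x1_0000)` gives `0x2_0000`.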
Bo Chen
19ca5c0b84 vmm: Clarify memory regions are required to be page-size aligned
Signed-off-by: Bo Chen <chen.bo@intel.com>
2023-07-05 09:36:08 -07:00
Bo Chen
9fe9b8504d arch: Refactor the way of creating memory mapping
This patch clarifies the assumptions we have regarding the guest address
space layout while creating memory mapping in E820 on x86_64 and fdt on
aarch64. It also explicitly checks these assumptions and reports
errors if they do not hold.

Signed-off-by: Bo Chen <chen.bo@intel.com>
2023-07-05 09:36:08 -07:00
Yu Li
4fc3dd5004 vmm: memory_manager: align down the rest space of ram_region
This commit renames `ram_region_sub_size` to `ram_region_available_size`
and aligns its value down to the default page size or hugepage size of
the current memory zone, which prevents the memory zone from being
split into misaligned parts.  If the available size of a ram region is
zero, the region is marked as consumed even if it has unused space.

Note that there are two methods to use hugepages.

1. Specify `hugepages` for `memory` or `memory-zone`. If
   `hugepage_size` is not specified, the value is obtained via `statfs`
   on `/dev/hugepages`.
2. Specify a `file` in hugetlbfs for `memory-zone`. The hugepage size
   is also obtained via `statfs` on the file.

The value for alignment will be the hugepage size if this memory zone
is using hugepages, otherwise the value will be the default page size
of the system.

Fixes: #5463

Signed-off-by: Yu Li <liyu.yukiteru@bytedance.com>
2023-07-05 09:36:08 -07:00
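The alignment rule described above can be sketched as follows. The struct and function names are hypothetical, and the detected hugepage size is passed in as a plain argument instead of being read with `statfs`, so this shows only the arithmetic, not the memory manager's real code.

```rust
/// Hypothetical summary of how a memory zone is backed.
pub struct ZoneBacking {
    pub hugepages: bool,
    /// Explicit `hugepage_size`, if the user specified one.
    pub hugepage_size: Option<u64>,
}

/// Pick the alignment for a zone: the hugepage size when hugepages are in
/// use, otherwise the default page size of the system.
pub fn zone_alignment(zone: &ZoneBacking, default_page_size: u64, detected_hugepage_size: u64) -> u64 {
    if zone.hugepages {
        zone.hugepage_size.unwrap_or(detected_hugepage_size)
    } else {
        default_page_size
    }
}

/// Space left in a RAM region, aligned down to `alignment` (must be non-zero).
/// A result of zero means the region is treated as consumed, even if a
/// misaligned tail remains unused.
pub fn ram_region_available_size(region_size: u64, used: u64, alignment: u64) -> u64 {
    let remaining = region_size.saturating_sub(used);
    remaining - remaining % alignment
}
```

For a 5 MiB region and 2 MiB hugepages, 4 MiB is available at first. Once those 4 MiB are used, the leftover 1 MiB aligns down to zero and the region counts as consumed.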
Yu Li
6bc6365c35 arch: let arch_memory_regions return all available regions
The previous `arch_memory_regions` function provided memory regions
for the specified memory size and filled each region completely before
using the next one. Sometimes there is no need to fill up the previous
region, e.g., when it should be aligned with the hugepage size.

This commit makes the `arch_memory_regions` function take no
parameters and return the maximum available regions, so the memory
manager can use them on demand.

Fixes: #5463

Signed-off-by: Yu Li <liyu.yukiteru@bytedance.com>
2023-07-05 09:36:08 -07:00
Yu Li
e139cdfd69 arch: create memory mapping by the actual memory info
The original code did not consider that the previous memory region
might not be full and always set it to the maximum size.

This commit fixes this problem by creating memory mappings based on
the actual memory details in both E820 on x86_64 and fdt on aarch64.

Fixes: #5463

Signed-off-by: Yu Li <liyu.yukiteru@bytedance.com>
2023-07-05 09:36:08 -07:00
Yu Li
541de8b757 logger: use write with \r\n instead of writeln
The device manager sets the tty or pty to raw mode, so every `\n` is
emitted as a bare LF without CR, which makes the output difficult to
read.

This commit solves it by using `write` with `\r\n` instead of
`writeln`, which can print CR and LF explicitly.

Signed-off-by: Yu Li <liyu.yukiteru@bytedance.com>
2023-07-05 09:36:08 -07:00
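The difference between the two macros is small but visible on a raw-mode terminal. A minimal sketch, where `log_line` is a hypothetical helper and not the project's logger:

```rust
use std::io::{self, Write};

/// Write one log line terminated by an explicit CR LF pair.
/// `writeln!` would append only `\n`, which a terminal in raw mode does not
/// translate into a carriage return plus line feed.
pub fn log_line<W: Write>(out: &mut W, msg: &str) -> io::Result<()> {
    write!(out, "{}\r\n", msg)
}
```

Writing to any `Write` implementation, such as a `Vec<u8>`, makes it easy to check that the bytes end in `\r\n`.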
Yu Li
184dac70a0 vmm: use unwrap_or instead of match for prefault
Signed-off-by: Yu Li <liyu.yukiteru@bytedance.com>
2023-07-05 09:36:08 -07:00
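The refactoring named in this commit title follows a common Rust pattern. The sketch below uses hypothetical function and parameter names; it only shows that the two forms are equivalent, not the actual prefault code.

```rust
/// Before: an explicit `match` on the optional override.
pub fn resolve_prefault_match(requested: Option<bool>, zone_default: bool) -> bool {
    match requested {
        Some(value) => value,
        None => zone_default,
    }
}

/// After: the same logic expressed with `Option::unwrap_or`.
pub fn resolve_prefault(requested: Option<bool>, zone_default: bool) -> bool {
    requested.unwrap_or(zone_default)
}
```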
Jianyong Wu
022b489e7b arch: x86_64: Populate the APIC Id
Program the APIC ID (CPUID leaf 0x1 EBX) with the CPU id. This resolves
an issue where the EDKII firmware expects the APIC ID to vary per-CPU.

Fixes: #5475
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
2023-07-05 09:36:08 -07:00
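In CPUID leaf 0x1, the initial APIC ID occupies bits 31:24 of EBX. A minimal sketch of patching that field per vCPU is shown below; the function names are hypothetical, and the real change works on the hypervisor's CPUID entry structures and not on a bare `u32`.

```rust
/// Return `ebx` with bits 31:24 (the initial APIC ID field of CPUID
/// leaf 0x1) replaced by `cpu_id`, leaving the lower 24 bits untouched.
pub fn set_initial_apic_id(ebx: u32, cpu_id: u8) -> u32 {
    (ebx & 0x00ff_ffff) | (u32::from(cpu_id) << 24)
}

/// Read the initial APIC ID back out of an EBX value.
pub fn initial_apic_id(ebx: u32) -> u8 {
    (ebx >> 24) as u8
}
```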
Alyssa Ross
399e2f9f7d vmm, virtio-devices: allow mremap for consoles
SerialBuffer uses VecDeque::extend, which calls realloc, with a
maximum buffer size of 1 MiB.  Starting at allocation sizes of
128 KiB, musl's mallocng allocator will use mremap for the allocation.
Since this was not permitted by the seccomp rules, heavy write load
could crash cloud-hypervisor with a seccomp failure.  (Encountered
using virtio-console, but I don't see any reason it wouldn't happen
for the legacy serial device too.)

Signed-off-by: Alyssa Ross <hi@alyssa.is>
2023-07-05 09:36:08 -07:00
Rafael Mendonca
22cc96494f main: Fix error propagation if starting the VM fails
Commit 21d40d7 ("main: reset tty if starting the VM fails") changed
start_vmm() to join the vmm thread if an error happens after the vmm
thread is started. The implementation put all the error-prone code that
is run after the vmm is started in a closure, to be able to always join
the vmm thread, regardless of any error happening. However, it missed
propagating the error that might happen inside the closure back to the
main function, after joining the vmm thread.

For some cmd line options, the above issue inhibits proper error
reporting when starting a VM with invalid commands, as many parameters
are parsed after the vmm is started, thus if such parsing fails, no
error will be reported back to the user.

See: #5435
Fixes: 21d40d7 ("main: reset tty if starting the VM fails")
Signed-off-by: Rafael Mendonca <rafaelmendsr@gmail.com>
2023-07-05 09:36:08 -07:00
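The pattern described in this commit message, running the error-prone steps in a closure, always joining the thread, and then returning the closure's result, can be sketched as follows. The function signature, the `config_ok` flag, and the string errors are simplifications for illustration and do not match the real `start_vmm()`.

```rust
use std::thread;

/// Start a stand-in "vmm" thread, run the fallible setup, join the thread,
/// and only then report the setup result.
pub fn start_vmm(config_ok: bool) -> Result<(), String> {
    // Stand-in for the vmm thread. The real thread runs an event loop.
    let vmm_thread = thread::spawn(|| {});

    // Everything that can fail after the thread has started goes in here,
    // so the join below happens on both the success and the error path.
    let result = (|| -> Result<(), String> {
        if !config_ok {
            return Err("invalid VM configuration".to_string());
        }
        Ok(())
    })();

    vmm_thread
        .join()
        .map_err(|_| "vmm thread panicked".to_string())?;

    // The step that was missing: hand the closure's error back to the caller.
    result
}
```

If the final `result` were dropped and `Ok(())` returned, a parsing failure inside the closure would go unreported, which matches the symptom described above.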
Bo Chen
f98402ec15 vmm: Allocate guest memory address space before TDX initialization
The refactoring on deferring address space allocation (#5169) broke TDX,
as TDX initialization needs to access guest memory for encryption and
measurement of guest pages.

Signed-off-by: Bo Chen <chen.bo@intel.com>
2023-07-05 09:36:08 -07:00
Jianyong Wu
acc54ade7b vfio: align memory region size and address to PAGE_SIZE
In the current implementation, the memory region used in vfio is assumed
to be aligned to 4k, which may cause errors when the PAGE_SIZE is not 4k.
On Arm, for example, it can be 16k or 64k.

Remove this assumption and align the memory resources used by vfio to
PAGE_SIZE so that vfio can run on a host with a 64k PAGE_SIZE.

Fixes: #5292
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
2023-07-05 09:36:08 -07:00
Alyssa Ross
0ebbb3f8a2 vmm: allow getdents64 in seccomp filter
This is used on older kernels where close_range() is not available.

Signed-off-by: Alyssa Ross <hi@alyssa.is>
Fixes: 505f4dfa ("vmm: close all unused fds in sigwinch listener")
2023-07-05 09:36:08 -07:00
Anatol Belski
95511287ec tests: Enable topology integration tests under mshv
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
2023-07-05 09:36:08 -07:00
Anatol Belski
5a3af30e6a seccomp: Add filter entry for MSHV_VP_REGISTER_INTERCEPT_RESULT
Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
2023-07-05 09:36:08 -07:00
Anatol Belski
034b48faf7 mshv: Pass topology explicitly while constructing cpuid
Unlike KVM, there's no internal handling for topology under MSHV. Thus,
if no topology has been passed during the CH launch, use the boot CPUs
count to construct the topology struct.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
2023-07-05 09:36:08 -07:00
Anatol Belski
ba3e02ce86 hypervisor: mshv: Implement set_cpuid2 call
Passing the CPUID leafs with the topology is integrated into the common
mechanism of setting and patching CPUID in Cloud Hypervisor. All the
CPUID values will be passed to the hypervisor through the register
intercept call.

Signed-off-by: Anatol Belski <anbelski@linux.microsoft.com>
2023-07-05 09:36:08 -07:00
Alyssa Ross
5492259af9 main: reset tty if starting the VM fails
When I refactored this to centralise resetting the tty into
DeviceManager::drop, I tested that the tty was reset if an error
happened on the vmm thread, but not on the main thread.  It turns out
that if an error happened on the main thread, the process would just
exit, so drop handlers on other threads wouldn't get run.

To fix this, I've changed start_vmm() to write to the VMM's exit
eventfd and then join the thread if an error happens after the vmm
thread is started.

Fixes: b6feae0a ("vmm: only touch the tty flags if it's being used")
Signed-off-by: Alyssa Ross <hi@alyssa.is>
2023-07-05 09:36:08 -07:00
Alyssa Ross
cc1254d5e1 vmm: reset to the original termios
Previously, we used two different functions for configuring ttys.
vmm_sys_util::terminal::Terminal::set_raw_mode() was used to configure
stdio ttys, and cfmakeraw() was used to configure ptys created by
cloud-hypervisor.  When I centralized the stdio tty cleanup, I also
switched to using cfmakeraw() everywhere, to avoid duplication.

cfmakeraw clears the OPOST flag, but when we later reset the ttys, we
used vmm_sys_util::terminal::Terminal::set_canon_mode(), which does
not set this flag again.  This meant that the terminal was getting
mostly, but not fully, reset.

To fix this without depending on the implementation of cfmakeraw(),
let's just store the original termios for stdio terminals, and restore
them to exactly the state we found them in when cloud-hypervisor exits.

Fixes: b6feae0a ("vmm: only touch the tty flags if it's being used")
Signed-off-by: Alyssa Ross <hi@alyssa.is>
2023-07-05 09:36:08 -07:00
Rob Bradford
34bb3319d4 hypervisor, vmm: Limit max number of vCPUs to hypervisor maximum
On KVM this is provided by an ioctl; on MSHV it is a constant. Although
there is a HV_MAXIMUM_PROCESSORS constant, the MSHV ioctl API is limited
to u8.

Signed-off-by: Rob Bradford <rbradford@rivosinc.com>
2023-07-05 09:36:08 -07:00
Bo Chen
d530569ac2 build: Release v31.1 (bug fix release)
Signed-off-by: Bo Chen <chen.bo@intel.com>
2023-04-18 16:41:02 -07:00
Bo Chen
75956e64ec tests: Extend '_test_macvtap()' with reboot
In this way, we can cover the scenario where a VM with hotplugged net
device using FDs can work properly with reboot.

Signed-off-by: Bo Chen <chen.bo@intel.com>
2023-04-18 10:47:33 -07:00
Bo Chen
ae646c2a00 vmm: Add valid FDs for TAP devices to 'VmConfig::preserved_fds'
In this way, valid FDs for TAP devices will be closed when the holding
VmConfig instance is destroyed.

Signed-off-by: Bo Chen <chen.bo@intel.com>
2023-04-18 10:47:33 -07:00
Bo Chen
04d3e5bbf5 vmm: Add unit test for 'VmConfig::preserved_fds'
Signed-off-by: Bo Chen <chen.bo@intel.com>
2023-04-18 10:47:33 -07:00
Bo Chen
cbe972659c vmm: Implement Clone and Drop for VmConfig
The custom 'clone' duplicates 'preserved_fds' so that the validation
logic can be safely carried out on the clone of the VmConfig.

The custom 'drop' ensures 'preserved_fds' are safely closed when the
holding VmConfig instance is destroyed.

Signed-off-by: Bo Chen <chen.bo@intel.com>
2023-04-18 10:47:33 -07:00
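The idea of duplicating file descriptors on clone and closing them on drop can be sketched with the standard library's `OwnedFd`. Note the swap: the commit describes custom `Clone` and `Drop` implementations on `VmConfig`, whereas this sketch relies on `OwnedFd`, which closes its descriptor when dropped, and on a fallible `try_clone` instead of `Clone`. `ConfigSketch` and its methods are hypothetical names.

```rust
use std::io;
use std::os::fd::{AsRawFd, OwnedFd, RawFd};

/// Hypothetical config that owns a set of preserved file descriptors.
/// Each `OwnedFd` is closed automatically when the struct is dropped.
pub struct ConfigSketch {
    pub preserved_fds: Vec<OwnedFd>,
}

impl ConfigSketch {
    /// Duplicate every preserved FD so the copy owns independent descriptors.
    /// Dropping the copy (for example after validation) then closes only the
    /// duplicates and leaves the original's descriptors open.
    pub fn try_clone(&self) -> io::Result<Self> {
        let preserved_fds = self
            .preserved_fds
            .iter()
            .map(|fd| fd.try_clone())
            .collect::<io::Result<Vec<OwnedFd>>>()?;
        Ok(Self { preserved_fds })
    }

    /// Raw descriptor numbers, for inspection only.
    pub fn raw_fds(&self) -> Vec<RawFd> {
        self.preserved_fds.iter().map(|fd| fd.as_raw_fd()).collect()
    }
}
```

Because the clone holds different descriptor numbers than the original, validation code can consume or drop the clone without invalidating the descriptors the original still needs.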
Bo Chen
1e4e03d110 vmm: config: Extend 'VmConfig' with 'preserved_fds'
Preserved FDs are the ones that share the same lifetime as their holding
VmConfig instance, such as FDs for creating TAP devices.

Preserved FDs will stay open as long as the holding VmConfig instance is
valid, and will be closed when the holding VmConfig instance is destroyed.

Signed-off-by: Bo Chen <chen.bo@intel.com>
2023-04-18 10:47:33 -07:00
Bo Chen
4876f7550d Revert "vmm: config: Implement Clone for NetConfig"
This reverts commit ea4a95c4f6.

Signed-off-by: Bo Chen <chen.bo@intel.com>
2023-04-18 10:47:33 -07:00
Bo Chen
c0146e3ef1 Revert "vmm: config: Close FDs for TAP devices that are provided to VM"
This reverts commit b14427540b.

Signed-off-by: Bo Chen <chen.bo@intel.com>
2023-04-18 10:47:33 -07:00
Bo Chen
77a205881b Revert "vmm: config: Don't close reserved FDs from NetConfig::drop()"
This reverts commit 0110fb4edc.

Signed-off-by: Bo Chen <chen.bo@intel.com>
2023-04-18 10:47:33 -07:00
Bo Chen
321421c53e Revert "vmm: config: Avoid closing invalid FDs from 'test_net_parsing()'"
This reverts commit 0567def931.

Signed-off-by: Bo Chen <chen.bo@intel.com>
2023-04-18 10:47:33 -07:00
Bo Chen
48a87e699d Revert "vmm: config: Replace use of memfd_create with fd pointing to /dev/null"
This reverts commit 46066d6ae1.

Signed-off-by: Bo Chen <chen.bo@intel.com>
2023-04-18 10:47:33 -07:00
Alyssa Ross
147a800d5d vmm: only touch the tty flags if it's being used
When neither serial nor console are connected to the tty,
cloud-hypervisor shouldn't touch the tty at all.  One way in which
this is annoying is that if I am running cloud-hypervisor without it
using my terminal, I expect to be able to suspend it with ^Z like any
other process, but that doesn't work if it's put the terminal into raw
mode.

Instead of putting the tty into raw mode when a VM is created or
restored, do it when a serial or console device is created.  Since we
now know it can't be put into raw mode until the Vm object is created,
we can move setting it back to canon mode into the drop handler for
that object, which should always be run in normal operation.  We still
also put the tty into canon mode in the SIGTERM / SIGINT handler, but
check whether the tty was actually used, rather than whether stdin is
a tty.  This requires passing on_tty around as an atomic boolean.

I explored more of an abstraction over the tty — having an object that
encapsulated stdout and put the tty into raw mode when initialized and
into canon mode when dropped — but it wasn't practical, mostly due to
the special requirements of the signal handler.  I also investigated
whether the SIGWINCH listener process could be used here, which I
think would have worked but I'm hesitant to involve it in serial
handling as well as console handling.

There's no longer a check for whether the file descriptor is a tty
before setting it into canon mode — it's redundant, because if it's
not a tty it just won't respond to the ioctl.

Tested by shutting down through the API, SIGTERM, and an error
injected after setting raw mode.

Signed-off-by: Alyssa Ross <hi@alyssa.is>
2023-04-18 10:47:33 -07:00
Alyssa Ross
eaf8cbd47d vmm: don't redundantly set the TTY to canon mode
If the VM is shut down, either it's going to be started again, in
which case we still want to be in raw mode, or the process is about to
exit, in which case canon mode will be set at the end of main.

Signed-off-by: Alyssa Ross <hi@alyssa.is>
2023-04-18 10:47:33 -07:00
Alyssa Ross
9d24e862eb vmm: only use KVM_ARM_VCPU_PMU_V3 if available
Having PMU in guests isn't critical, and not all hardware supports
it (e.g. Apple Silicon).

CpuManager::init_pmu already has a fallback for if PMU is not
supported by the VCPU, but we weren't getting that far, because we
would always try to initialise the VCPU with KVM_ARM_VCPU_PMU_V3, and
then bail when it returned with EINVAL.

Signed-off-by: Alyssa Ross <hi@alyssa.is>
2023-04-18 10:47:33 -07:00
Alyssa Ross
11324ac21c virtio-devices: seccomp: add vhost-user syscalls
Cloud Hypervisor's vhost-user implementation will reconnect if it gets
disconnected from the backend.  That means connections happen inside
the vhost-user seccomp sandbox, so all syscalls used in reconnecting
have to be allowed in that sandbox.

clock_nanosleep is used by Glibc, and nanosleep is used by musl.

Signed-off-by: Alyssa Ross <hi@alyssa.is>
2023-04-18 10:47:33 -07:00
Bo Chen
ce75865e2c vmm: Ignore and warn TAP FDs sent via the HTTP request body
Valid FDs can only be sent from another process via `SCM_RIGHTS`.

Signed-off-by: Bo Chen <chen.bo@intel.com>
2023-04-18 10:47:33 -07:00
Michael Zhao
f3522e85fc build: Release v31.0
Signed-off-by: Michael Zhao <michael.zhao@arm.com>
2023-04-06 07:05:11 -07:00
dependabot[bot]
d3ac6a85e0 build: Bump serde_with from 2.3.1 to 2.3.2 in /fuzz
Bumps [serde_with](https://github.com/jonasbb/serde_with) from 2.3.1 to 2.3.2.
- [Release notes](https://github.com/jonasbb/serde_with/releases)
- [Commits](https://github.com/jonasbb/serde_with/compare/v2.3.1...v2.3.2)

---
updated-dependencies:
- dependency-name: serde_with
  dependency-type: indirect
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2023-04-06 00:11:31 +00:00
Alyssa Ross
38a1b45783 vmm: use the SIGWINCH listener for TTYs too
Previously, we were only using it for PTYs, because for PTYs there's
no alternative.  But since we have to have it for PTYs anyway, if we
also use it for TTYs, we can eliminate all of the code that handled
SIGWINCH for TTYs.

Signed-off-by: Alyssa Ross <hi@alyssa.is>
2023-04-05 11:23:06 +01:00
Alyssa Ross
e9841db486 vmm: don't ignore errors from SIGWINCH listener
Now that the SIGWINCH listener has fallbacks for older kernels, we
don't expect it to routinely fail, so if there's an error setting it
up, we want to know about it.

Signed-off-by: Alyssa Ross <hi@alyssa.is>
2023-04-05 11:23:06 +01:00
Alyssa Ross
c1f555cde3 vmm: fall back if CLONE_CLEAR_SIGHAND unsupported
This will allow the SIGWINCH listener to run on kernels older than
5.5, although on those kernels it will have to make 64 syscalls to
reset all the signal handlers.

Signed-off-by: Alyssa Ross <hi@alyssa.is>
2023-04-05 11:23:06 +01:00
Alyssa Ross
505f4dfa53 vmm: close all unused fds in sigwinch listener
The PTY main file descriptor had to be introduced as a parameter to
start_sigwinch_listener, so that it could be closed in the child.
Really the SIGWINCH listener process should not have any file
descriptors open, except for the ones it needs to function, so let's
make it more robust by having it close all other file descriptors.

For recent kernels, we can do this very conveniently with
close_range(2), but for older kernels, we have to fall back to closing
open file descriptors one at a time.

Signed-off-by: Alyssa Ross <hi@alyssa.is>
2023-04-05 11:23:06 +01:00
Alyssa Ross
67ad3ff1ba scripts: run doc tests
Signed-off-by: Alyssa Ross <hi@alyssa.is>
2023-04-05 11:22:47 +01:00
Alyssa Ross
755cabea4c hypervisor: use proper doc tests for examples
It seems like these examples were always intended to be doctests,
since there are lines marked with "#" so that they are excluded from
the generated documentation, but they were not recognised as doc tests
because they were not formatted correctly.

The code needed some adjustments so that it would actually compile and
run as doctests.

Signed-off-by: Alyssa Ross <hi@alyssa.is>
2023-04-05 11:22:47 +01:00