docs: Provide documentation for creating custom image for VFIO CI

Extend the existing `custom-image.md` document with a new section on how
to create a custom image that contains NVIDIA drivers that are required
for our VFIO baremetal CI.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Sebastien Boeuf 2022-11-25 11:04:06 +01:00
parent e23f4e0783
commit 9f7ccb34cd


@@ -158,3 +158,172 @@ as we might need to update the direct kernel boot command line, replacing
`/dev/vda1` with the appropriate partition number.
Update all references to the previous image name to the new one.
## NVIDIA image for VFIO baremetal CI
This section describes how to create a cloud image that contains the NVIDIA
drivers required by our VFIO baremetal CI.
### Download base image
We usually start from one of the custom cloud images we have previously
created, but a stock cloud image works as well.
```bash
wget https://cloud-hypervisor.azureedge.net/jammy-server-cloudimg-amd64-custom-20221118-1.raw
mv jammy-server-cloudimg-amd64-custom-20221118-1.raw jammy-server-cloudimg-amd64-nvidia.raw
```
### Extend the image size
The NVIDIA drivers take up a lot of disk space, so we must resize the image
before going any further.
```bash
qemu-img resize jammy-server-cloudimg-amd64-nvidia.raw 5G
```
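For a raw image, the file's apparent size is the virtual disk size, so the
resize can be checked by comparing byte counts. The following is a minimal
sketch, assuming `qemu-img` interprets the `G` suffix as GiB (1024-based); the
helper names `gib_to_bytes` and `image_size_bytes` are hypothetical.

```bash
# Convert a size in GiB to bytes (assumes 1024-based units).
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Print the apparent size of a file in bytes.
image_size_bytes() {
    echo $(( $(wc -c < "$1") ))
}

# Usage (after the resize):
#   [ "$(image_size_bytes jammy-server-cloudimg-amd64-nvidia.raw)" -eq "$(gib_to_bytes 5)" ] \
#       && echo "image is 5 GiB"
```

5 GiB is 5368709120 bytes, which `parted` reports in decimal units as 5369MB.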
### Resize the partition
We use `parted` to fix the GPT after the image has been resized, and to grow
the `Linux` partition (partition 1, which holds the `ext4` filesystem).
```bash
sudo parted jammy-server-cloudimg-amd64-nvidia.raw
(parted) print
Warning: Not all of the space available to jammy-server-cloudimg-amd64-nvidia.raw
appears to be used, you can fix the GPT to use all of the space (an extra 5873664
blocks) or continue with the current setting?
Fix/Ignore? Fix
Model: (file)
Disk jammy-server-cloudimg-amd64-nvidia.raw: 5369MB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:
Number  Start   End     Size    File system  Name  Flags
14      1049kB  5243kB  4194kB                     bios_grub
15      5243kB  116MB   111MB   fat32              boot, esp
1       116MB   2361MB  2245MB  ext4
(parted) resizepart 1 5369MB
(parted) print
Model: (file)
Disk jammy-server-cloudimg-amd64-nvidia.raw: 5369MB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:
Number  Start   End     Size    File system  Name  Flags
14      1049kB  5243kB  4194kB                     bios_grub
15      5243kB  116MB   111MB   fat32              boot, esp
1       116MB   5369MB  5252MB  ext4
(parted) quit
```
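The numbers in the `parted` warning can be cross-checked with a little
arithmetic. The sketch below assumes that the "blocks" in the warning are the
512-byte logical sectors reported by `print`; the function names are
hypothetical.

```bash
SECTOR_SIZE=512

# Number of sectors in a disk of the given size in GiB.
sectors_in_gib() {
    echo $(( $1 * 1024 * 1024 * 1024 / SECTOR_SIZE ))
}

# Size in bytes of the image before the resize, given the new size in GiB
# and the number of extra blocks reported by parted.
original_size_bytes() {
    echo $(( ($(sectors_in_gib "$1") - $2) * SECTOR_SIZE ))
}

sectors_in_gib 5                 # prints 10485760
original_size_bytes 5 5873664    # prints 2361393152
```

The second value, roughly 2361MB in decimal units, matches the end of
partition 1 before it was resized.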
### Create a macvtap interface
Follow the [macvtap documentation](docs/macvtap-bridge.md) to set up a
macvtap interface that gives your VM network connectivity.
### Boot the image
It is particularly important to boot with a `cloud-init` disk attached to the
VM as it will automatically resize the Linux `ext4` filesystem based on the
partition that we have previously resized.
```bash
./cloud-hypervisor \
--kernel hypervisor-fw \
--disk path=jammy-server-cloudimg-amd64-nvidia.raw path=/tmp/ubuntu-cloudinit.img \
--cpus boot=4 \
--memory size=4G \
--net fd=3,mac=$mac 3<>"$tapdevice"
```
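Once the VM has booted, it is worth confirming that the root filesystem
actually grew. The following is a minimal sketch to run inside the guest;
`fs_size_kib` and `fs_at_least_kib` are hypothetical helpers, and the 4 GiB
threshold is an arbitrary assumption chosen to sit between the old and new
partition sizes.

```bash
# Print the total size, in KiB, of the filesystem holding the given path.
fs_size_kib() {
    df -Pk "$1" | awk 'NR == 2 { print $2 }'
}

# Succeed if the filesystem holding the path is at least the given size in KiB.
fs_at_least_kib() {
    [ "$(fs_size_kib "$1")" -ge "$2" ]
}

# Usage (inside the guest), with 4 GiB = 4194304 KiB:
#   fs_at_least_kib / 4194304 && echo "root filesystem was resized"
```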
### Bring up connectivity
If your network has a DHCP server, run the following from your VM:
```bash
sudo dhclient
```
Otherwise, assign an IP address manually (the addresses below depend on your
actual network) and set the DNS server address as well.
```bash
sudo ip addr add 192.168.2.10/24 dev ens4
sudo ip link set up dev ens4
sudo ip route add default via 192.168.2.1
sudo resolvectl dns ens4 8.8.8.8
```
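The interface name `ens4` depends on the VM configuration. The following is a
small sketch for discovering candidate names, assuming the standard Linux
sysfs layout under `/sys/class/net`; `list_ifaces` is a hypothetical helper.

```bash
# Print every network interface except loopback, one per line.
# The optional argument overrides the sysfs directory (useful for testing).
list_ifaces() {
    net_dir=${1:-/sys/class/net}
    for entry in "$net_dir"/*; do
        [ -e "$entry" ] || continue
        name=${entry##*/}
        [ "$name" = "lo" ] || echo "$name"
    done
}

# Usage (inside the guest):
#   list_ifaces
```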
#### Check connectivity and update the image
```bash
sudo apt update
sudo apt upgrade
```
### Install NVIDIA drivers
The following steps and commands are taken from the
[NVIDIA official documentation](https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html#ubuntu-lts)
about Tesla compute cards.
```bash
distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')
wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-key del 7fa2af80
sudo apt update
sudo apt -y install cuda-drivers
```
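The `distribution` variable selects the repository directory in the download
URL. The sketch below isolates that derivation so it can be checked against
any `os-release` file; `distro_tag` is a hypothetical helper, not part of the
NVIDIA instructions.

```bash
# Print the repository tag derived from an os-release file
# (defaults to /etc/os-release).
# The subshell keeps the sourced variables out of the caller's environment.
distro_tag() {
    (
        . "${1:-/etc/os-release}"
        echo "$ID$VERSION_ID" | sed -e 's/\.//g'
    )
}

# Usage (inside the guest):
#   distro_tag
```

On an Ubuntu 22.04 guest, where `ID=ubuntu` and `VERSION_ID="22.04"`, this
yields `ubuntu2204`.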
### Check the `nvidia-smi` tool
Quickly validate that you can find and run the `nvidia-smi` command from your
VM. At this point it should fail, since no NVIDIA card has been passed through
to the VM and therefore no NVIDIA driver is loaded.
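A minimal sketch of that check, separating "the tool is installed" from "the
tool runs successfully"; `have_tool` is a hypothetical helper.

```bash
# Succeed if the named command can be found in PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Usage (inside the guest):
#   have_tool nvidia-smi && echo "nvidia-smi is installed"
#   nvidia-smi || echo "failed as expected: no GPU is attached"
```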
### Workaround LA57 reboot issue
Add `reboot=a` to `GRUB_CMDLINE_LINUX` in `/etc/default/grub` so that the VM
boots with the ACPI reboot type. This resolves a reboot issue when running on
systems with 5-level paging.
```bash
sudo vim /etc/default/grub
sudo update-grub
sudo reboot
```
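The manual edit can also be scripted. The following is a sketch only,
assuming `GRUB_CMDLINE_LINUX` is defined on a single double-quoted line;
`add_reboot_acpi` is a hypothetical helper.

```bash
# Append reboot=a to GRUB_CMDLINE_LINUX in the given file, unless it is
# already present.
add_reboot_acpi() {
    file=$1
    if grep -q '^GRUB_CMDLINE_LINUX=".*reboot=a' "$file"; then
        return 0
    fi
    tmp=$(mktemp)
    # The first expression appends the argument; the second drops the leading
    # space left behind when the original value was empty.
    sed -e 's/^\(GRUB_CMDLINE_LINUX="\)\(.*\)"$/\1\2 reboot=a"/' \
        -e 's/^\(GRUB_CMDLINE_LINUX="\) /\1/' "$file" > "$tmp"
    # Write back with cat so the file keeps its owner and permissions.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

Because `/etc/default/grub` is owned by root, one option is to apply the
function to a copy, install the copy with `sudo cp`, and then run
`sudo update-grub` as above.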
### Remove previous logins
Since our integration tests rely on past logins to count the number of reboots,
we must clear the login records.
```bash
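# These files are owned by root, so truncate them with sudo.
sudo truncate -s 0 /var/log/lastlog
sudo truncate -s 0 /var/log/wtmp
sudo truncate -s 0 /var/log/btmp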
```
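The truncation step can be wrapped so that the result is verified as well. The
following is a minimal sketch meant to run as root; `clear_files` is a
hypothetical helper.

```bash
# Truncate each given file to zero length, then confirm it is empty.
clear_files() {
    for f in "$@"; do
        : > "$f" || return 1
        [ ! -s "$f" ] || return 1
    done
}

# Usage (as root, inside the guest):
#   clear_files /var/log/lastlog /var/log/wtmp /var/log/btmp
```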
### Clear history
```bash
history -c
rm /home/cloud/.bash_history
```
### Reset cloud-init
This is mandatory, as we want `cloud-init` provisioning to run again when a new
VM is booted with this image.
```bash
sudo cloud-init clean
```