docs: Provide documentation for creating custom image for VFIO CI

Extend the existing `custom-image.md` document with a new section on how to create a custom image that contains NVIDIA drivers that are required for our VFIO baremetal CI.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2025-03-20 07:58:55 +00:00 · 2022-11-25 11:04:06 +01:00 · 2022-11-25 11:04:06 +01:00 · 9f7ccb34cd
commit 9f7ccb34cd
parent e23f4e0783
1 changed files with 169 additions and 0 deletions
--- a/docs/custom-image.md
+++ b/docs/custom-image.md
@@ -158,3 +158,172 @@ as we might need to update the direct kernel boot command line, replacing
 `/dev/vda1` with the appropriate partition number.

 Update all references to the previous image name to the new one.
+
+## NVIDIA image for VFIO baremetal CI
+
+This section describes how to create a cloud image that contains the
+necessary NVIDIA drivers for our VFIO baremetal CI.
+
+### Download base image
+
+We usually start from one of the custom cloud images we have previously
+created, but a stock cloud image works as well.
+
+```bash
+wget https://cloud-hypervisor.azureedge.net/jammy-server-cloudimg-amd64-custom-20221118-1.raw
+mv jammy-server-cloudimg-amd64-custom-20221118-1.raw jammy-server-cloudimg-amd64-nvidia.raw
+```
+
+### Extend the image size
+
+The NVIDIA drivers consume lots of space, which is why we must resize the image
+before we proceed any further.
+
+```bash
+qemu-img resize jammy-server-cloudimg-amd64-nvidia.raw 5G
+```
+
+### Resize the partition
+
+We use `parted` for fixing the GPT after the image was resized, as well as for
+resizing the `Linux` partition.
+
+```bash
+sudo parted jammy-server-cloudimg-amd64-nvidia.raw
+
+(parted) print
+Warning: Not all of the space available to jammy-server-cloudimg-amd64-nvidia.raw
+appears to be used, you can fix the GPT to use all of the space (an extra 5873664
+blocks) or continue with the current setting?
+Fix/Ignore? Fix
+Model:  (file)
+Disk jammy-server-cloudimg-amd64-nvidia.raw: 5369MB
+Sector size (logical/physical): 512B/512B
+Partition Table: gpt
+Disk Flags:
+
+Number  Start   End     Size    File system  Name  Flags
+14      1049kB  5243kB  4194kB                     bios_grub
+15      5243kB  116MB   111MB   fat32              boot, esp
+ 1      116MB   2361MB  2245MB  ext4
+
+(parted) resizepart 1 5369MB
+(parted) print
+Model:  (file)
+Disk jammy-server-cloudimg-amd64-nvidia.raw: 5369MB
+Sector size (logical/physical): 512B/512B
+Partition Table: gpt
+Disk Flags:
+
+Number  Start   End     Size    File system  Name  Flags
+14      1049kB  5243kB  4194kB                     bios_grub
+15      5243kB  116MB   111MB   fat32              boot, esp
+ 1      116MB   5369MB  5252MB  ext4
+
+(parted) quit
+```
+
+### Create a macvtap interface
+
+Follow the [macvtap documentation](macvtap-bridge.md) to set up a macvtap
+interface that provides your VM with proper connectivity.
+
+### Boot the image
+
+It is particularly important to boot with a `cloud-init` disk attached to the
+VM as it will automatically resize the Linux `ext4` filesystem based on the
+partition that we have previously resized.
+
+```bash
+./cloud-hypervisor \
+	--kernel hypervisor-fw  \
+	--disk path=jammy-server-cloudimg-amd64-nvidia.raw path=/tmp/ubuntu-cloudinit.img \
+	--cpus boot=4 \
+	--memory size=4G \
+	--net fd=3,mac=$mac 3<>$"$tapdevice"
+```
+
+### Bring up connectivity
+
+If your network has a DHCP server, run the following from your VM:
+
+```bash
+sudo dhclient
+```
+
+Otherwise, assign an IP address manually (the addresses below depend on your
+actual network) and set the DNS server IP address as well.
+
+```bash
+sudo ip addr add 192.168.2.10/24 dev ens4
+sudo ip link set up dev ens4
+sudo ip route add default via 192.168.2.1
+sudo resolvectl dns ens4 8.8.8.8
+```
+
+#### Check connectivity and update the image
+
+```bash
+sudo apt update
+sudo apt upgrade
+```
+
+### Install NVIDIA drivers
+
+The following steps and commands are referenced from the
+[NVIDIA official documentation](https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html#ubuntu-lts)
+about Tesla compute cards.
+
+```bash
+distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')
+wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-keyring_1.0-1_all.deb
+sudo dpkg -i cuda-keyring_1.0-1_all.deb
+sudo apt-key del 7fa2af80
+sudo apt update
+sudo apt -y install cuda-drivers
+```
+
+### Check the `nvidia-smi` tool
+
+Quickly validate that you can find and run the `nvidia-smi` command from your
+VM. At this point it should fail: no NVIDIA card has been passed through to
+the VM, therefore no NVIDIA driver is loaded.
+
+### Work around the LA57 reboot issue
+
+Add `reboot=a` to `GRUB_CMDLINE_LINUX` in `/etc/default/grub` so that the VM
+will be booted with the ACPI reboot type. This resolves a reboot issue when
+running on 5-level paging systems.
+
+```bash
+sudo vim /etc/default/grub
+sudo update-grub
+sudo reboot
+```
+
+### Remove previous logins
+
+Since our integration tests rely on past logins to count the number of reboots,
+we must make sure the list is cleared.
+
+```bash
+sudo truncate -s 0 /var/log/lastlog
+sudo truncate -s 0 /var/log/wtmp
+sudo truncate -s 0 /var/log/btmp
+```
+
+### Clear history
+
+```bash
+history -c
+rm /home/cloud/.bash_history
+```
+
+### Reset cloud-init
+
+This is mandatory as we want `cloud-init` provisioning to work again when a new
+VM is booted from this image.
+
+```bash
+sudo cloud-init clean
+```
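
The login-record and history cleanup steps in the patch could be wrapped in one helper. This is only a sketch and not part of the commit: `clean_image_state` is a hypothetical function name, and its optional root-directory argument is an assumption added so the helper can be tried on a scratch directory before being run as root inside the VM.

```shell
#!/bin/sh
# Hypothetical helper (not part of the patch): empties the login records and
# removes the shell history below the given root directory. With no argument
# it operates on the real /var/log and /home/cloud paths.
clean_image_state() {
    root="${1:-}"
    for log in lastlog wtmp btmp; do
        # Truncate instead of deleting, so an existing file keeps its
        # ownership and permissions. A missing file is created empty.
        : > "$root/var/log/$log"
    done
    # -f keeps this step from failing when no history file exists yet.
    rm -f "$root/home/cloud/.bash_history"
}
```

Inside the VM this would be called without arguments from a root shell, followed by `history -c` and `sudo cloud-init clean` as described in the document.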