diff --git a/docs/custom-image.md b/docs/custom-image.md
index eec127ce2..132b25d62 100644
--- a/docs/custom-image.md
+++ b/docs/custom-image.md
@@ -158,3 +158,172 @@ as we might need to update the direct kernel boot command line, replacing
 `/dev/vda1` with the appropriate partition number.
 
 Update all references to the previous image name to the new one.
+
+## NVIDIA image for VFIO baremetal CI
+
+This section describes how to create a cloud image that contains the NVIDIA
+drivers required by our VFIO baremetal CI.
+
+### Download base image
+
+We usually start from one of the custom cloud images we created previously,
+but a stock cloud image works as well.
+
+```bash
+wget https://cloud-hypervisor.azureedge.net/jammy-server-cloudimg-amd64-custom-20221118-1.raw
+mv jammy-server-cloudimg-amd64-custom-20221118-1.raw jammy-server-cloudimg-amd64-nvidia.raw
+```
+
+### Extend the image size
+
+The NVIDIA drivers consume a lot of space, which is why we must resize the image
+before we proceed any further.
+
+```bash
+qemu-img resize jammy-server-cloudimg-amd64-nvidia.raw 5G
+```
+
+### Resize the partition
+
+We use `parted` to fix the GPT after the image has been resized, and to resize
+the Linux partition.
+
+```bash
+sudo parted jammy-server-cloudimg-amd64-nvidia.raw
+
+(parted) print
+Warning: Not all of the space available to jammy-server-cloudimg-amd64-nvidia.raw
+appears to be used, you can fix the GPT to use all of the space (an extra 5873664
+blocks) or continue with the current setting?
+Fix/Ignore? Fix
+Model:  (file)
+Disk jammy-server-cloudimg-amd64-nvidia.raw: 5369MB
+Sector size (logical/physical): 512B/512B
+Partition Table: gpt
+Disk Flags:
+
+Number  Start   End     Size    File system  Name  Flags
+14      1049kB  5243kB  4194kB                     bios_grub
+15      5243kB  116MB   111MB   fat32              boot, esp
+ 1      116MB   2361MB  2245MB  ext4
+
+(parted) resizepart 1 5369MB
+(parted) print
+Model:  (file)
+Disk jammy-server-cloudimg-amd64-nvidia.raw: 5369MB
+Sector size (logical/physical): 512B/512B
+Partition Table: gpt
+Disk Flags:
+
+Number  Start   End     Size    File system  Name  Flags
+14      1049kB  5243kB  4194kB                     bios_grub
+15      5243kB  116MB   111MB   fat32              boot, esp
+ 1      116MB   5369MB  5252MB  ext4
+
+(parted) quit
+```
+
+### Create a macvtap interface
+
+Follow the [macvtap documentation](macvtap-bridge.md) to set up a macvtap
+interface that provides your VM with network connectivity.
+
+### Boot the image
+
+It is particularly important to boot with a `cloud-init` disk attached to the
+VM, as `cloud-init` will automatically resize the Linux `ext4` filesystem to
+match the partition that we resized previously.
+
+```bash
+./cloud-hypervisor \
+    --kernel hypervisor-fw \
+    --disk path=jammy-server-cloudimg-amd64-nvidia.raw path=/tmp/ubuntu-cloudinit.img \
+    --cpus boot=4 \
+    --memory size=4G \
+    --net fd=3,mac=$mac 3<>"$tapdevice"
+```
+
+### Bring up connectivity
+
+If your network has a DHCP server, run the following from your VM:
+
+```bash
+sudo dhclient
+```
+
+Otherwise, assign an IP address manually (the addresses shown here depend on
+your actual network) and set the DNS server IP address as well.
+
+```bash
+sudo ip addr add 192.168.2.10/24 dev ens4
+sudo ip link set up dev ens4
+sudo ip route add default via 192.168.2.1
+sudo resolvectl dns ens4 8.8.8.8
+```
+
+#### Check connectivity and update the image
+
+```bash
+sudo apt update
+sudo apt upgrade
+```
+
+### Install NVIDIA drivers
+
+The following steps and commands are taken from the
+[NVIDIA official documentation](https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html#ubuntu-lts)
+about Tesla compute cards.
+
+```bash
+distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')
+wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-keyring_1.0-1_all.deb
+sudo dpkg -i cuda-keyring_1.0-1_all.deb
+sudo apt-key del 7fa2af80
+sudo apt update
+sudo apt -y install cuda-drivers
+```
+
+### Check the `nvidia-smi` tool
+
+Check that the `nvidia-smi` command is installed and can be run from your VM.
+At this point it is expected to fail: no NVIDIA card has been passed through
+to the VM, so no NVIDIA driver is loaded.
+
+### Work around the LA57 reboot issue
+
+Add `reboot=a` to `GRUB_CMDLINE_LINUX` in `/etc/default/grub` so that the VM
+boots with the ACPI reboot type. This resolves a reboot issue when
+running on 5-level paging systems.
+
+```bash
+sudo vim /etc/default/grub
+sudo update-grub
+sudo reboot
+```
+
+### Remove previous logins
+
+Since our integration tests rely on past logins to count the number of reboots,
+we must make sure to clear the login records.
+
+```bash
+sudo truncate -s 0 /var/log/lastlog
+sudo truncate -s 0 /var/log/wtmp
+sudo truncate -s 0 /var/log/btmp
+```
+
+### Clear history
+
+```bash
+history -c
+rm /home/cloud/.bash_history
+```
+
+### Reset cloud-init
+
+This is mandatory, as we want `cloud-init` provisioning to run again when a new
+VM is booted with this image.
+
+```bash
+sudo cloud-init clean
+```
\ No newline at end of file