AMD iGPU RAM theft, you know how sensitive these BIOS settings are.
If you're on Proxmox 8.4+, the "happy path" is to use the q35 machine type. The older i440fx is more prone to these PCI mapping failures and IRQ conflicts. I also found that preventing the card from entering deep power states helps avoid the "zombie GPU" scenario where the card is physically there but logically dead.
To stabilize this, I switched the VM to q35 and explicitly enabled PCIe mode for the passthrough device. I also told the host, through the device's d3cold_allowed attribute in sysfs, not to let the GPU drop into D3cold, which I've found reduces the randomness of the PCIe bus scan.
# 1. Change VM to q35 machine type for better PCIe support
qm set <VMID> --machine q35
# 2. Pass through the GPU with pcie=1 to ensure it's treated as a PCIe device
# Replace <PCI_ADDRESS> with your current address (e.g., 0000:01:00.0)
qm set <VMID> --hostpci0 <PCI_ADDRESS>,pcie=1
# 3. To stop the GPU from entering D3cold (which can cause boot-time instability)
# Run this as root on the Proxmox host; the value resets on reboot, so reapply it at boot
echo 0 > /sys/bus/pci/devices/0000:<PCI_BUS>:<PCI_SLOT>.0/d3cold_allowed
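Because that sysfs write has to be repeated after every reboot, it may help to wrap it in a small function that checks the address before touching anything. This is only a sketch: the function name disable_d3cold and the SYSFS_PCI_ROOT override are my own illustrative choices, not anything Proxmox ships, and the override exists purely so the function can be tried against a scratch directory.

```shell
#!/bin/sh
# Sketch only: disable_d3cold and SYSFS_PCI_ROOT are hypothetical names.
# Usage: disable_d3cold <full PCI address, e.g. 0000:01:00.0>
disable_d3cold() {
    addr="$1"
    # Leave SYSFS_PCI_ROOT unset on a real host; it only redirects the
    # lookup to another directory for dry runs.
    root="${SYSFS_PCI_ROOT:-/sys/bus/pci/devices}"
    attr="$root/$addr/d3cold_allowed"

    # Expect domain:bus:slot.function, all hex, function 0-7.
    if ! printf '%s\n' "$addr" |
        grep -Eq '^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$'; then
        echo "invalid PCI address: $addr" >&2
        return 2
    fi

    # If the device has moved, the attribute will not be at this path.
    if [ ! -w "$attr" ]; then
        echo "d3cold_allowed missing or not writable: $attr" >&2
        return 1
    fi

    echo 0 > "$attr"
}
```

Calling `disable_d3cold 0000:01:00.0` as root from whatever you run at boot reapplies the setting. A return code of 1 is a useful early warning that the card is no longer at the address you expected.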
If the addresses keep shifting despite these changes, you're fighting your motherboard's firmware. At that point, I stopped fighting the VM abstraction and moved the NVIDIA drivers directly onto the Proxmox host. I then used the NVIDIA Container Toolkit to expose the GPU to my Kubernetes worker. It removes the PCI address fragility entirely because the host driver handles the hardware mapping, and the containers just see the device.
The lesson here is that PCI addresses are not constants; they are suggestions. If your workload requires 100% uptime and you can't guarantee a static PCI map, stop using VM passthrough and move the driver to the host.
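If you have to stay on passthrough for a while, one way to soften the problem is to stop hard-coding the address and look it up from the card's vendor and device ID before the VM starts. The following is a minimal sketch under that assumption, not something I've described above: find_pci_by_id and SYSFS_PCI_ROOT are hypothetical names, and it relies only on the vendor and device files that sysfs exposes for each PCI device.

```shell
#!/bin/sh
# Sketch only: find_pci_by_id and SYSFS_PCI_ROOT are hypothetical names.
# Usage: find_pci_by_id <vendor_hex> <device_hex>
# Prints the full PCI address of every matching device, one per line.
# Returns 0 if at least one device matched, 1 otherwise.
find_pci_by_id() {
    # sysfs stores IDs as lowercase hex with a 0x prefix, e.g. 0x10de.
    want_vendor="0x$(printf '%s' "$1" | tr 'A-F' 'a-f')"
    want_device="0x$(printf '%s' "$2" | tr 'A-F' 'a-f')"
    root="${SYSFS_PCI_ROOT:-/sys/bus/pci/devices}"
    status=1

    for dev in "$root"/*; do
        # Skip entries that lack readable ID files.
        [ -r "$dev/vendor" ] && [ -r "$dev/device" ] || continue
        if [ "$(cat "$dev/vendor")" = "$want_vendor" ] &&
            [ "$(cat "$dev/device")" = "$want_device" ]; then
            basename "$dev"
            status=0
        fi
    done

    return "$status"
}

# Possible use before starting the VM (IDs come from `lspci -nn`):
#   addr="$(find_pci_by_id <VENDOR_ID> <DEVICE_ID> | head -n 1)" &&
#       qm set <VMID> --hostpci0 "$addr,pcie=1"
```

This only tracks where the card landed; it does nothing for a card that fails to enumerate at all, and with two identical GPUs it prints both addresses, so you would still need a rule for picking one.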