Canva rebuilds GPU-accelerated ML container infrastructure with Kubernetes and Nix
Canva's ML Platform team set out to rebuild its cloud GPU container base images from scratch using Nix. The initial images could not find the GPU at runtime, and the root cause was hard to isolate across a stack spanning the OS, drivers, container runtime, and image layers.
Canva built Nix-based GPU container images that correctly mount NVIDIA driver files and support GPU-accelerated ML workloads on Kubernetes; the stack has run in production for over 12 months.
The initial Nix-built base images failed for three distinct reasons: the required NVIDIA_DRIVER_CAPABILITIES environment variable was absent, preventing the NVIDIA container runtime from mounting driver libraries; the distroless Nix container's LD_LIBRARY_PATH was incompatible with the Amazon Linux 2 host; and the images shipped without the dynamic linker, causing mapped NVIDIA utilities to report 'No such file or directory'.
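The three failure modes above can be turned into a small pre-flight check that runs inside a candidate image. The following is a minimal sketch, not Canva's tooling: the function name `diagnose_gpu_image`, the linker path `/lib64/ld-linux-x86-64.so.2` (the usual glibc location on x86_64), and the driver library directory `/usr/lib64` are assumptions made for illustration and would need adjusting to the actual host and image layout.

```python
import os
from typing import Callable, List, Mapping

# Assumption: glibc dynamic linker location on an x86_64 image.
DYNAMIC_LINKER = "/lib64/ld-linux-x86-64.so.2"
# Assumption: directory where host driver libraries appear in the container.
DRIVER_LIB_DIR = "/usr/lib64"


def diagnose_gpu_image(
    env: Mapping[str, str],
    path_exists: Callable[[str], bool] = os.path.exists,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems: List[str] = []

    # 1. Without this variable the NVIDIA container runtime does not
    #    mount driver libraries into the container.
    if not env.get("NVIDIA_DRIVER_CAPABILITIES"):
        problems.append("NVIDIA_DRIVER_CAPABILITIES is unset or empty")

    # 2. The library search path must include the directory where the
    #    host's driver libraries are mounted (hypothetical location here).
    search_dirs = [d for d in env.get("LD_LIBRARY_PATH", "").split(":") if d]
    if DRIVER_LIB_DIR not in search_dirs:
        problems.append(f"LD_LIBRARY_PATH does not include {DRIVER_LIB_DIR}")

    # 3. Dynamically linked utilities fail with 'No such file or directory'
    #    when the interpreter they request is missing from the image.
    if not path_exists(DYNAMIC_LINKER):
        problems.append(f"dynamic linker missing at {DYNAMIC_LINKER}")

    return problems
```

Calling `diagnose_gpu_image(os.environ)` inside the container and printing each returned string gives a quick first pass before debugging the runtime itself. The `path_exists` parameter exists only so the check can be exercised without a real filesystem.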