
Canva rebuilds GPU-accelerated ML container infrastructure with Kubernetes and Nix

Canva's ML Platform team needed to rebuild its cloud GPU container base images from scratch using Nix. The initial images failed to find the GPU at runtime, and the root cause was hard to isolate because it could have been in any layer of the stack: OS, drivers, container runtime, or image layers.

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · ML workload submitted to cluster
An ML platform user submits a GPU-accelerated program to the Kubernetes cluster.
Tools used
Kubernetes, PyTorch, TensorFlow, CUDA
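
To make Stage 1 concrete, here is a minimal sketch of a GPU workload submission, written as a Python function that builds a Kubernetes Pod manifest. It is illustrative only and is not taken from Canva's writeup. The pod name, image reference, and command are hypothetical placeholders, and the `nvidia.com/gpu` resource name assumes the NVIDIA device plugin is installed on the cluster.

```python
import json
from typing import Any, Dict, Sequence


def gpu_pod_manifest(
    name: str, image: str, command: Sequence[str], gpus: int = 1
) -> Dict[str, Any]:
    """Build a Pod manifest that requests `gpus` NVIDIA GPUs.

    Assumes the NVIDIA device plugin exposes GPUs to the scheduler
    under the extended resource name "nvidia.com/gpu".
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    "command": list(command),
                    # Extended resources such as GPUs are requested via limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical name, image, and command, for illustration only.
    manifest = gpu_pod_manifest(
        name="gpu-smoke-test",
        image="registry.example.com/ml-base:latest",
        command=["python", "train.py"],
    )
    print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed manifest could be piped to `kubectl apply -f -` on a cluster where these assumptions hold.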
Outcome

Canva successfully built Nix-based GPU container images that correctly mount NVIDIA driver files and support GPU-accelerated ML workloads on Kubernetes. The stack has been running in production for over 12 months.

What failed first

The initial Nix-built base images failed for three distinct reasons: the required NVIDIA_DRIVER_CAPABILITIES environment variable was absent, preventing the NVIDIA container runtime from mounting driver libraries; the distroless Nix container's LD_LIBRARY_PATH was incompatible with the Amazon Linux 2 host; and the images shipped without the dynamic linker, causing mapped NVIDIA utilities to report 'No such file or directory'.
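
The three failure modes above can be sketched as a small preflight check that runs inside the container. This is an illustration under stated assumptions, not Canva's tooling. The function name is hypothetical, the `/usr/lib64` driver library directory is an assumed mount location for an Amazon Linux 2 host, and the dynamic linker path is the standard glibc location on x86-64.

```python
import os
from typing import Callable, List, Mapping

# Assumption: directory where the container runtime mounts host driver libraries.
ASSUMED_DRIVER_LIB_DIR = "/usr/lib64"
# Standard glibc dynamic linker path on x86-64 Linux.
DYNAMIC_LINKER = "/lib64/ld-linux-x86-64.so.2"


def diagnose_gpu_container(
    env: Mapping[str, str] = os.environ,
    path_exists: Callable[[str], bool] = os.path.exists,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems: List[str] = []

    # 1. Without this variable the NVIDIA container runtime does not know
    #    which driver capabilities (and therefore which libraries) to mount.
    if not env.get("NVIDIA_DRIVER_CAPABILITIES"):
        problems.append("NVIDIA_DRIVER_CAPABILITIES is not set")

    # 2. The mounted driver libraries must be on the library search path.
    search_dirs = [
        os.path.normpath(d)
        for d in env.get("LD_LIBRARY_PATH", "").split(":")
        if d
    ]
    if ASSUMED_DRIVER_LIB_DIR not in search_dirs:
        problems.append(
            f"LD_LIBRARY_PATH does not include {ASSUMED_DRIVER_LIB_DIR}"
        )

    # 3. Mounted host binaries need the dynamic linker they were linked against;
    #    if it is missing, running them fails with 'No such file or directory'.
    if not path_exists(DYNAMIC_LINKER):
        problems.append(f"dynamic linker {DYNAMIC_LINKER} is missing")

    return problems


if __name__ == "__main__":
    for problem in diagnose_gpu_container():
        print("WARN:", problem)
```

Passing `env` and `path_exists` as parameters keeps the check testable outside a real GPU container. The defaults read the live environment and filesystem.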

Results
Time saved: over 12 months
Running since: over 12 months before publication
Source

https://www.canva.dev/blog/engineering/supporting-gpu-accelerated-machine-learning-with-kubernetes-and-nix/

Grounding & classification
Source type: technical build writeup
15 fields verified against source quotes, 3 dropped as unverifiable.
Tags: computer vision, personalization, recommendation system, failure mode described, production runtime claimed, tools described, workflow described, software, automation rate, technical build writeup, back office ops