Canva recommendation system: handling empty results, irrelevant outputs, and production failures at 60M+ user scale
Canva's personalization system serves over 60 million monthly active users and faces two recurring failure classes. The first is unexpected results: empty recommendations caused by cold start or low model confidence, and irrelevant outputs caused by model imperfections. The second is failure to respond: high latency from large deep learning models, and horizontal scaling limits hit during peak traffic, at hours when most engineers in Australia are asleep.
Canva mitigates recommendation failures with several measures: locale- and platform-specific fallbacks; near-line inference caching, which keeps recommendations reactive to user interactions; metric-threshold deployment gates; visual model reports for debugging; auto-scaling policies; and independent per-model controllers that allow one model to be rolled back or switched off during an incident without affecting other models.
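The fallback idea can be illustrated with a minimal Python sketch. It is not Canva's implementation: the `FALLBACKS` table, the `fallback_for` and `recommend` functions, and the template identifiers are all hypothetical, and the sketch assumes a fallback is chosen by trying the most specific (locale, platform) pair first and then widening to locale-only and global defaults.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical fallback table keyed by (locale, platform).
# None acts as a wildcard; all identifiers are made up for illustration.
FALLBACKS: Dict[Tuple[Optional[str], Optional[str]], List[str]] = {
    ("en-AU", "ios"): ["tpl_au_ios_1", "tpl_au_ios_2"],
    ("en-AU", None): ["tpl_au_1", "tpl_au_2"],
    (None, None): ["tpl_global_1", "tpl_global_2"],
}


def fallback_for(locale: str, platform: str) -> List[str]:
    """Return the most specific non-empty fallback list for this request."""
    for key in ((locale, platform), (locale, None), (None, None)):
        items = FALLBACKS.get(key)
        if items:
            return items
    return []


def recommend(
    model: Callable[[str], List[str]],
    user_id: str,
    locale: str,
    platform: str,
) -> List[str]:
    """Serve model output, or a fallback when the model returns nothing or raises."""
    try:
        items = model(user_id)
    except Exception:
        items = []  # treat a failed call the same as an empty result
    return items if items else fallback_for(locale, platform)
```

Under these assumptions, a cold-start user whose model call returns an empty list still receives a populated response, and the response degrades from platform-specific to locale-specific to global content rather than to nothing.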
In practice, recommendation models have returned no results or irrelevant results, and horizontal scaling limits have been hit multiple times, either because of Canva's fast-growing user base or because new models required larger machines. Some models take around 15 to 20 minutes to scale, which makes roll-forward during an incident impractical.
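When a fix cannot be rolled forward quickly, a per-model control that reverts or disables one model in isolation becomes the practical option. The following Python sketch shows one way such independence could look. The `ModelController` class, its methods, and the model names are hypothetical and are not taken from Canva's system; the sketch assumes each model keeps its own version history and its own on/off switch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ModelController:
    """Tracks deployed versions and an on/off switch for a single model."""

    name: str
    versions: List[str] = field(default_factory=list)  # oldest first
    enabled: bool = True

    def deploy(self, version: str) -> None:
        """Record a new version as the one currently serving."""
        self.versions.append(version)

    def rollback(self) -> str:
        """Drop the newest version and return the one now serving."""
        if len(self.versions) < 2:
            raise RuntimeError(f"{self.name}: no earlier version to roll back to")
        self.versions.pop()
        return self.versions[-1]

    def switch_off(self) -> None:
        """Stop serving this model; callers are expected to use a fallback."""
        self.enabled = False

    @property
    def active_version(self) -> Optional[str]:
        """Version currently serving, or None if switched off or never deployed."""
        if self.enabled and self.versions:
            return self.versions[-1]
        return None


# One controller per model, so an action on one never touches another.
controllers = {name: ModelController(name) for name in ("templates", "photos")}
controllers["templates"].deploy("v1")
controllers["templates"].deploy("v2")
controllers["photos"].deploy("v7")

controllers["templates"].rollback()  # only "templates" reverts, to v1
```

Because state is held per model, rolling back or switching off `templates` in this sketch leaves `photos` serving its current version unchanged.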
https://www.canva.dev/blog/engineering/recommender-systems-when-they-fail-who-are-you-gonna-call/