
Lyft builds distributed ML model profiling pipeline for large-scale anomaly detection

LyftLearn hosts a large and growing number of ML models that together make hundreds of millions of predictions daily. Features and traffic patterns vary so widely across models that no single monitoring logic could cover them all, and a prior z-score approach generated too many false positives to be useful.

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Inference request sampling
All model inference requests are instrumented, sampled, and stored for downstream profiling.
Tools used
Spark, Whylogs, Fugue, Kubernetes, Hive
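
The sampling step in Stage 1 can be sketched as follows. The source does not say how Lyft samples requests, so the hash-based scheme, the `request_id` field, and the function names here are assumptions made for illustration. Hashing the request ID makes the decision deterministic: the same request is always kept or always dropped, regardless of which worker sees it.

```python
import hashlib


def should_sample(request_id: str, rate: float) -> bool:
    """Deterministically keep roughly `rate` of all request IDs (assumed scheme)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to an integer bucket in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Compare in integer space so rate=0.0 keeps nothing and rate=1.0 keeps all.
    return bucket < int(rate * 2**64)


def sample_requests(requests, rate):
    """Return the subset of request records selected for downstream profiling."""
    return [r for r in requests if should_sample(r["request_id"], rate)]
```

In a real pipeline the kept records would be written to a table for the profiling job to read; that storage step is omitted here.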
Outcome

All model features and predictions are now automatically profiled daily via a distributed pipeline built on Spark, Fugue, and Whylogs, with new models onboarded automatically and no manual action required.
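
As a rough illustration of what profiling every model without manual onboarding can look like, the sketch below groups sampled records by model and feature and computes a few summary statistics. It is a plain-Python stand-in, not the Whylogs API and not Lyft's implementation; the record layout and the model and feature names are hypothetical. Because the grouping keys come from the data itself, a model that starts logging requests is profiled without any registration step.

```python
from collections import defaultdict


def profile_by_model(records):
    """Summarize each (model, feature) column found in the sampled records.

    Assumed record shape: {"model": str, "features": {name: number or None}}.
    """
    columns = defaultdict(list)
    for rec in records:
        for name, value in rec["features"].items():
            columns[(rec["model"], name)].append(value)

    profiles = {}
    for key, values in columns.items():
        present = [v for v in values if v is not None]
        profiles[key] = {
            "count": len(values),
            "nulls": len(values) - len(present),
            "min": min(present) if present else None,
            "max": max(present) if present else None,
            "mean": sum(present) / len(present) if present else None,
        }
    return profiles


# Hypothetical sample: two models, neither registered anywhere in advance.
sample = [
    {"model": "eta", "features": {"distance_km": 2.0, "hour": 8}},
    {"model": "eta", "features": {"distance_km": 4.0, "hour": None}},
    {"model": "pricing", "features": {"surge": 1.5}},
]
profiles = profile_by_model(sample)
```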

What failed first

A prior z-score-based anomaly detection approach produced too many false positives: model features and predictions can deviate statistically without implying a real problem, and seasonality is a key cause of such deviations.
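
The failure mode can be reproduced on synthetic data. The sketch below assumes a daily feature mean with a weekly peak, a threshold of 2.0, and eight weeks of history; none of these values come from the source. A plain z-score over the whole series flags every weekly peak, even though each peak is expected. Comparing each day only with the same weekday is shown purely as a contrast to make the seasonality visible; it is not presented as the method Lyft adopted.

```python
from statistics import mean, pstdev

Z_THRESHOLD = 2.0  # assumed cutoff, not taken from the source


def zscore_flags(values, threshold=Z_THRESHOLD):
    """Indices whose absolute z-score against the whole series exceeds the threshold."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


def seasonal_zscore_flags(values, period=7, threshold=Z_THRESHOLD):
    """Same test, but each point is compared only with points at the same phase."""
    flags = []
    for phase in range(period):
        idx = list(range(phase, len(values), period))
        local = zscore_flags([values[i] for i in idx], threshold)
        flags.extend(idx[j] for j in local)
    return sorted(flags)


# Synthetic series: 8 weeks, one high day per week, small week-to-week drift.
daily_means = [
    (200 if day % 7 == 6 else 100) + (day // 7) % 3 - 1 for day in range(56)
]

global_flags = zscore_flags(daily_means)             # every weekly peak is flagged
seasonal_flags = seasonal_zscore_flags(daily_means)  # nothing is flagged
```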

Source

https://eng.lyft.com/building-a-large-scale-unsupervised-model-anomaly-detection-system-part-1-aca4766a823c


Grounding & classification
Source type: technical build writeup
18 fields verified against source quotes.
anomaly detection, failure mode described, named customer, production runtime claimed, source backed, tools described, workflow described, logistics, automation rate, technical build writeup, quality assurance, monitor detect alert