quality_assurance · workflow

Lyft builds a large-scale unsupervised ML model anomaly detection system with automated Slack alerting

ML model observability at Lyft was often neglected, and existing z-score-based anomaly detection generated too many false positives without per-scenario threshold tuning. Historically, domain-specific logic for each implementation made scaling a general-purpose solution across the organization impractical.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in "Tools used" below.

Stage 1 · Data profiling with whylogs

whylogs builds profiles of various functional, integral, and distribution metrics for model prediction data in a single pass.
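To make "in a single pass" concrete, here is a minimal sketch of a streaming profile that updates count, null count, min, max, mean, and standard deviation one value at a time (Welford's algorithm). It is an illustration in plain Python, not the whylogs API and not Lyft's code. The names `ColumnProfile`, `track`, and `profile_predictions` are hypothetical, and real whylogs profiles track more metrics than this.

```python
import math
from dataclasses import dataclass


@dataclass
class ColumnProfile:
    """Running summary of one numeric column (hypothetical, not whylogs)."""
    count: int = 0
    nulls: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean
    minimum: float = math.inf
    maximum: float = -math.inf

    def track(self, value):
        """Fold one value into the profile without storing the raw data."""
        if value is None:
            self.nulls += 1
            return
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)  # Welford's update
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def stddev(self):
        """Sample standard deviation; 0.0 until two values have been seen."""
        if self.count < 2:
            return 0.0
        return math.sqrt(self.m2 / (self.count - 1))


def profile_predictions(predictions):
    """Build a profile by visiting each prediction exactly once."""
    profile = ColumnProfile()
    for value in predictions:
        profile.track(value)
    return profile


# Hypothetical prediction scores, including one missing value.
profile = profile_predictions([0.2, 0.4, None, 0.6])
print(profile.count, profile.nulls, profile.minimum, profile.maximum)
```

Because the profile keeps only a handful of running aggregates, its size stays constant regardless of how many predictions are profiled, which is what makes this style of summary cheap to compute and store per model.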

Tools used

whylogs, StatsForecast, AutoARIMA, Fugue, Mode Analytics, Slack

Outcome

Models are automatically onboarded onto the detection system without user setup, drastically reducing turnaround time for acting on broken models. Real-time detection catches anomalous predictions within a few minutes.

What failed first

The z-score based approach generated too many false positives unless thresholds were manually adjusted per scenario, making it impractical as a general-purpose first line of defense.
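The threshold sensitivity described above can be sketched with a small example. This is an illustration under stated assumptions, not Lyft's detector: the function name `zscore_anomalies`, the thresholds, and the weekly series are all hypothetical. The series has an ordinary weekday/weekend pattern and contains no real anomaly, yet a threshold of 1.5 flags every weekend point while a threshold of 2.0 flags none.

```python
import statistics


def zscore_anomalies(values, threshold):
    """Return indices whose absolute z-score exceeds the threshold.

    The z-score is computed against the mean and population standard
    deviation of the whole window (a simplifying assumption).
    """
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # a constant series has no outliers under this rule
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]


# Hypothetical metric: four weeks with five weekday values of 100
# followed by two weekend values of 160. Nothing is actually broken.
weekly = ([100.0] * 5 + [160.0] * 2) * 4

print(zscore_anomalies(weekly, 1.5))  # every weekend index is flagged
print(zscore_anomalies(weekly, 2.0))  # nothing is flagged
```

A threshold that is quiet on this seasonal metric may be too loose for a flatter one, which is why a single fixed value does not transfer well across models.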

Results

Time saved: drastically reduced

Source

https://eng.lyft.com/building-a-large-scale-unsupervised-model-anomaly-detection-system-part-2-3690f4c37c5b


Grounding & classification

Source type: technical build writeup

23 fields verified against source quotes.

anomaly detection, forecasting, predictive analytics, failure mode described, named customer, production runtime claimed, source backed, tools described, workflow described, travel, cycle time reduction, response time reduction, technical build writeup, incident management, quality assurance, monitor detect alert