quality_assurance · workflow
Lyft builds a large-scale unsupervised ML model anomaly detection system with automated Slack alerting
ML model observability at Lyft was often neglected, and existing z-score-based anomaly detection generated too many false positives without per-scenario threshold tuning. Historically, domain-specific logic for each implementation made scaling a general-purpose solution across the organization impractical.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear under "Tools used" below.
Stage 1 · Data profiling with whylogs
whylogs builds profiles of various functional, integral, and distribution metrics for model prediction data in a single pass.
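To make "in a single pass" concrete, here is a minimal sketch of a streaming profile over a column of prediction values. It does not use the whylogs API and is not Lyft's implementation. The `ColumnProfile` class, the `profile` helper, and the sample values are hypothetical and only illustrate how count, null count, range, mean, and spread can be accumulated without revisiting the data.

```python
import math
from dataclasses import dataclass


@dataclass
class ColumnProfile:
    """Running summary of one numeric column, updated one value at a time."""
    count: int = 0
    nulls: int = 0
    minimum: float = math.inf
    maximum: float = -math.inf
    mean: float = 0.0
    _m2: float = 0.0  # running sum of squared deviations (Welford's method)

    def update(self, value):
        # Missing values are counted but excluded from the numeric summary.
        if value is None:
            self.nulls += 1
            return
        self.count += 1
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)
        # Welford's update keeps mean and variance numerically stable
        # without storing earlier values.
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def stddev(self):
        # Sample standard deviation; defined as 0.0 for fewer than two values.
        return math.sqrt(self._m2 / (self.count - 1)) if self.count > 1 else 0.0


def profile(values):
    """Build a ColumnProfile by visiting each value exactly once."""
    result = ColumnProfile()
    for value in values:
        result.update(value)
    return result


# Hypothetical model prediction scores, with one missing value.
predictions = [0.2, 0.4, None, 0.6, 0.8]
p = profile(predictions)
```

Because every statistic is updated incrementally, a profile like this can be computed per time window and stored as a compact summary, which is the kind of input a downstream anomaly detector would consume.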
Tools used
whylogs, StatsForecast, AutoARIMA, Fugue, Mode Analytics, Slack
Outcome
Models are automatically onboarded onto the detection system without user setup, drastically reducing turnaround time for acting on broken models. Real-time detection catches anomalous predictions within a few minutes.
What failed first
The z-score-based approach generated too many false positives unless thresholds were manually adjusted per scenario, making it impractical as a general-purpose first line of defense.
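The failure mode can be illustrated with a minimal sketch of a fixed-threshold z-score check. This is an assumption about the general technique, not Lyft's code. The `zscore_flags` function, the threshold of 3.0, and the sample numbers are hypothetical.

```python
from statistics import mean, stdev


def zscore_flags(history, new_values, threshold=3.0):
    """Flag each new value whose z-score against the history exceeds the threshold."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A constant history has no spread, so any different value is flagged.
        return [v != mu for v in new_values]
    return [abs(v - mu) / sigma > threshold for v in new_values]


# Hypothetical weekday values of a model metric.
history = [100, 102, 98, 101, 99]

# 101 is ordinary, 120 is a genuine spike, and 60 stands in for a routine
# weekend dip that the weekday history has never seen.
flags = zscore_flags(history, [101, 120, 60])
```

With one global threshold, the routine dip is flagged exactly like the genuine spike. Suppressing it means raising the threshold or rebuilding the baseline for that scenario, which is the per-scenario tuning the writeup describes as impractical at scale.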
Results
Time saved: drastically reduced
Grounding & classification
Source type: technical build writeup
23 fields verified against source quotes.
anomaly detection, forecasting, predictive analytics, failure mode described, named customer, production runtime claimed, source backed, tools described, workflow described, travel, cycle time reduction, response time reduction, technical build writeup, incident management, quality assurance, monitor detect alert