
Full-Spectrum ML Model Monitoring at Lyft

Lyft's ML models make millions of high-stakes decisions per day, but model performance degrades gradually, and that degradation is hard to detect. Before a centralized solution existed, ML engineers built one-off monitoring for each model, which duplicated work and left no centralized visibility across hundreds of models.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools used” below.

Stage 1 · Model scores emitted to metrics

For every model scoring request made to LyftLearn Serving, the system emits the model output to the metrics system.
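
As an illustration of this stage, the sketch below wraps a prediction function so that every scoring request also records the model output as a metric. It is a minimal sketch, not LyftLearn Serving's implementation: `InMemoryMetrics`, `with_score_emission`, and the metric name `model.score` are hypothetical names chosen for the example, and the in-memory sink stands in for whatever metrics system a real deployment would use.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass
class InMemoryMetrics:
    """Hypothetical metrics sink that stores points in a list (assumption for this sketch)."""

    points: List[Tuple[str, float, Dict[str, str]]] = field(default_factory=list)

    def gauge(self, name: str, value: float, tags: Dict[str, str]) -> None:
        # Record one metric point with its tags.
        self.points.append((name, value, tags))


def with_score_emission(
    model_name: str,
    predict: Callable[[Sequence[float]], float],
    metrics: InMemoryMetrics,
) -> Callable[[Sequence[float]], float]:
    """Return a scoring function that emits each model output to the metrics sink."""

    def scored(features: Sequence[float]) -> float:
        score = predict(features)
        # Emit the raw model output, tagged by model, on every request.
        metrics.gauge("model.score", score, {"model": model_name})
        return score

    return scored


# Usage: a toy "model" that averages its inputs.
metrics = InMemoryMetrics()
toy_model = with_score_emission("toy", lambda xs: sum(xs) / len(xs), metrics)
toy_model([4.0, 6.0])
```

Emitting at the serving layer, rather than inside each model, means every model gets score metrics without per-model monitoring code.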

Tools used

LyftLearn Serving, Great Expectations, Spark, Fugue, Kubernetes, Verity

Outcome

Over 90% of Lyft's production models have Feature Validation and Model Score Monitoring, and 75% have Performance Drift Detection or Anomaly Detection. The system fired hundreds of alarms and caught over 15 high-impact issues in the nine months following general availability.

What failed first

When ETA models were retrained due to COVID-related demand drops, downstream pricing models that used ETAs as inputs dramatically under-predicted, revealing an unanticipated cascading dependency between models.
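
One way to surface this kind of cascading shift is to compare the recent distribution of an upstream model's output (here, ETAs consumed as a feature) against a baseline window. The sketch below uses a simple z-score on the mean. This is an assumed technique chosen for illustration, not the check described in the source; the function names and the default threshold of 3.0 are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def mean_shift_zscore(baseline: Sequence[float], recent: Sequence[float]) -> float:
    """Z-score of the recent window's mean relative to the baseline window."""
    if len(baseline) < 2 or not recent:
        raise ValueError("need at least 2 baseline values and 1 recent value")
    sigma = stdev(baseline)
    if sigma == 0:
        raise ValueError("baseline has zero variance; z-score is undefined")
    # Standard error of the mean for a sample the size of the recent window.
    standard_error = sigma / len(recent) ** 0.5
    return (mean(recent) - mean(baseline)) / standard_error


def has_shifted(
    baseline: Sequence[float], recent: Sequence[float], threshold: float = 3.0
) -> bool:
    """Flag the feature when the recent mean is far from the baseline mean."""
    return abs(mean_shift_zscore(baseline, recent)) > threshold


# Usage with made-up ETA values (minutes); these numbers are illustrative only.
baseline_etas = [10.0, 12.0, 11.0, 9.0, 10.0, 11.0, 12.0, 10.0]
retrained_etas = [5.0, 6.0, 5.0, 4.0]
shifted = has_shifted(baseline_etas, retrained_etas)
```

A check like this runs on the downstream model's inputs, so it can fire when an upstream model is retrained even if the downstream model itself has not changed.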

Results

High-impact issues caught: over 15

Production model coverage: over 90%

Running since: early 2020

Source

https://eng.lyft.com/full-spectrum-ml-model-monitoring-at-lyft-a4cdaf828e8f

Grounding & classification

Source type: technical build writeup

24 fields verified against source quotes.

anomaly detection, failure mode described, metric backed, named customer, production runtime claimed, tools described, workflow described, logistics, automation rate, error reduction, technical build writeup, quality assurance, monitor detect alert