
LyftLearn migrates ML training infrastructure from Kubernetes to a hybrid AWS SageMaker architecture

Lyft's all-Kubernetes ML platform required custom orchestration logic for every new capability, suffered from unreliable state management via background watcher scripts, and consumed increasing engineering capacity on infrastructure rather than ML platform features.

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear under “Tools used” below.
Stage 1 · Job submission via multiple channels
LyftLearn Service receives job requests from the LyftLearn UI, Airflow DAGs, and CI/CD pipelines.
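As a rough illustration of a single service accepting jobs from several channels, the sketch below normalizes every submission into one record type. It is not LyftLearn's actual code: the names `Channel`, `JobRequest`, and `JobIntake`, and the `owner` and `image` fields, are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Channel(Enum):
    """Submission channels named in the stage description."""
    UI = "ui"
    AIRFLOW = "airflow"
    CI_CD = "ci_cd"


@dataclass(frozen=True)
class JobRequest:
    """Hypothetical normalized job record; fields are illustrative only."""
    channel: Channel
    owner: str
    image: str


class JobIntake:
    """Hypothetical single entry point shared by every channel."""

    def __init__(self) -> None:
        self._queue: list[JobRequest] = []

    def submit(self, channel: str, owner: str, image: str) -> JobRequest:
        # Channel(...) raises ValueError for an unknown channel name,
        # so malformed submissions are rejected before they are queued.
        request = JobRequest(Channel(channel), owner, image)
        self._queue.append(request)
        return request

    def pending(self) -> list[JobRequest]:
        # Return a copy so callers cannot mutate the internal queue.
        return list(self._queue)
```

Routing every channel through one `submit` call means validation and queuing logic live in one place, whichever client sent the job.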
Tools used
Kubernetes, AWS SageMaker, JupyterLab, Airflow, EventBridge, SQS, S3, ECR, EKS, Confidant, Spark, SOCI, Envoy, Kubeflow, Katib, Vizier
Outcome

Migrating LyftLearn Compute to SageMaker reduced infrastructure incidents significantly, cut notebook startup times by 40–50%, eliminated idle compute costs, and freed the ML platform team to focus on platform capabilities rather than low-level infrastructure management.

What failed first

The fleet of background watcher scripts for synchronizing Kubernetes cluster state was inherently unreliable: training containers could succeed while Kubernetes marked jobs as failed, event streams timed out or arrived out of order, and container statuses transitioned inconsistently between watchers.
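The out-of-order failure mode can be illustrated with a small sketch. It is not Lyft's watcher code: the `StatusEvent` type and its `sequence` field are assumptions made for the example, standing in for whatever ordering information an event source might provide. A watcher that trusts the most recently arrived event ends up with a stale status, while one that compares sequence numbers does not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StatusEvent:
    job_id: str
    sequence: int  # hypothetical counter that increases with each emitted event
    status: str


def last_arrival_wins(events):
    """Naive watcher: records whichever event arrived most recently."""
    state = {}
    for event in events:
        state[event.job_id] = event.status
    return state


def highest_sequence_wins(events):
    """Keeps only the event with the highest sequence number per job."""
    latest = {}
    for event in events:
        current = latest.get(event.job_id)
        if current is None or event.sequence > current.sequence:
            latest[event.job_id] = event
    return {job_id: event.status for job_id, event in latest.items()}


# Events listed in arrival order; the sequence-2 event was delayed in transit.
arrived = [
    StatusEvent("job-1", 1, "Running"),
    StatusEvent("job-1", 3, "Succeeded"),
    StatusEvent("job-1", 2, "Failed"),
]

print(last_arrival_wins(arrived))      # {'job-1': 'Failed'}
print(highest_sequence_wins(arrived))  # {'job-1': 'Succeeded'}
```

Sequence comparison only helps when the emitter supplies a trustworthy ordering. It does not address events that time out and never arrive, which the writeup also lists as a failure.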

Results
Time saved: 40–50% (notebook startup time)
Cost replaced: reduced by eliminating idle cluster resources
Source

https://eng.lyft.com/lyftlearn-evolution-rethinking-ml-platform-architecture-547de6c950e1


Grounding & classification
Source type: technical build writeup
36 fields verified against source quotes, 1 dropped as unverifiable.
anomaly detection, forecasting, fraud detection, predictive analytics, failure mode described, metric backed, named customer, production runtime claimed, source backed, tools described, workflow described, travel, cost reduction, cycle time reduction, employee productivity, error reduction, technical build writeup, back office ops, monitor detect alert