
LyftLearn migrates ML training infrastructure from Kubernetes to AWS SageMaker hybrid architecture

Lyft's all-Kubernetes ML platform required custom orchestration logic for every new capability, suffered from unreliable state management via background watcher scripts, and consumed increasing engineering capacity on infrastructure rather than ML platform features.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools used” below.

Stage 1 · Job submission via multiple channels

LyftLearn Service receives job requests from the LyftLearn UI, Airflow DAGs, and CI/CD pipelines.
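As a rough illustration of this stage, the sketch below funnels requests from the three channels into a single job record. The `JobService` class, the `JobRequest` fields, and the in-memory store are hypothetical names chosen for this example; the source does not describe LyftLearn Service's actual API.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Channel(Enum):
    """Submission channels named in the write-up."""
    UI = "ui"
    AIRFLOW = "airflow"
    CI_CD = "ci_cd"


@dataclass
class JobRequest:
    # Hypothetical fields; a real request would carry far more detail.
    channel: Channel
    image: str
    command: List[str]
    job_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class JobService:
    """Toy stand-in for a service that accepts jobs from several channels."""

    def __init__(self) -> None:
        self._jobs: Dict[str, JobRequest] = {}

    def submit(self, channel: str, image: str, command: List[str]) -> str:
        if not command:
            raise ValueError("command must not be empty")
        # Channel(channel) raises ValueError for an unknown channel name.
        request = JobRequest(Channel(channel), image, list(command))
        self._jobs[request.job_id] = request
        return request.job_id

    def get(self, job_id: str) -> JobRequest:
        return self._jobs[job_id]
```

Normalizing every channel into one record type means downstream stages never need to know whether a job came from a notebook user, a scheduled DAG, or a pipeline.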

Tools used

Kubernetes, AWS SageMaker, JupyterLab, Airflow, EventBridge, SQS, S3, ECR, EKS, Confidant, Spark, SOCI, Envoy, Kubeflow, Katib, Vizier

Outcome

Migrating LyftLearn Compute to SageMaker reduced infrastructure incidents significantly, cut notebook startup times by 40–50%, eliminated idle compute costs, and freed the ML platform team to focus on platform capabilities rather than low-level infrastructure management.

What failed first

The fleet of background watcher scripts for synchronizing Kubernetes cluster state was inherently unreliable: training containers could succeed while Kubernetes marked jobs as failed, event streams timed out or arrived out of order, and container statuses transitioned inconsistently between watchers.
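One of these failure modes, events arriving late or out of order, can be illustrated with a small sketch. The `JobStateStore` below is a hypothetical example, not Lyft's implementation: it assumes each status event carries an emitter-side timestamp, ignores events older than the one already recorded, and treats terminal statuses as final. The status names are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Illustrative terminal statuses; once reached, later events are ignored.
TERMINAL = {"Completed", "Failed", "Stopped"}


@dataclass(frozen=True)
class StatusEvent:
    job_id: str
    status: str
    timestamp: float  # assumed event time set by the emitter


class JobStateStore:
    """Keeps the latest known status per job, tolerant of reordering."""

    def __init__(self) -> None:
        self._latest: Dict[str, StatusEvent] = {}

    def apply(self, event: StatusEvent) -> bool:
        """Record the event; return False if it was stale or redundant."""
        current = self._latest.get(event.job_id)
        if current is not None:
            if current.status in TERMINAL:
                return False  # terminal state is final
            if event.timestamp <= current.timestamp:
                return False  # duplicate or out-of-order delivery
        self._latest[event.job_id] = event
        return True

    def status(self, job_id: str) -> Optional[str]:
        event = self._latest.get(job_id)
        return event.status if event else None
```

This only addresses ordering and duplicates. It does not resolve the case where the container and the orchestrator disagree about the outcome, which requires a single authoritative source of job status.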

Results

Time saved: 40–50% (notebook startup time)

Cost replaced: reduced by eliminating idle cluster resources

Source

https://eng.lyft.com/lyftlearn-evolution-rethinking-ml-platform-architecture-547de6c950e1


Grounding & classification

Source type: technical build writeup

36 fields verified against source quotes, 1 dropped as unverifiable.

Tags: anomaly detection, forecasting, fraud detection, predictive analytics, failure mode described, metric backed, named customer, production runtime claimed, source backed, tools described, workflow described, travel, cost reduction, cycle time reduction, employee productivity, error reduction, technical build writeup, back office ops, monitor detect alert