LyftLearn migrates ML training infrastructure from Kubernetes to an AWS SageMaker hybrid architecture
Lyft's all-Kubernetes ML platform required custom orchestration logic for every new capability, suffered from unreliable state management via background watcher scripts, and consumed increasing engineering capacity on infrastructure rather than ML platform features.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Job submission via multiple channels
LyftLearn Service receives job requests from the LyftLearn UI, Airflow DAGs, and CI/CD pipelines.
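A minimal sketch of this stage, assuming each channel sends a dictionary payload that the service normalizes into one common request shape. The names `Channel`, `JobRequest`, and `normalize_request`, and the payload fields, are hypothetical illustrations, not LyftLearn's actual API.

```python
from dataclasses import dataclass
from enum import Enum


class Channel(Enum):
    """Hypothetical labels for the three submission channels named above."""
    UI = "ui"
    AIRFLOW = "airflow"
    CI_CD = "ci_cd"


@dataclass(frozen=True)
class JobRequest:
    """One common request shape, regardless of where the job came from."""
    job_id: str
    channel: Channel
    image: str
    command: tuple


def normalize_request(raw: dict) -> JobRequest:
    """Convert a channel-specific payload into a JobRequest.

    Raises ValueError if the channel is not one of the known channels.
    """
    return JobRequest(
        job_id=raw["job_id"],
        channel=Channel(raw["channel"]),
        image=raw["image"],
        command=tuple(raw.get("command", ())),
    )


request = normalize_request(
    {
        "job_id": "train-001",
        "channel": "airflow",
        "image": "example/trainer:latest",
        "command": ["python", "train.py"],
    }
)
```

Normalizing at the entry point means the rest of the service can handle one request type rather than branching on the submitting channel.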
Migrating LyftLearn Compute to SageMaker reduced infrastructure incidents significantly, cut notebook startup times by 40–50%, eliminated idle compute costs, and freed the ML platform team to focus on platform capabilities rather than low-level infrastructure management.
What failed first
The fleet of background watcher scripts for synchronizing Kubernetes cluster state was inherently unreliable: training containers could succeed while Kubernetes marked jobs as failed, event streams timed out or arrived out of order, and container statuses transitioned inconsistently between watchers.
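The out-of-order failure mode can be illustrated with a small sketch. It assumes each status event carries a sequence number from its emitter, which is a hypothetical field; the event shape and function names are illustrative and do not represent Lyft's watcher code or the Kubernetes event API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StatusEvent:
    job_id: str
    sequence: int  # hypothetical counter assigned by the emitter
    status: str


def apply_in_arrival_order(events):
    """Naive watcher: the last event to arrive wins."""
    state = {}
    for event in events:
        state[event.job_id] = event.status
    return state


def apply_by_sequence(events):
    """Keep only the event with the highest sequence number per job."""
    latest = {}
    for event in events:
        current = latest.get(event.job_id)
        if current is None or event.sequence > current.sequence:
            latest[event.job_id] = event
    return {job_id: event.status for job_id, event in latest.items()}


# The "Running" event is delayed and arrives after "Succeeded".
events = [
    StatusEvent("job-1", 1, "Pending"),
    StatusEvent("job-1", 3, "Succeeded"),
    StatusEvent("job-1", 2, "Running"),
]

naive_state = apply_in_arrival_order(events)
ordered_state = apply_by_sequence(events)
```

Here the naive watcher records the job as still running even though it has finished. Ordering by sequence only helps when such a counter exists and no events are lost, so it does not address the timeouts described above.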
Results
Time saved: 40–50%
Cost replaced: reduced by eliminating idle cluster resources