Distributed Machine Learning at Instacart: training thousands of models with Ray
Instacart's legacy Celery-based distributed task queue could not scale efficiently as ML use cases grew: worker CPU utilization was only 10–15%, queues backed up with anywhere from 300 to more than 1,000 tasks and caused long wait times, and managing Python dependencies across model types was difficult.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · ML application submitted
Users package code and launch it on remote AWS EKS hosts through an internal launcher API that abstracts Ray Cluster access.
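A minimal sketch of what such a launcher could look like, assuming it wraps the open-source Ray Job Submission API. Instacart's internal launcher is not described in detail, so the LaunchRequest class, the launch function, and the default address below are hypothetical names chosen for illustration. Only JobSubmissionClient, submit_job, and the runtime_env keys come from Ray itself.

```python
from dataclasses import dataclass, field


@dataclass
class LaunchRequest:
    """Hypothetical description of one packaged ML application."""

    entrypoint: str                               # shell command to run on the cluster
    working_dir: str = "."                        # local code directory to upload
    pip: list[str] = field(default_factory=list)  # per-job Python dependencies

    def runtime_env(self) -> dict:
        # Ray's runtime_env accepts "working_dir" and "pip" keys, which lets
        # each job carry its own code and dependencies.
        env: dict = {"working_dir": self.working_dir}
        if self.pip:
            env["pip"] = list(self.pip)
        return env


def launch(request: LaunchRequest, address: str = "http://127.0.0.1:8265") -> str:
    """Submit the request to a Ray cluster and return the submission ID.

    The address is an assumption: 8265 is the default Ray dashboard port,
    and a real launcher would resolve the cluster endpoint for the caller.
    """
    # Imported lazily so a request can be built and inspected without Ray installed.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(address)
    return client.submit_job(
        entrypoint=request.entrypoint,
        runtime_env=request.runtime_env(),
    )
```

A caller would build a request such as LaunchRequest(entrypoint="python train.py", pip=["xgboost"]) and pass it to launch. Declaring dependencies per job in this way is one plausible answer to the dependency-management problem noted for the Celery setup.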
Tools used
Ray, Griffin, Celery, Airflow, Ray Job API, Ray Core APIs, Ray AIR, Ray Serve, Kubernetes
Outcome
After migrating to Ray-based distributed ML, CPU utilization rose from 10–15% to up to 80%, and end-to-end completion time for a production fulfillment model dropped from approximately 4 hours to 20 minutes, with concurrent workers rising from 10 to 70+.
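The headline figures can be turned into ratios with a few lines of arithmetic. This is a worked example using only the numbers quoted above. Because the source says "approximately 4 hours" and "70+", the results are rough bounds rather than exact measurements, and the constant names are illustrative.

```python
# Figures quoted in the outcome summary, treated as rough bounds.
BEFORE_MINUTES = 4 * 60   # ~4 hours end to end on the legacy system
AFTER_MINUTES = 20        # after the migration to Ray
WORKERS_BEFORE = 10
WORKERS_AFTER = 70        # reported as "70+", so this is a lower bound


def ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, rejecting a zero denominator."""
    if denominator == 0:
        raise ValueError("denominator must be non-zero")
    return numerator / denominator


speedup = ratio(BEFORE_MINUTES, AFTER_MINUTES)           # 12.0
minutes_saved = BEFORE_MINUTES - AFTER_MINUTES           # 220
concurrency_gain = ratio(WORKERS_AFTER, WORKERS_BEFORE)  # 7.0

print(f"speedup: {speedup:.0f}x, saved: {minutes_saved} min, workers: {concurrency_gain:.0f}x")
```

On these figures the run is about 12 times faster while the worker count grows by about 7 times. That is consistent with each worker also being used more fully, though the source does not break down how much each factor contributed.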
What failed first
The Celery-based legacy system over-provisioned worker nodes (which still sat idle over 60% of the day), could not increase concurrency without upscaling hardware, and was too complex to replicate locally for testing.
Results
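Time saved: ~3 hours 40 minutes per run (from ~4 hours to 20 minutes)
Volume: concurrent workers up from 10 to 70+; CPU utilization up from 10–15% to as much as 80%
Cost replaced: reduced spending on computation resources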
Grounding & classification
Source type: technical build writeup
34 fields verified against source quotes, 2 dropped as unverifiable.
forecasting, predictive analytics, builder submitted, failure mode described, metric backed, named customer, production runtime claimed, workflow described, ecommerce, cost reduction, employee productivity, throughput increase, time saved, technical build writeup, back office ops, agentic task execution