back_office_ops · workflow

Distributed Machine Learning at Instacart: training thousands of models with Ray

Instacart's legacy Celery-based distributed task queue could not scale efficiently as ML use cases grew: worker CPU utilization was only 10–15%, queues accumulated 300–1k+ tasks leading to long wait times, and managing Python dependencies across model types was difficult.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases — not tied to any one vendor's stack. Click any stage to read what happens there. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · ML application submitted

Users package code and launch it on remote AWS EKS hosts through an internal launcher API that abstracts Ray Cluster access.

Tools used

RayGriffinCeleryAirflowRay Job APIRay Core APIsRay AIRRay ServeKubernetes

Outcome

After migrating to Ray-based distributed ML, CPU utilization rose from 10–15% to up to 80%, and end-to-end completion time for a production fulfillment model dropped from approximately 4 hours to 20 minutes, with concurrent workers rising from 10 to 70+.

What failed first

The Celery-based legacy system over-provisioned worker nodes (which still sat idle over 60% of the day), could not increase concurrency without upscaling hardware, and was too complex to replicate locally for testing.

Results

Time saved~4 hours

Volume10% to 15%

Cost replacedreduces costs on computation resources

Source

https://tech.instacart.com/distributed-machine-learning-at-instacart-4b11d7569423

How we source this →

Grounding & classification

Source type: technical build writeup

34 fields verified against source quotes, 2 dropped as unverifiable.

forecastingpredictive analyticsbuilder submittedfailure mode describedmetric backednamed customerproduction runtime claimedworkflow describedecommercecost reductionemployee productivitythroughput increasetime savedtechnical build writeupback office opsagentic task execution