
DoorDash builds a clusterless ML feature store serving 130M HMGETs per second

DoorDash's Redis-based ML feature store hit vertical scalability limits as the dataset grew, and a subsequent hybrid approach using a relational database added operational complexity without resolving the underlying scalability constraint.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools used” below.

Stage 1 · Batch feature data upload to S3

All feature data is consolidated from various sources via batch job and uploaded in parquet format to an S3 bucket.
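As a rough illustration of this stage, the sketch below merges feature rows from several sources into one record per entity and derives a date-partitioned object key for the batch. It is a minimal sketch under stated assumptions, not DoorDash's implementation: the function names, feature names, key layout, and bucket name are all hypothetical, and the Parquet write and S3 upload appear only as comments because they depend on third-party libraries (pyarrow and boto3).

```python
from collections import defaultdict
from datetime import date


def consolidate(sources):
    """Merge feature rows from several sources into one row per entity.

    Each source is a list of (entity_id, feature_name, value) tuples.
    Later sources overwrite earlier ones for the same feature.
    """
    merged = defaultdict(dict)
    for rows in sources:
        for entity_id, feature_name, value in rows:
            merged[entity_id][feature_name] = value
    # Sort by entity id so the output batch is deterministic.
    return [{"entity_id": eid, **feats} for eid, feats in sorted(merged.items())]


def object_key(prefix, run_date, part):
    """Build a date-partitioned object key (this layout is hypothetical)."""
    return f"{prefix}/dt={run_date.isoformat()}/part-{part:05d}.parquet"


# Hypothetical sample data standing in for two upstream feature sources.
store_rows = [("store:1", "avg_prep_minutes", 12.5)]
consumer_rows = [("store:1", "orders_7d", 340), ("store:2", "orders_7d", 87)]

batch = consolidate([store_rows, consumer_rows])
key = object_key("features", date(2024, 1, 15), 0)

# With pyarrow and boto3 installed, the batch could then be written and uploaded
# (the bucket name below is a placeholder):
#   import pyarrow as pa, pyarrow.parquet as pq, boto3
#   pq.write_table(pa.Table.from_pylist(batch), "/tmp/part.parquet")
#   boto3.client("s3").upload_file("/tmp/part.parquet", "my-feature-bucket", key)
```

Partitioning the key by run date keeps each batch run separate, so a downstream loader can pick up one day's files without scanning the whole bucket.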

Tools used

Apache Kvrocks, RocksDB, Redis, S3, Kubernetes

Outcome

The clusterless ML feature store reached production, supporting a peak load of over 130M HMGETs per second for 1.6B retrieved features within a 50ms P999 latency target, with dynamically deployed capacity that grows with business demand.

What failed first

An initial Redis-only feature store hit per-instance vertical scale limits. A hybrid Redis-plus-relational-database approach temporarily relieved the bottleneck, but it became unmanageable when the dataset doubled and the relational cluster reached 1,000-plus nodes.

Results

Time saved: 50ms P999

Volume: around 900,000 ML evaluations per second

Cost replaced: 100 times more expensive

Source

https://careersatdoordash.com/blog/doordash-clusterless-ml-feature-store/

Grounding & classification

Source type: technical build writeup

28 fields verified against source quotes.

personalization, predictive analytics, product catalog, failure mode described, metric backed, named customer, production runtime claimed, tools described, workflow described, ecommerce, logistics, cost reduction, cycle time reduction, throughput increase, technical build writeup, back office ops, data sync enrichment