
DoorDash builds a clusterless ML feature store serving 130M HMGETs per second

DoorDash's Redis-based ML feature store hit vertical scalability limits as the dataset grew, and a subsequent hybrid approach using a relational database added operational complexity without resolving the underlying scalability constraint.

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Batch feature data upload to S3
All feature data is consolidated from various sources via batch job and uploaded in parquet format to an S3 bucket.
Tools used
Apache Kvrocks, RocksDB, Redis, S3, Kubernetes
Outcome

The clusterless ML feature store reached production, supporting a peak load of over 130M HMGETs per second for 1.6B retrieved features within a 50ms P999 latency target, with dynamically deployed capacity that grows with business demand.
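For readers unfamiliar with the access pattern behind the HMGET figure, the following is a minimal sketch of HMGET semantics: one call reads several fields of a single hash and returns the values in the requested order, with a null for any field that is absent. The in-memory class, the key `consumer:42`, and the feature names are hypothetical stand-ins for illustration. This is not DoorDash's code and not a real Redis or Kvrocks client.

```python
from typing import Dict, List, Mapping, Optional


class InMemoryHashStore:
    """Toy stand-in for a Redis-compatible hash store (assumption: illustration only)."""

    def __init__(self) -> None:
        # Maps a hash key (e.g. one entity) to its field -> value mapping.
        self._hashes: Dict[str, Dict[str, str]] = {}

    def hset(self, key: str, mapping: Mapping[str, str]) -> None:
        """Set one or more fields on the hash stored at `key`."""
        self._hashes.setdefault(key, {}).update(mapping)

    def hmget(self, key: str, fields: List[str]) -> List[Optional[str]]:
        """Return values for `fields` in request order; None for missing fields or keys."""
        stored = self._hashes.get(key, {})
        return [stored.get(field) for field in fields]


# Hypothetical entity key and feature names.
store = InMemoryHashStore()
store.hset("consumer:42", {"order_count_7d": "3", "avg_basket": "27.5"})

# One HMGET retrieves several features for one entity in a single round trip.
features = store.hmget("consumer:42", ["order_count_7d", "not_a_feature", "avg_basket"])
print(features)
```

Because a single call returns many fields, the number of retrieved features can be much larger than the number of HMGET calls.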

What failed first

An initial Redis-only feature store hit per-instance vertical scale limits, and a hybrid Redis-plus-relational-database approach temporarily relieved the bottleneck but became unmanageable when the dataset doubled and the relational cluster reached 1,000-plus nodes.

Results
Time saved: 50ms P999
Volume: around 900,000 ML evaluations per second
Cost replaced: 100 times more expensive
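To put the headline numbers in proportion, here is a rough back-of-the-envelope calculation. It assumes, without confirmation from the source, that the 130M HMGETs, the 1.6B retrieved features, and the 900,000 ML evaluations are all per-second rates observed at the same peak. If they were measured over different windows, the ratios below do not hold.

```python
# Figures quoted in this summary; treating all three as simultaneous
# per-second peak rates is an assumption, not a statement from the source.
HMGETS_PER_SECOND = 130_000_000
FEATURES_PER_SECOND = 1_600_000_000
EVALUATIONS_PER_SECOND = 900_000

# Average number of features returned by one HMGET call.
features_per_hmget = FEATURES_PER_SECOND / HMGETS_PER_SECOND

# Average number of HMGET calls issued per ML evaluation.
hmgets_per_evaluation = HMGETS_PER_SECOND / EVALUATIONS_PER_SECOND

# Average number of features retrieved per ML evaluation.
features_per_evaluation = FEATURES_PER_SECOND / EVALUATIONS_PER_SECOND

print(f"features per HMGET:      {features_per_hmget:.1f}")
print(f"HMGETs per evaluation:   {hmgets_per_evaluation:.1f}")
print(f"features per evaluation: {features_per_evaluation:.1f}")
```

Under that assumption, each HMGET returns roughly a dozen features and each evaluation fans out to well over a hundred HMGETs.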
Source

https://careersatdoordash.com/blog/doordash-clusterless-ml-feature-store/


Grounding & classification
Source type: technical build writeup
28 fields verified against source quotes.
personalization, predictive analytics, product catalog, failure mode described, metric backed, named customer, production runtime claimed, tools described, workflow described, ecommerce, logistics, cost reduction, cycle time reduction, throughput increase, technical build writeup, back office ops, data sync enrichment