quality_assurance · ecommerce · workflow
DoorDash builds AutoEval: LLM-powered automated search relevance evaluation at scale
DoorDash's search quality evaluation relied on human annotation that could not scale. Annotation cycles took days or weeks, individual raters interpreted guidelines differently (which introduced label noise), and datasets overrepresented high-frequency queries while underrepresenting the tail queries where relevance problems hide.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack.
Stage 1 · Query sampling from live traffic
Real user queries are sampled from live traffic across intent, frequency, geographic, and daypart dimensions.
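One way to picture this stage is stratified sampling: group logged queries by the listed dimensions and draw a capped number from each group, so tail queries are not crowded out by head queries. The sketch below is illustrative only and is not DoorDash's implementation. The record fields (`intent`, `frequency`, `region`, `daypart`), the function name `stratified_sample`, and the example data are all assumptions.

```python
import random
from collections import defaultdict


def stratified_sample(queries, keys, per_stratum, seed=0):
    """Draw up to `per_stratum` queries from each combination of `keys`.

    `queries` is a list of dicts; `keys` names the (hypothetical) fields
    that define a stratum, e.g. intent, frequency bucket, region, daypart.
    """
    rng = random.Random(seed)  # seeded so the sample is reproducible
    strata = defaultdict(list)
    for q in queries:
        strata[tuple(q[k] for k in keys)].append(q)

    sample = []
    for key in sorted(strata):  # stable iteration order across runs
        members = strata[key]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


# Hypothetical traffic log: five head queries in one stratum, one tail query.
traffic = [
    {"query": "pizza", "intent": "dish", "frequency": "head",
     "region": "west", "daypart": "dinner"}
    for _ in range(5)
] + [
    {"query": "gluten free injera", "intent": "dish", "frequency": "tail",
     "region": "west", "daypart": "dinner"}
]

dimensions = ["intent", "frequency", "region", "daypart"]
sample = stratified_sample(traffic, dimensions, per_stratum=2)
```

With a cap of two per stratum, the head stratum contributes two queries and the tail stratum contributes its single query, so the tail is represented despite being rare in raw traffic.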
Tools used
LLMs, AutoEval, OpenAI
Outcome
AutoEval reduced relevance judgment turnaround time by 98% compared to human evaluation and unlocked a nine-fold increase in capacity, while fine-tuned LLMs consistently match or outperform external raters in key relevance tasks. Expert raters were freed from repetitive labeling to focus on guideline development and edge case resolution.
Results
Time saved: 98% reduction in relevance judgment turnaround time
Volume: nine-fold increase in capacity
Quality: fine-tuned LLMs consistently match or outperform external raters in key relevance tasks
Grounding & classification
Source type: technical build writeup
26 fields verified against source quotes.
Tags: quality inspection, product catalog, human review described, metric backed, named customer, production runtime claimed, tools described, workflow described, ecommerce, accuracy improvement, cycle time reduction, employee productivity, throughput increase, technical build writeup, ecommerce ops, quality assurance, human review queue, monitor detect alert