quality_assurance · ecommerce · workflow

DoorDash builds AutoEval: LLM-powered automated search relevance evaluation at scale

DoorDash's search quality evaluation relied on human annotation that could not scale. Annotation cycles took days or weeks; individual raters interpreted guidelines differently, causing label noise; and datasets overrepresented high-frequency queries while underrepresenting tail queries, where relevance problems hide.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear under “Tools used” below.

Stage 1 · Query sampling from live traffic

Real user queries are sampled from live traffic across intent, frequency, geographic, and daypart dimensions.
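
One way to keep tail queries from being drowned out by high-frequency ones is stratified sampling: group logged queries by the dimensions above and draw a fixed number from each group. The sketch below is a minimal illustration under stated assumptions, not DoorDash's documented method. The field names (intent, frequency, region, daypart), the equal per-stratum allocation, and the function name stratified_sample are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(queries, keys, per_stratum, seed=0):
    """Draw up to `per_stratum` queries from every combination of `keys`.

    `queries` is a list of dicts (hypothetical log records); `keys` names
    the fields that define a stratum, e.g. intent, frequency, region, daypart.
    """
    rng = random.Random(seed)  # seeded so the same logs yield the same sample

    # Group log records by their stratum (one tuple of field values each).
    strata = defaultdict(list)
    for record in queries:
        strata[tuple(record[k] for k in keys)].append(record)

    # Sample without replacement inside each stratum; small strata are
    # taken whole rather than raising an error.
    sample = []
    for stratum in sorted(strata):
        rows = strata[stratum]
        sample.extend(rng.sample(rows, min(per_stratum, len(rows))))
    return sample


if __name__ == "__main__":
    # Toy traffic: 90 head queries and 10 tail queries (made-up data).
    logs = [
        {"query": f"q{i}", "intent": "dish", "frequency": "head",
         "region": "west", "daypart": "dinner"}
        for i in range(90)
    ] + [
        {"query": f"q{i}", "intent": "dish", "frequency": "tail",
         "region": "west", "daypart": "dinner"}
        for i in range(90, 100)
    ]
    picked = stratified_sample(
        logs, ("intent", "frequency", "region", "daypart"), per_stratum=5
    )
    for record in picked:
        print(record["frequency"], record["query"])
```

Equal allocation deliberately overweights rare strata relative to their traffic share; a proportional or weighted allocation would be an equally valid choice if the evaluation set should mirror live traffic.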

Tools used

LLMs · AutoEval · OpenAI

Outcome

AutoEval reduced relevance judgment turnaround time by 98% compared to human evaluation and unlocked a nine-fold increase in capacity, while fine-tuned LLMs consistently match or outperform external raters in key relevance tasks. Expert raters were freed from repetitive labeling to focus on guideline development and edge case resolution.

Results

Time saved: 98% reduction

Volume: consistently match or outperform external raters

Source

https://careersatdoordash.com/blog/doordash-llms-to-evaluate-search-result-pages/


Grounding & classification

Source type: technical build writeup

26 fields verified against source quotes.

quality inspection · product catalog · human review described · metric backed · named customer · production runtime claimed · tools described · workflow described · ecommerce · accuracy improvement · cycle time reduction · employee productivity · throughput increase · technical build writeup · ecommerce ops · quality assurance · human review queue · monitor detect alert