
DoorDash builds calibrated LLM-as-a-judge to evaluate natural language search quality at scale

DoorDash's natural language search pipeline accepted vague, intent-based queries that had no historical click-through ground truth, so it could not be evaluated with traditional search metrics. Human annotation cycles took 2–5 days and produced inconsistent labels for compositional queries because the rubric was ambiguous. A supervised relevance model trained on those labels performed barely above chance.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · Natural language query submitted

Users submit queries expressing vague intent or multiple constraints that cannot be captured by keyword tokens alone.

Tools used

o3-mini, Qwen3, semantic embeddings

Outcome

DoorDash replaced periodic human annotation with a calibrated LLM judge that runs continuously in production monitoring and as a PR-level quality gate. The judge enables per-facet evaluation, which surfaced real performance gaps previously invisible in aggregate NDCG scores.
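
To illustrate why per-facet evaluation can reveal gaps that an aggregate score hides, here is a minimal Python sketch of a PR-level gate. It is not DoorDash's implementation: the facet names, the `facet_pass_rates` and `quality_gate` helpers, the `max_drop` threshold, and the verdicts themselves are all hypothetical, and the judge call that would produce the verdicts is left out.

```python
from collections import defaultdict


def facet_pass_rates(verdicts):
    """Aggregate (facet, passed) judge verdicts into a pass rate per facet."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for facet, passed in verdicts:
        totals[facet] += 1
        passes[facet] += int(passed)
    return {facet: passes[facet] / totals[facet] for facet in totals}


def quality_gate(baseline, candidate, max_drop=0.02):
    """Return the facets whose pass rate fell by more than max_drop."""
    return sorted(
        facet
        for facet, base_rate in baseline.items()
        if base_rate - candidate.get(facet, 0.0) > max_drop
    )


# Made-up verdicts for illustration only; real ones would come from the judge.
baseline = facet_pass_rates(
    [("cuisine", True), ("cuisine", True), ("dietary", True), ("dietary", True)]
)
candidate = facet_pass_rates(
    [("cuisine", True), ("cuisine", True), ("dietary", True), ("dietary", False)]
)

failing = quality_gate(baseline, candidate)
print(failing)  # facets that would block the PR in this toy setup
```

In this toy data the regression is confined to one facet, so a single blended score would dilute it, while the per-facet breakdown points directly at the constraint that got worse.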

What failed first

A Qwen3-based reranker trained on human-annotated labels achieved an AUC of only 0.56 on a held-out set, barely above the 0.5 expected from random ranking. A systematic audit found that 19 of 35 manually reviewed cases had incorrect human ratings after cross-functional adjudication, and disagreement exceeded 30% on the boundary cases most critical for ranking model training.
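
For readers less familiar with the metric, the sketch below shows what an AUC figure measures: the probability that a randomly chosen relevant item is scored above a randomly chosen irrelevant one, with ties counted as half. The `roc_auc` helper and the toy labels and scores are illustrative assumptions, not DoorDash's evaluation code; only the 19-of-35 audit count comes from the writeup.

```python
def roc_auc(labels, scores):
    """Pairwise (Mann-Whitney) AUC: the share of positive/negative pairs
    in which the positive item gets the higher score; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data for illustration only.
toy_auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])

# A model that gives every item the same score cannot rank at all.
chance_auc = roc_auc([1, 0], [0.5, 0.5])

# Share of audited cases whose human rating was found incorrect (19 of 35).
audit_error_rate = 19 / 35

print(toy_auc, chance_auc, round(audit_error_rate, 3))
```

An AUC of 0.56 therefore means the reranker ordered a relevant/irrelevant pair correctly only slightly more often than a coin flip, which is consistent with training on labels where more than half of the audited cases turned out to be wrong.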

Results

Time saved: two to five days

AUC of the initial supervised reranker: 0.56

Source

https://careersatdoordash.com/blog/doordash-llm-as-a-judge-evaluating-natural-language-search/


Grounding & classification

Source type: technical build writeup

28 fields verified against source quotes, 2 dropped as unverifiable.

enterprise search, quality inspection, rag, product catalog, failure mode described, human review described, named customer, source backed, tools described, workflow described, ecommerce, accuracy improvement, cycle time reduction, error reduction, technical build writeup, ecommerce ops, quality assurance, human review queue, monitor detect alert