DoorDash builds calibrated LLM-as-a-judge to evaluate natural language search quality at scale
DoorDash's natural language search pipeline accepted vague, intent-based queries with no historical click-through ground truth, which made evaluation with traditional search metrics impossible. Human annotation cycles took 2–5 days and produced inconsistent labels for compositional queries because of rubric ambiguity, and a supervised relevance model trained on those labels performed barely above chance.
DoorDash replaced periodic human annotation with a calibrated LLM judge running continuously in production monitoring and as a PR-level quality gate, enabling per-facet evaluation that surfaced real performance gaps previously invisible in aggregate NDCG scores.
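To illustrate how an aggregate NDCG score can hide a gap that a per-facet breakdown exposes, here is a minimal sketch. The facet names ("cuisine", "dietary"), the graded relevance values, and the helper names are hypothetical and stand in for whatever labels a judge would emit; the source does not describe DoorDash's actual facets or scoring code.

```python
import math
from collections import defaultdict


def dcg(gains):
    """Discounted cumulative gain for relevance grades in ranked order."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(gains):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Hypothetical judged queries: (facet, graded relevance of results in ranked order).
# These values are invented for illustration only.
judged = [
    ("cuisine", [3, 2, 1]),
    ("cuisine", [3, 3, 2]),
    ("cuisine", [2, 2, 1]),
    ("dietary", [0, 1, 3]),  # best result ranked last
]


def mean_ndcg_by_facet(rows):
    """Average NDCG separately for each facet."""
    buckets = defaultdict(list)
    for facet, gains in rows:
        buckets[facet].append(ndcg(gains))
    return {facet: sum(vals) / len(vals) for facet, vals in buckets.items()}


aggregate = sum(ndcg(gains) for _, gains in judged) / len(judged)
per_facet = mean_ndcg_by_facet(judged)

print(f"aggregate NDCG: {aggregate:.3f}")
for facet, score in sorted(per_facet.items()):
    print(f"{facet:>8}: {score:.3f}")
```

In this toy data the aggregate stays high because three well-ranked "cuisine" queries outweigh the single poorly ranked "dietary" query, while the per-facet view shows the weak facet directly.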
A Qwen3-based reranker trained on human-annotated labels achieved an AUC of only 0.56 on a held-out set. In a systematic audit, 19 of 35 manually reviewed cases turned out to have incorrect human ratings after cross-functional adjudication, and disagreement exceeded 30% on the boundary cases most critical for ranking model training.
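For context on these figures, the sketch below computes ROC AUC with the pairwise (Mann-Whitney) definition, where 0.5 corresponds to chance, and works out the audit error rate implied by 19 of 35. The labels and scores are made-up toy values, and `roc_auc` is a hypothetical helper, not DoorDash's evaluation code.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. A value of 0.5 means the scores
    carry no ranking signal; 1.0 means every positive outranks every negative.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (invented values): 1 = relevant, 0 = not relevant.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.6, 0.3]
auc = roc_auc(labels, scores)

# Share of audited cases whose human rating was judged incorrect.
audit_error_rate = 19 / 35

print(f"toy AUC: {auc:.3f}")
print(f"audit error rate: {audit_error_rate:.1%}")
```

Read this way, an AUC of 0.56 means the reranker ordered a random relevant/irrelevant pair correctly only slightly more often than a coin flip, and 19 of 35 corresponds to roughly 54% of the audited labels being wrong.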
https://careersatdoordash.com/blog/doordash-llm-as-a-judge-evaluating-natural-language-search/