DoorDash builds simulation and evaluation flywheel to develop LLM support chatbots at scale

LLMs' non-determinism made safe testing of support chatbot changes impossible: deploying to production risked degrading customer and Dasher experience, while manual testing was too slow and likely to miss problems.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · Identify customer problem

Human review of cases from simulation runs or live user traffic identifies issues to address and real transcripts to seed the simulator.

Tools used

LLMs, S3, gRPC
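The Stage 1 review step can be sketched as a small triage routine. This is a minimal illustration and not DoorDash's implementation: the `ReviewedCase` schema, the `origin` values, the issue labels, and the `triage` function are all hypothetical names introduced here. It assumes that only flagged live-traffic cases become simulator seeds, since the seeds are described as real transcripts.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReviewedCase:
    """One case a human reviewer has looked at (hypothetical schema)."""
    case_id: str
    origin: str                   # "simulation" or "live" (assumed values)
    transcript: list[str]         # conversation turns, oldest first
    issue: Optional[str] = None   # reviewer's label; None if the case looked fine


def triage(cases: list[ReviewedCase]) -> tuple[Counter, list[list[str]]]:
    """Count flagged issues and collect real transcripts to seed the simulator.

    Issues are counted across both simulation and live cases. Only flagged
    live-traffic cases are returned as seeds (an assumption of this sketch).
    """
    issue_counts = Counter(c.issue for c in cases if c.issue is not None)
    seeds = [
        c.transcript
        for c in cases
        if c.origin == "live" and c.issue is not None
    ]
    return issue_counts, seeds
```

Sorting `issue_counts.most_common()` would give reviewers a simple priority list of which problem to address first.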

Outcome

The flywheel reduced hallucinations by 90% in simulation with the improvement carrying over into production, cut each iteration cycle from days to hours, and enabled more than 200 simulated conversations to run in under five minutes.
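Running a few hundred simulated conversations in minutes implies running them concurrently, and a "90% reduction" is a relative change in a measured rate. The sketch below illustrates both ideas under stated assumptions: `simulate_conversation` is a hypothetical stand-in with a toy flagging rule (a real version would call the chatbot and an evaluator), the concurrency limit of 50 is arbitrary, and the rates passed to `relative_reduction` in the usage note are made-up numbers, not DoorDash's measurements.

```python
import asyncio


async def simulate_conversation(seed: str) -> bool:
    """Hypothetical stand-in for one simulated chat.

    Returns True if the conversation was judged to contain a hallucination.
    The sleep is a placeholder for LLM and evaluator latency, and the rule
    below is a toy that flags seeds ending in "7".
    """
    await asyncio.sleep(0.01)
    return seed.endswith("7")


async def run_batch(seeds: list[str], max_concurrency: int = 50) -> float:
    """Run all simulations concurrently and return the hallucination rate."""
    limiter = asyncio.Semaphore(max_concurrency)  # cap in-flight conversations

    async def guarded(seed: str) -> bool:
        async with limiter:
            return await simulate_conversation(seed)

    flags = await asyncio.gather(*(guarded(s) for s in seeds))
    return sum(flags) / len(flags)


def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after`; 0.9 means a 90% reduction."""
    return (before - after) / before
```

For example, `asyncio.run(run_batch([f"case-{i}" for i in range(200)]))` runs 200 toy conversations, and `relative_reduction(0.20, 0.02)` evaluates to approximately 0.9, which is what a drop from an illustrative 20% rate to 2% would mean.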

What failed first

The early LLM implementation suffered from hallucinations because the context window was overwhelmed with raw events and logs, causing the model to misinterpret fields or suggest non-existent policies; iterative attempts at summarization either lost important details or remained too noisy.

Results

Time saved: reduced each iteration cycle from days to hours

Volume: 90% (reduction in hallucinations in simulation)

Source

https://careersatdoordash.com/blog/doordash-simulation-evaluation-flywheel-to-develop-llm-chatbots-at-scale/

Grounding & classification

Source type: technical build writeup

27 fields verified against source quotes.

agentic workflow, chatbot, conversational ai, summarization, chat transcript, builder submitted, failure mode described, human review described, metric backed, named customer, production runtime claimed, tools described, workflow described, logistics, cycle time reduction, error reduction, throughput increase, technical build writeup, customer support, quality assurance, autonomous resolution, human review queue