Dropbox's evaluation-first blueprint for Dash: LLM judges, automated gates, and continuous improvement at scale

Building Dropbox Dash exposed a fundamental challenge: LLM pipelines are probabilistic chains where a single prompt tweak can silently break production quality, and early ad-hoc evaluation gave no reliable way to catch regressions before they reached users.

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · PR triggers automated evaluation
Every pull request kicks off about 150 canonical queries, which are judged automatically, with results returning in under ten minutes.
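A pull-request gate of this kind could be sketched as follows. This is a minimal illustration, not Dropbox's implementation: `EvalCase`, `run_gate`, `answer_fn`, `judge_fn`, and the 0.8 threshold are all hypothetical names and values chosen for the example, and the judge is assumed to return a score between 0 and 1.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    """One canonical query with its reference answer (hypothetical schema)."""
    query: str
    reference: str


def run_gate(
    cases: Sequence[EvalCase],
    answer_fn: Callable[[str], str],
    judge_fn: Callable[[str, str, str], float],
    min_mean_score: float = 0.8,  # assumed threshold, not taken from the source
) -> tuple[bool, float]:
    """Judge every case and report whether the mean score clears the threshold."""
    if not cases:
        raise ValueError("the canonical query set must not be empty")
    # judge_fn(query, reference, answer) stands in for an LLM judge call.
    scores = [judge_fn(c.query, c.reference, answer_fn(c.query)) for c in cases]
    mean_score = mean(scores)
    return mean_score >= min_mean_score, mean_score
```

In a CI job, the boolean would typically be turned into the process exit code, so that a failing gate blocks the merge.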
Tools used
Braintrust, Kubeflow, GitHub Action, Natural Questions, MS MARCO, MuSiQue
Outcome

Dropbox established an evaluation-first engineering culture where automated gates catch regressions at the pull-request level before code can merge, and live production traffic is continuously scored to detect silent degradations.

What failed first

Traditional NLP metrics like BLEU and ROUGE failed to detect hallucinations and missed citations; spreadsheet-based experiment tracking broke down under real experimentation; and unstructured prompt changes caused surprise regressions that slipped into production.
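To see why overlap metrics can miss a hallucination, consider this simplified ROUGE-1-style unigram F1. It is an illustrative stand-in, not the official ROUGE implementation, and the two example sentences are invented for the sketch.

```python
from collections import Counter


def unigram_f1(reference: str, candidate: str) -> float:
    """Token-overlap F1 between two strings (simplified ROUGE-1-style score)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented example: the candidate states the wrong year.
reference = "the report was published in 2021 by the finance team"
hallucinated = "the report was published in 2019 by the finance team"

# Nine of ten tokens match, so the score is roughly 0.9 despite the factual error.
score = unigram_f1(reference, hallucinated)
```

A single wrong token barely moves the score, which is the gap that an LLM judge checking factual agreement is meant to close.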

Results
Time saved: 5s
Volume: 0.85
Source

https://dropbox.tech/machine-learning/practical-blueprint-evaluating-conversational-ai-at-scale-dash

Grounding & classification
Source type: technical build writeup
27 fields verified against source quotes.
conversational ai, knowledge search, rag, summarization, knowledge base, failure mode described, human review described, metric backed, named customer, production runtime claimed, tools described, workflow described, software, cycle time reduction, error reduction, technical build writeup, quality assurance, rag answering