quality_assurance · saas · workflow

Dropbox's evaluation-first blueprint for Dash: LLM judges, automated gates, and continuous improvement at scale

Building Dropbox Dash exposed a fundamental challenge: LLM pipelines are probabilistic chains where a single prompt tweak can silently break production quality, and early ad-hoc evaluation gave no reliable way to catch regressions before they reached users.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear under "Tools used" below.

Stage 1 · PR triggers automated evaluation

Every pull request kicks off a run of about 150 canonical queries, which are judged automatically, with results returning in under ten minutes.
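A pull-request gate of this kind can be sketched as a small script that replays the canonical queries, scores each answer, and fails the check when the mean score drops below a floor. This is a minimal sketch under stated assumptions, not Dropbox's implementation: `CanonicalQuery`, `toy_judge`, `run_gate`, and the 0.8 floor are hypothetical names and values chosen for illustration, and the string-matching judge is a stand-in for an LLM judge.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class CanonicalQuery:
    text: str       # question replayed on every pull request
    reference: str  # fact a correct answer is expected to contain


def toy_judge(query: CanonicalQuery, answer: str) -> float:
    """Stand-in for an LLM judge: 1.0 if the reference fact appears, else 0.0."""
    return 1.0 if query.reference.lower() in answer.lower() else 0.0


def run_gate(
    queries: Sequence[CanonicalQuery],
    answer_fn: Callable[[str], str],
    judge_fn: Callable[[CanonicalQuery, str], float] = toy_judge,
    min_mean_score: float = 0.8,  # hypothetical floor, not a figure from the source
) -> Tuple[bool, float]:
    """Score every canonical query and report whether the gate passes."""
    if not queries:
        raise ValueError("the canonical query set must not be empty")
    scores = [judge_fn(q, answer_fn(q.text)) for q in queries]
    mean_score = mean(scores)
    return mean_score >= min_mean_score, mean_score


if __name__ == "__main__":
    queries = [
        CanonicalQuery("Who owns the roadmap doc?", "Priya"),
        CanonicalQuery("When is the launch?", "March 4"),
    ]
    passed, score = run_gate(queries, lambda q: "Priya owns it; launch is March 4.")
    print(f"gate passed={passed} mean_score={score:.2f}")
    # A CI job would exit non-zero here to block the merge when passed is False.
```

In a CI setup, the script's exit code would decide whether the pull request can merge; the answer function would call the pipeline under test rather than return a fixed string.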

Tools used

Braintrust, Kubeflow, GitHub Action, Natural Questions, MS MARCO, MuSiQue

Outcome

Dropbox established an evaluation-first engineering culture where automated gates catch regressions at the pull-request level before code can merge, and live production traffic is continuously scored to detect silent degradations.

What failed first

Traditional NLP metrics like BLEU and ROUGE failed to detect hallucinations and missed citations; spreadsheet-based experiment tracking broke down under real experimentation; and unstructured prompt changes caused surprise regressions that slipped into production.
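The weakness of overlap metrics can be shown with a toy example. The sketch below computes a simplified ROUGE-1-style unigram F1 (an illustrative reimplementation, not the official ROUGE package) and compares a correct paraphrase against an answer that copies the reference wording but gets the key fact wrong. The function name `unigram_f1` and the example sentences are made up for illustration.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1-style F1: word overlap, ignoring order and meaning."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared word at most as often as it
    # appears in both texts.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "the report is due on friday"
faithful = "friday is the deadline for the report"  # correct, reworded
hallucinated = "the report is due on monday"        # wrong day, same wording

if __name__ == "__main__":
    print(f"faithful:     {unigram_f1(faithful, reference):.3f}")
    print(f"hallucinated: {unigram_f1(hallucinated, reference):.3f}")
```

The answer with the wrong day scores higher than the correct paraphrase because five of its six words match the reference, which is the kind of blind spot that motivates judging answers for factual correctness instead of surface overlap.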

Results

Time saved: 5s

Volume: 0.85

Source

https://dropbox.tech/machine-learning/practical-blueprint-evaluating-conversational-ai-at-scale-dash

Grounding & classification

Source type: technical build writeup

27 fields verified against source quotes.

conversational ai, knowledge search, rag, summarization, knowledge base, failure mode described, human review described, metric backed, named customer, production runtime claimed, tools described, workflow described, software, cycle time reduction, error reduction, technical build writeup, quality assurance, rag answering