
Dropbox Dash scales search relevance labeling with LLM-human hybrid pipeline

Training Dash's search ranking model required high-quality relevance labels at scale, but human labeling was expensive and inconsistent, human labelers could not access sensitive customer data, and manual labeling was impractical at the volumes needed.

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · User submits search query
When a user submits a query, Dash interprets the underlying information need and determines how to retrieve relevant content.
Tools used
Dropbox Dash, LLM, XGBoost, DSPy
Outcome

By combining a small human-labeled reference set with LLM-based evaluation and iterative prompt optimization via DSPy, Dropbox now generates hundreds of thousands to millions of relevance labels to train Dash's ranking model, with measurable MSE improvement over time.
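The calibration idea above can be sketched as scoring each candidate prompt by its mean squared error against the human-labeled reference set and keeping the prompt with the lowest error. This is a minimal illustration, not Dropbox's implementation: the 1-5 relevance scale, the prompt names, and all scores below are hypothetical, and the prompt search that DSPy would perform is replaced by a fixed pair of candidates.

```python
from statistics import fmean


def mse(predicted, reference):
    """Mean squared error between two equal-length lists of scores."""
    if len(predicted) != len(reference):
        raise ValueError("score lists must have the same length")
    if not reference:
        raise ValueError("need at least one labeled example")
    return fmean([(p - r) ** 2 for p, r in zip(predicted, reference)])


# Hypothetical human-labeled reference set: relevance on an assumed 1-5 scale.
human_labels = [5, 1, 3, 4, 2]

# Hypothetical LLM scores for the same (query, document) pairs,
# one list per candidate prompt.
llm_scores_by_prompt = {
    "prompt_v1": [4, 3, 3, 2, 2],
    "prompt_v2": [5, 1, 4, 4, 2],
}

# Lower MSE means the LLM's judgments sit closer to the human labels.
mse_by_prompt = {
    name: mse(scores, human_labels)
    for name, scores in llm_scores_by_prompt.items()
}
best_prompt = min(mse_by_prompt, key=mse_by_prompt.get)

print(mse_by_prompt)
print(best_prompt)
```

Tracking the same metric across prompt revisions is one way to make "MSE improvement over time" concrete: each revision is accepted only if it lowers the error on the fixed reference set.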

What failed first

Neither approach was sufficient alone: human labeling could not scale to the volumes required, and LLMs required careful human calibration before generating reliable relevance judgments. Using LLMs at query time was also infeasible due to latency and context window constraints.
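One way to read the two constraints above is as an offline pipeline with a calibration gate: the LLM judge labels data in bulk only after it agrees closely enough with the human reference set, and the resulting labels feed ranker training rather than query-time scoring. The sketch below assumes that reading. `build_training_labels`, `toy_judge`, the threshold, and the sample data are all hypothetical, and `toy_judge` is a word-overlap stand-in for a real LLM call.

```python
from typing import Callable, Iterable

Pair = tuple[str, str]  # (query, document text)


def build_training_labels(
    judge: Callable[[str, str], float],
    reference: list[tuple[Pair, float]],
    unlabeled: Iterable[Pair],
    max_mse: float,
) -> list[tuple[Pair, float]]:
    """Label `unlabeled` offline, but only if `judge` agrees with humans."""
    errors = [(judge(q, d) - human) ** 2 for (q, d), human in reference]
    calibration_mse = sum(errors) / len(errors)
    if calibration_mse > max_mse:
        raise RuntimeError(
            f"judge not calibrated: MSE {calibration_mse:.2f} > {max_mse}"
        )
    # Calibrated: generate labels in bulk for later ranker training.
    return [((q, d), judge(q, d)) for q, d in unlabeled]


def toy_judge(query: str, document: str) -> float:
    """Stand-in for an LLM call: share of query words found in the document,
    mapped onto an assumed 1-5 relevance scale."""
    words = query.lower().split()
    hits = sum(word in document.lower() for word in words)
    return 1 + 4 * hits / len(words)


doc = "Q3 quarterly report draft"
reference = [
    (("quarterly report", doc), 5.0),  # hypothetical human labels
    (("vacation policy", doc), 1.0),
]
unlabeled = [
    ("report draft", doc),
    ("quarterly budget", doc),
    ("team offsite", doc),
]

labels = build_training_labels(toy_judge, reference, unlabeled, max_mse=0.5)
print(labels)
```

Because the judge runs ahead of time, its latency and context limits affect only the labeling job; the model that serves queries is trained on the stored labels.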

Results
Volume: hundreds of thousands, or even millions, of relevance labels
Source

https://dropbox.tech/machine-learning/llm-human-labeling-improving-search-relevance-dropbox-dash


Grounding & classification
Source type: technical build writeup
20 fields verified against source quotes.
Tags: enterprise search, rag, knowledge base, human review described, named customer, production runtime claimed, tools described, vendor confirmed, workflow described, software, accuracy improvement, throughput increase, technical build writeup, back office ops, human review queue, rag answering