Dropbox Dash scales search relevance labeling with LLM-human hybrid pipeline

Training Dash's search ranking model required high-quality relevance labels at scale, but human labeling was expensive and inconsistent, human labelers could not access sensitive customer data, and manual labeling was impractical at the volumes needed.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear under "Tools used" below.

Stage 1 · User submits search query

When a user submits a query, Dash interprets the underlying information need and determines how to retrieve relevant content.

Tools used

Dropbox Dash, LLM, XGBoost, DSPy

Outcome

By combining a small human-labeled reference set with LLM-based evaluation and iterative prompt optimization via DSPy, Dropbox now generates hundreds of thousands to millions of relevance labels to train Dash's ranking model, with measurable MSE improvement over time.
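
The calibration loop described above can be sketched as follows. This is a minimal illustration, not Dropbox's implementation and not the DSPy API: it scores candidate prompts by mean squared error (MSE) against a small human-labeled reference set and keeps the prompt with the lowest error. The 1-5 relevance scale, the sample data, the prompt names, and `stub_judge` (a word-overlap stand-in for a real LLM call) are all assumptions made for the example.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

# Hypothetical reference set: (query, document, human relevance score on an assumed 1-5 scale).
REFERENCE_SET: List[Tuple[str, str, int]] = [
    ("q3 budget", "Q3 budget planning spreadsheet", 5),
    ("q3 budget", "Team offsite photos", 1),
    ("onboarding guide", "New hire onboarding checklist", 4),
    ("onboarding guide", "Q3 budget planning spreadsheet", 1),
]

# A judge takes (prompt, query, document) and returns a relevance score.
Judge = Callable[[str, str, str], float]


def mean_squared_error(predicted: List[float], actual: List[float]) -> float:
    """Average squared gap between judge scores and human scores."""
    return mean((p - a) ** 2 for p, a in zip(predicted, actual))


def score_prompt(prompt: str, judge: Judge,
                 reference: List[Tuple[str, str, int]]) -> float:
    """MSE of one prompt's judgments against the human reference labels."""
    predicted = [judge(prompt, query, doc) for query, doc, _ in reference]
    actual = [float(human) for _, _, human in reference]
    return mean_squared_error(predicted, actual)


def pick_best_prompt(prompts: List[str], judge: Judge,
                     reference: List[Tuple[str, str, int]]) -> Tuple[str, Dict[str, float]]:
    """Return the prompt with the lowest MSE, plus every prompt's score."""
    scores = {p: score_prompt(p, judge, reference) for p in prompts}
    best = min(scores, key=scores.get)
    return best, scores


def stub_judge(prompt: str, query: str, document: str) -> float:
    """Stand-in for an LLM call (assumption): a word-overlap heuristic.

    The hypothetical "strict" prompt maps overlap to a score; any other
    prompt returns a constant middle score.
    """
    overlap = len(set(query.lower().split()) & set(document.lower().split()))
    if "strict" in prompt:
        return 1.0 + 2.0 * overlap
    return 3.0


best, scores = pick_best_prompt(["baseline rubric", "strict rubric"],
                                stub_judge, REFERENCE_SET)
print(best, scores)
```

In a real pipeline the stub would be replaced by an actual model call, and a prompt optimizer would propose the candidate prompts; the selected prompt would then label the much larger unlabeled set offline.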

What failed first

Neither approach was sufficient alone: human labeling could not scale to the volumes required, and LLMs required careful human calibration before generating reliable relevance judgments. Using LLMs at query time was also infeasible due to latency and context window constraints.

Results

Volume: hundreds of thousands, or even millions, of relevance labels

Source

https://dropbox.tech/machine-learning/llm-human-labeling-improving-search-relevance-dropbox-dash

Grounding & classification

Source type: technical build writeup

20 fields verified against source quotes.

enterprise search, rag, knowledge base, human review described, named customer, production runtime claimed, tools described, vendor confirmed, workflow described, software, accuracy improvement, throughput increase, technical build writeup, back office ops, human review queue, rag answering