Dropbox uses DSPy to optimize Dash relevance judge: 45% NMSE reduction and 97% fewer malformed outputs

Dropbox's relevance judge for Dash ran on an expensive state-of-the-art model that could not be scaled cost-effectively. Manually tuned prompts did not transfer cleanly to cheaper models: quality dropped, and each model swap required weeks of manual iteration.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · Query-document pair submitted

The relevance judge is triggered by query-document pairs that need a relevance score from 1 to 5.
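As a rough illustration of this stage, the sketch below models one query-document pair and checks that a returned score falls on the 1 to 5 scale. The names `JudgeRequest` and `validate_score` are hypothetical and do not come from the Dropbox writeup; this is a minimal sketch of the input and output contract, not the actual implementation.

```python
from dataclasses import dataclass

# The 1-to-5 scale comes from the stage description; the names are assumptions.
MIN_SCORE, MAX_SCORE = 1, 5


@dataclass(frozen=True)
class JudgeRequest:
    """One query-document pair awaiting a relevance score (hypothetical type)."""
    query: str
    document: str


def validate_score(raw: object) -> int:
    """Return the score as an int if it is a whole number in 1..5, else raise ValueError."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(raw, bool) or not isinstance(raw, (int, str)):
        raise ValueError(f"score must be an int or numeric string, got {raw!r}")
    try:
        score = int(raw)
    except ValueError as exc:
        raise ValueError(f"score is not a whole number: {raw!r}") from exc
    if not MIN_SCORE <= score <= MAX_SCORE:
        raise ValueError(f"score {score} is outside {MIN_SCORE}..{MAX_SCORE}")
    return score


request = JudgeRequest(query="quarterly planning doc", document="Q3 planning notes")
score = validate_score("4")  # a judge might return the score as text
```

Rejecting out-of-range or non-numeric values at this boundary keeps bad judge outputs from silently entering downstream metrics.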

Tools used

DSPy, GEPA, MIPROv2, gpt-oss-120b, gemma-3-12b, o3, Dropbox Dash

Outcome

DSPy-optimized prompts reduced normalized mean squared error (NMSE) by 45 percent (from 8.83 to 4.86) when adapting to gpt-oss-120b and cut model adaptation time from one to two weeks to one to two days. They also enabled labeling 10 to 100 times more data at the same cost and reduced malformed JSON outputs by more than 97 percent when adapting to gemma-3-12b.
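The 45 percent figure follows directly from the two NMSE values quoted above. The short check below recomputes it; the helper name `relative_reduction` is only illustrative.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after`, e.g. 0.25 for a 25 percent reduction."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# NMSE values quoted in the outcome: 8.83 before optimization, 4.86 after.
nmse_reduction = relative_reduction(8.83, 4.86)
print(f"{nmse_reduction:.1%}")  # prints 45.0%
```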

What failed first

Manual prompt engineering for the relevance judge plateaued in quality and broke when transferring prompts between models. With gemma-3-12b, more than 40 percent of responses were malformed JSON, making the baseline operationally unusable.
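One way to quantify this failure mode is to parse every judge response and count the ones that cannot be used. The sketch below assumes, purely for illustration, that a well-formed response is a JSON object with an integer `score` field from 1 to 5; the field name, the helper names, and the sample responses are all made up and are not taken from the Dropbox writeup.

```python
import json
from typing import Optional, Sequence


def parse_judgment(text: str) -> Optional[int]:
    """Return the score if `text` is a JSON object with an integer "score" in 1..5, else None."""
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return None  # not valid JSON at all
    if not isinstance(payload, dict):
        return None  # valid JSON, but not an object
    score = payload.get("score")  # "score" is an assumed field name
    if isinstance(score, bool) or not isinstance(score, int):
        return None
    return score if 1 <= score <= 5 else None


def malformed_rate(responses: Sequence[str]) -> float:
    """Fraction of responses that fail to parse into a usable judgment."""
    if not responses:
        return 0.0
    bad = sum(1 for r in responses if parse_judgment(r) is None)
    return bad / len(responses)


# Made-up responses for illustration only.
sample = [
    '{"score": 4}',
    "Sure! Here is the score: 4",  # prose instead of JSON
    '{"score": 9}',                # out of range
    '{"score": 2}',
    '{"score": ',                  # truncated JSON
]
print(malformed_rate(sample))  # prints 0.6
```

Tracking this rate alongside NMSE makes it visible when a prompt that scores well on one model becomes operationally unusable on another.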

Results

Time saved: model adaptation cut from one to two weeks to one to two days

NMSE reduction: 45 percent

Source

https://dropbox.tech/machine-learning/optimizing-dropbox-dash-relevance-judge-with-dspy

Grounding & classification

Source type: technical build writeup

32 fields verified against source quotes, 1 dropped as unverifiable.

enterprise search, quality inspection, knowledge base, failure mode described, human review described, metric backed, named customer, tools described, workflow described, software, accuracy improvement, cost reduction, cycle time reduction, error reduction, technical build writeup, quality assurance