Dropbox uses DSPy to optimize Dash relevance judge: 45% NMSE reduction and 97% fewer malformed outputs
Dropbox's relevance judge for Dash was built on an expensive state-of-the-art model that could not be scaled cost-effectively. Manually tuned prompts did not transfer cleanly to cheaper models: quality dropped, and each model swap required weeks of manual iteration.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Query-document pair submitted
The workflow starts when a query-document pair is submitted to the relevance judge, which must assign it a relevance score from 1 to 5.
DSPy-optimized prompts reduced NMSE by 45 percent (from 8.83 to 4.86) when the judge was adapted to gpt-oss-120b. The optimized prompts also cut model adaptation time from one to two weeks to one to two days and made it possible to label 10 to 100 times more data at the same cost. When the judge was adapted to gemma-3-12b, malformed JSON outputs fell by more than 97 percent.
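The 45 percent figure follows directly from the two reported NMSE values. As a quick check, the arithmetic can be sketched as below; the helper name `relative_reduction` is hypothetical and is not taken from Dropbox's code, and the sketch does not attempt to reproduce how NMSE itself is computed.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after` (0.45 means 45 percent lower)."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# NMSE values reported for the gpt-oss-120b adaptation.
nmse_reduction = relative_reduction(8.83, 4.86)
print(f"NMSE reduction: {nmse_reduction:.1%}")  # prints "NMSE reduction: 45.0%"
```

The same helper applies to the malformed-output claim: a drop of more than 97 percent from a baseline above 40 percent would leave a malformed rate in the low single digits, although the exact post-optimization rate is not stated here.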
What failed first
Manual prompt engineering for the relevance judge plateaued in quality and broke when transferring prompts between models. With gemma-3-12b, more than 40 percent of responses were malformed JSON, making the baseline operationally unusable.
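One way to quantify the malformed-output problem is to parse each judge response and count the ones that fail. The sketch below is a minimal illustration under stated assumptions, not Dropbox's implementation: the response schema (a JSON object with an integer `score` field) and the function names are hypothetical, and treating an out-of-range score as malformed is a choice made here for illustration.

```python
import json
from typing import Optional


def parse_score(raw: str) -> Optional[int]:
    """Return the 1-5 score from a judge response, or None if it is unusable."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not valid JSON at all
    if not isinstance(payload, dict):
        return None  # valid JSON, but not an object
    score = payload.get("score")  # "score" is an assumed field name
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        return None
    return score if 1 <= score <= 5 else None


def malformed_rate(responses: list[str]) -> float:
    """Fraction of responses that do not yield a usable 1-5 score."""
    if not responses:
        return 0.0
    bad = sum(parse_score(r) is None for r in responses)
    return bad / len(responses)


# Made-up sample responses, for illustration only.
samples = ['{"score": 4}', '{"score": 9}', "Score: 3", '{"score": "2"}', '{"score": 1}']
rate = malformed_rate(samples)
print(f"malformed: {rate:.0%}")
```

Tracking this rate per model makes it possible to tell when a prompt that worked on one model has become operationally unusable on another, before comparing score quality at all.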