quality_assurance · saas · workflow

Pinterest Search scales relevance evaluation with fine-tuned LLMs, reducing minimum detectable effects by an order of magnitude

Pinterest Search's relevance measurement relied on costly, slow human annotations that constrained sample sizes, making it impossible to detect heterogeneous treatment effects or small topline metric changes in A/B experiments.

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack.
Stage 1 · Fine-tune LLM on human labels
Open-source LLMs are fine-tuned on human-annotated relevance labels to predict a 5-level query–Pin relevance score.
Tools used
XLM-RoBERTa-large, BLIP, DistilBERT, Llama-3-8B, multilingual BERT-base, T5-base, mDeBERTa-V3-base
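To make the 5-level prediction concrete, here is a minimal sketch of how a fine-tuned model's raw outputs could be turned into a relevance label. It assumes the model emits one logit per relevance level and that levels are encoded as the integers 1 to 5; both are illustrative assumptions, as are the names `LEVELS` and `score_pair`, none of which come from the source.

```python
import math

# Assumed numeric encoding of the five relevance levels (hypothetical).
LEVELS = (1, 2, 3, 4, 5)


def softmax(logits):
    """Convert raw logits to probabilities, shifting by the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_pair(logits):
    """Return (most likely level, probability-weighted expected level)
    for one query-Pin pair, given one logit per relevance level."""
    if len(logits) != len(LEVELS):
        raise ValueError("expected one logit per relevance level")
    probs = softmax(logits)
    best_index = max(range(len(LEVELS)), key=probs.__getitem__)
    expected = sum(level * p for level, p in zip(LEVELS, probs))
    return LEVELS[best_index], expected


# Hypothetical logits for a single query-Pin pair.
label, expected = score_pair([0.1, 0.2, 0.3, 2.0, 0.5])
```

The hard label suits agreement checks against human annotations, while the expected level gives a smoother per-pair score that can be averaged across a sample.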
Outcome

Fine-tuned LLMs replaced costly human labeling at scale, reducing minimum detectable effects (MDEs) to ≤ 0.25%, an order-of-magnitude reduction. 150,000 rows can be labeled within 30 minutes on a single GPU, which significantly cuts annotation costs and turnaround time.
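The connection between larger labeled samples and smaller MDEs can be sketched with the textbook two-sample formula, in which the MDE shrinks with the square root of the sample size. This is a generic illustration, not the experiment design described in the source: the function name `mde`, the equal-sized arms, the default significance level and power, and the example sample sizes are all assumptions.

```python
import math
from statistics import NormalDist


def mde(sigma, n_per_arm, alpha=0.05, power=0.8):
    """Minimum detectable effect for a two-sided, two-sample z-test with
    equal arm sizes and a common standard deviation (textbook formula)."""
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_power = standard_normal.inv_cdf(power)
    return (z_alpha + z_power) * sigma * math.sqrt(2.0 / n_per_arm)


# Illustrative sample sizes only: 100x more labels per arm
# cuts the MDE by a factor of 10.
small_sample = mde(sigma=1.0, n_per_arm=1_000)
large_sample = mde(sigma=1.0, n_per_arm=100_000)
ratio = small_sample / large_sample
```

Under this formula an order-of-magnitude drop in MDE needs roughly 100 times as many labeled rows, which is why cheap, fast labeling matters more than any single accuracy gain.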

Results
Volume: 1.3%-1.5%
Cost replaced: significantly reduces labeling costs
Source

https://medium.com/pinterest-engineering/llm-powered-relevance-assessment-for-pinterest-search-b846489e358d

Grounding & classification
Source type: technical build writeup
35 fields verified against source quotes.
document classification, quality inspection, product catalog, builder submitted, human review described, metric backed, named customer, production runtime claimed, source backed, tools described, workflow described, media, software, accuracy improvement, cost reduction, cycle time reduction, employee productivity, technical build writeup, quality assurance, data sync enrichment, extract classify route