
Pinterest Search scales relevance evaluation with fine-tuned LLMs, reducing minimum detectable effects by an order of magnitude

Pinterest Search's relevance measurement relied on costly, slow human annotations that constrained sample sizes, making it impossible to detect heterogeneous treatment effects or small topline metric changes in A/B experiments.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear under “Tools used” below.

Stage 1 · Fine-tune LLM on human labels

Open-source LLMs are fine-tuned on human-annotated relevance labels to predict a 5-level query–Pin relevance score.
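As a rough illustration of what predicting a 5-level score can look like at inference time, the sketch below turns five raw model outputs (logits) into a predicted level and an expected score. It assumes the fine-tuned model ends in a 5-way classification head with levels numbered 1 to 5; the function names and the example logits are hypothetical and are not taken from the source.

```python
import math
from typing import List, Tuple

NUM_LEVELS = 5  # the source describes a 5-level relevance scale


def softmax(logits: List[float]) -> List[float]:
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_relevance(logits: List[float]) -> Tuple[int, float]:
    """Return (most likely level, expected score), both on a 1..5 scale.

    Assumption: one logit per relevance level, ordered from level 1 to level 5.
    """
    if len(logits) != NUM_LEVELS:
        raise ValueError(f"expected {NUM_LEVELS} logits, got {len(logits)}")
    probs = softmax(logits)
    level = max(range(NUM_LEVELS), key=probs.__getitem__) + 1
    expected = sum((i + 1) * p for i, p in enumerate(probs))
    return level, expected


# Hypothetical logits for a single query-Pin pair.
level, expected = predict_relevance([0.1, 0.2, 0.3, 2.5, 0.4])
print(level, round(expected, 2))
```

The expected score is one possible way to get a continuous value out of a 5-way classifier; whether to aggregate on the hard label or on a soft score is a design choice this sketch leaves open.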

Tools used

XLM-RoBERTa-large, BLIP, DistilBERT, Llama-3-8B, multilingual BERT-base, T5-base, mDeBERTa-V3-base

Outcome

Fine-tuned LLMs replaced costly human labeling at scale, reducing minimum detectable effects (MDEs) to ≤ 0.25%, an order-of-magnitude reduction. A single GPU can label 150,000 rows within 30 minutes, which significantly cuts annotation cost and turnaround time.
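To see why cheaper labels translate into smaller MDEs, the sketch below uses the standard two-sample z-test approximation, in which the MDE shrinks with the square root of the per-arm sample size. This is a textbook formula and not necessarily the calculation Pinterest used; the standard deviation and sample sizes are hypothetical placeholders. Only the 150,000 rows and 30 minutes come from the outcome above.

```python
import math
from statistics import NormalDist


def minimum_detectable_effect(
    std_dev: float, n_per_arm: int, alpha: float = 0.05, power: float = 0.8
) -> float:
    """Approximate MDE for a two-sided, two-sample z-test with equal arm sizes."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    return (z_alpha + z_beta) * math.sqrt(2 * std_dev**2 / n_per_arm)


# Hypothetical numbers: the same metric variance, with 100x more labeled rows.
mde_small = minimum_detectable_effect(std_dev=1.0, n_per_arm=1_000)
mde_large = minimum_detectable_effect(std_dev=1.0, n_per_arm=100_000)
ratio = mde_small / mde_large  # 100x the samples shrinks the MDE about 10x

# Throughput implied by the reported figures: 150,000 rows in 30 minutes.
rows_per_second = 150_000 / (30 * 60)

print(f"MDE ratio: {ratio:.1f}, throughput: {rows_per_second:.1f} rows/s")
```

Under this approximation, an order-of-magnitude drop in MDE requires roughly a hundredfold increase in labeled samples, which is why labeling cost and turnaround time are the binding constraints.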

Results

Volume: 1.3%-1.5%

Cost replaced: significantly reduces labeling costs

Source

https://medium.com/pinterest-engineering/llm-powered-relevance-assessment-for-pinterest-search-b846489e358d


Grounding & classification

Source type: technical build writeup

35 fields verified against source quotes.

Tags: document classification, quality inspection, product catalog, builder submitted, human review described, metric backed, named customer, production runtime claimed, source backed, tools described, workflow described, media, software, accuracy improvement, cost reduction, cycle time reduction, employee productivity, technical build writeup, quality assurance, data sync enrichment, extract classify route