Faire fine-tunes Llama3-8b to scale semantic search relevance measurement to 70M predictions per day

At Faire, evaluating search relevance was manual, expensive, and slow. It was limited to monthly human-labeled snapshots, which made relevance signals hard to scale and act on as personalization made the search ecosystem more complex.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · Human-labeled ground truth

A data annotation vendor labels a sample of query-product pairs monthly to produce ground truth relevance labels.

Tools used

Llama2-7b, Llama2-13b, GPT, DeepSpeed, A100 GPU

Outcome

The fine-tuned Llama3-8b model achieves a 28% improvement in Krippendorff's Alpha over the existing GPT production model. It enables 70 million relevance predictions per day on 16 GPUs, making relevance a measurable and actionable dimension across all retailer search sessions.
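Krippendorff's Alpha, the agreement metric cited in the outcome, can be computed from labels that two or more raters assign to the same items. The following is a minimal sketch, assuming nominal (unordered) relevance categories and treating the human annotator and the model as two raters. The function name `krippendorff_alpha_nominal` and the label values are hypothetical and are not taken from Faire's implementation, which may use a different distance metric.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's Alpha for nominal labels.

    units: list of lists; each inner list holds the labels that the
    raters gave to one item (here, one query-product pair).
    """
    # Coincidence matrix: every ordered pair of labels from two
    # different raters on the same item, weighted by 1 / (m - 1).
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single label carries no agreement information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label value.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more labels")

    # Observed and expected disagreement (the common 1/n factor cancels).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label value occurs")
    return 1.0 - observed / expected


# Hypothetical example: human label vs. model prediction for four pairs.
pairs = [
    ["relevant", "relevant"],
    ["relevant", "irrelevant"],
    ["irrelevant", "irrelevant"],
    ["irrelevant", "relevant"],
]
alpha = krippendorff_alpha_nominal(pairs)
```

A value of 1 means perfect agreement and a value near 0 means agreement no better than chance, so a relative improvement in Alpha indicates that model predictions track the human ground truth more closely.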

What failed first

Prompt engineering alone could not capture Faire's definition of semantic search relevance, and the fine-tuned GPT solution was increasingly constrained by external API costs, limiting throughput for labeling.

Results

Time saved: ~300k query-product pairs per hour

Volume: 70 million predictions per day
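To put the daily volume in per-GPU terms, the sketch below divides the reported 70 million predictions per day across the 16 GPUs mentioned in the outcome. It assumes the load is spread evenly and that the GPUs run around the clock, neither of which the source states; the function name `per_gpu_rates` is hypothetical. The ~300k pairs per hour figure is reported separately, and its hardware basis is not given here, so it is not used in this calculation.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_gpu_rates(predictions_per_day, gpu_count):
    """Average per-GPU throughput, assuming even load and 24-hour operation."""
    per_gpu_day = predictions_per_day / gpu_count
    return {
        "per_gpu_per_day": per_gpu_day,
        "per_gpu_per_hour": per_gpu_day / 24,
        "per_gpu_per_second": per_gpu_day / SECONDS_PER_DAY,
    }


# Figures from the case study: 70 million predictions per day on 16 GPUs.
rates = per_gpu_rates(70_000_000, 16)
for name, value in rates.items():
    print(f"{name}: {value:,.1f}")
```

Under these assumptions each GPU would average about 4.4 million predictions per day, or roughly 50 per second.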

Source

https://craft.faire.com/fine-tuning-llama3-to-measure-semantic-relevance-in-search-86a7b13c24ea

Grounding & classification

Source type: technical build writeup

26 fields verified against source quotes, 6 dropped as unverifiable.

data extraction, document classification, product catalog, failure mode described, human review described, named customer, workflow described, ecommerce, accuracy improvement, cost reduction, throughput increase, technical build writeup, ecommerce ops, quality assurance, extract classify route