
Zalando leverages multimodal LLMs as judges for large-scale product retrieval evaluation

Evaluating product search relevance at scale is essential for e-commerce platforms but traditionally relies on human relevance assessments that require substantial time and resources, making large-scale multilingual evaluation impractical.

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear under "Tools used" below.
Stage 1 · Search log query extraction
Query-product pairs are extracted from search logs for evaluation.
Tools used
GPT-4o, GPT-4 Turbo, GPT-3.5 Turbo
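
As a rough illustration of Stage 1, the sketch below collects query-product pairs from search log records. The log format is an assumption made for this example (JSON lines with hypothetical `query` and `product_ids` fields); the source does not describe Zalando's actual log schema or extraction code.

```python
import json
from collections import Counter


def extract_query_product_pairs(log_lines):
    """Count (query, product_id) pairs across JSON-lines search log records.

    Assumes each record has a "query" string and a "product_ids" list;
    both field names are hypothetical.
    """
    pairs = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Normalize case and whitespace so identical queries are grouped.
        query = record["query"].strip().lower()
        for product_id in record.get("product_ids", []):
            pairs[(query, product_id)] += 1
    return pairs


# Illustrative records, not real log data.
sample_log = [
    '{"query": "Red Sneakers", "product_ids": ["P1", "P2"]}',
    '{"query": "red sneakers ", "product_ids": ["P1"]}',
    '',
    '{"query": "rain jacket", "product_ids": ["P9"]}',
]
pairs = extract_query_product_pairs(sample_log)
```

Keeping the counts lets a later step sample pairs by how often they occur, though whether the deployed framework samples this way is not stated in the source.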
Outcome

The deployed framework achieves relevance assessment quality comparable to human annotations at up to 1,000 times lower cost, evaluating 20,000 query-product pairs in around 20 minutes, and enables continuous production monitoring at Zalando.
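
To put the reported throughput in perspective, the short calculation below converts the stated figures (20,000 pairs in around 20 minutes) into per-minute and per-second rates. Because the source gives the duration only approximately, the results are rough estimates.

```python
# Figures as reported in the source; the duration is approximate.
pairs_evaluated = 20_000
minutes = 20

pairs_per_minute = pairs_evaluated / minutes  # 1000.0
pairs_per_second = pairs_per_minute / 60      # roughly 16.7

print(f"{pairs_per_minute:.0f} pairs/minute, about {pairs_per_second:.1f} pairs/second")
```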

Results
Time saved: around 20 minutes
Volume: 50%
Cost replaced: up to 1,000 times cheaper than human labor
Source

https://engineering.zalando.com/posts/2024/11/llm-as-a-judge-relevance-assessment-paper-announcement.html


Grounding & classification
Source type: technical build writeup
30 fields verified against source quotes.
computer vision, content generation, document classification, quality inspection, product catalog, human review described, metric backed, named customer, production runtime claimed, source backed, tools described, workflow described, ecommerce, accuracy improvement, cost reduction, cycle time reduction, employee productivity, technical build writeup, ecommerce ops, quality assurance, extract classify route, monitor detect alert