
Zalando leverages multimodal LLMs as a judge for large-scale product retrieval evaluation

Evaluating product search relevance at scale is essential for e-commerce platforms but traditionally relies on human relevance assessments that require substantial time and resources, making large-scale multilingual evaluation impractical.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · Search log query extraction

Query-product pairs are extracted from search logs for evaluation.
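A minimal sketch of this extraction step, under stated assumptions: the log record shape (`query` and `result_ids` keys), the `top_k` cutoff, and the lowercase normalization are all hypothetical choices for illustration, not details taken from the Zalando writeup.

```python
from typing import Iterable, NamedTuple


class QueryProductPair(NamedTuple):
    query: str
    product_id: str


def extract_pairs(log_records: Iterable[dict], top_k: int = 3) -> list[QueryProductPair]:
    """Collect unique (query, product) pairs from the top_k results of each logged search.

    Assumed record shape (hypothetical): {"query": str, "result_ids": list[str]}.
    """
    seen: set[QueryProductPair] = set()
    pairs: list[QueryProductPair] = []
    for record in log_records:
        # Normalize the query so trivially different spellings map to one pair.
        query = record["query"].strip().lower()
        if not query:
            continue  # skip empty searches
        for product_id in record["result_ids"][:top_k]:
            pair = QueryProductPair(query, product_id)
            if pair not in seen:
                seen.add(pair)
                pairs.append(pair)
    return pairs


# Hypothetical sample data for illustration only.
sample_logs = [
    {"query": "Red Dress", "result_ids": ["p1", "p2", "p3", "p4"]},
    {"query": "red dress ", "result_ids": ["p2", "p5"]},
    {"query": "  ", "result_ids": ["p9"]},
]
pairs = extract_pairs(sample_logs, top_k=2)
```

Deduplicating before evaluation keeps the judge from being billed twice for the same pair; a real pipeline would likely also sample by query frequency and language.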

Tools used

GPT-4o, GPT-4 Turbo, GPT-3.5 Turbo

Outcome

The deployed framework achieves relevance assessment quality comparable to human annotations at up to 1,000 times lower cost, evaluating 20,000 query-product pairs in around 20 minutes, and enables continuous production monitoring at Zalando.
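As a rough check on the reported figures, the throughput implied by 20,000 pairs in around 20 minutes can be worked out directly. The function name below is hypothetical, and the result is only as precise as the "around 20 minutes" figure it comes from.

```python
def pairs_per_second(pairs_evaluated: int, minutes: float) -> float:
    """Average evaluation throughput implied by a batch size and wall-clock time."""
    return pairs_evaluated / (minutes * 60)


# Figures quoted in the outcome: 20,000 query-product pairs in around 20 minutes.
throughput = pairs_per_second(20_000, 20)
print(f"{throughput:.1f} pairs per second")
```

This works out to roughly 17 judged pairs per second on average.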

Results

Evaluation time: around 20 minutes for 20,000 query-product pairs

Volume: 50%

Cost replaced: up to 1,000 times cheaper than human labor

Source

https://engineering.zalando.com/posts/2024/11/llm-as-a-judge-relevance-assessment-paper-announcement.html


Grounding & classification

Source type: technical build writeup

30 fields verified against source quotes.

computer vision, content generation, document classification, quality inspection, product catalog, human review described, metric backed, named customer, production runtime claimed, source backed, tools described, workflow described, ecommerce, accuracy improvement, cost reduction, cycle time reduction, employee productivity, technical build writeup, ecommerce ops, quality assurance, extract classify route, monitor detect alert