Zalando builds LLM-as-a-judge search quality assurance framework for multi-market launches
Zalando's pre-launch search quality assurance relied entirely on human experts manually sampling and translating queries, annotating errors, and diagnosing root causes. The process was not scalable and was reactive by nature — issues were only caught after launch when real-user signals such as CTR existed. For entirely new markets, those signals did not exist at all.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Historical query clustering by NER
Past search queries from existing markets are processed by a Named Entity Recognition (NER) engine to extract attributes and cluster queries by semantic intent.
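A minimal sketch of the clustering idea, assuming a toy dictionary lookup in place of a real NER engine. The `LEXICON` table and the `extract_entities` and `cluster_queries` helpers are hypothetical names introduced for illustration; they are not taken from Zalando's implementation. Queries that resolve to the same set of extracted attributes are grouped into one segment, which approximates clustering by intent.

```python
from collections import defaultdict

# Hypothetical attribute lexicon standing in for a real NER engine.
# Maps a surface token to an (attribute, normalized value) pair.
LEXICON = {
    "red": ("color", "red"),
    "black": ("color", "black"),
    "nike": ("brand", "nike"),
    "adidas": ("brand", "adidas"),
    "sneakers": ("category", "sneakers"),
    "trainers": ("category", "sneakers"),  # synonym normalized to one value
    "dress": ("category", "dress"),
}


def extract_entities(query):
    """Return a sorted tuple of (attribute, value) pairs found in the query."""
    found = {LEXICON[tok] for tok in query.lower().split() if tok in LEXICON}
    return tuple(sorted(found))


def cluster_queries(queries):
    """Group queries that share the same extracted entity signature."""
    clusters = defaultdict(list)
    for q in queries:
        clusters[extract_entities(q)].append(q)
    return dict(clusters)


queries = [
    "red nike sneakers",
    "Nike trainers red",
    "black dress",
    "dress black",
    "adidas sneakers",
]
clusters = cluster_queries(queries)
for signature, members in clusters.items():
    print(signature, members)
```

Using the entity signature as the cluster key makes word order and synonyms irrelevant, so "red nike sneakers" and "Nike trainers red" land in the same segment. A production system would replace the lexicon lookup with the NER engine's output.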
Tools used
GPT-4o, Apache Airflow, Kubernetes, Nakadi, Elasticache, NER
Outcome
The LLM-as-a-judge evaluation framework identified multiple NER and search quality issues in Portuguese and Greek markets before go-live, enabling engineers to fix them pre-launch. A full run covers 1,500 search segments with 25 results each, completes in 3-5 hours, and costs around 250 USD — compared to days of human evaluation.
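The reported figures can be broken down into per-unit numbers with some quick arithmetic. This sketch assumes one LLM judgment per query-result pair, which the source does not state explicitly; the variable names are illustrative only.

```python
# Figures reported for a full run.
SEGMENTS = 1_500
RESULTS_PER_SEGMENT = 25
RUN_COST_USD = 250.0          # "around 250 USD"
RUN_HOURS = (3, 5)            # reported range for a full run

# Assumption: each (segment, result) pair is judged once.
judgments = SEGMENTS * RESULTS_PER_SEGMENT

cost_per_judgment = RUN_COST_USD / judgments
cost_per_segment = RUN_COST_USD / SEGMENTS

# Throughput range: slowest run (5 h) to fastest run (3 h).
judgments_per_hour = tuple(judgments / h for h in reversed(RUN_HOURS))

print(f"judgments per run:  {judgments}")
print(f"cost per judgment:  {cost_per_judgment:.4f} USD")
print(f"cost per segment:   {cost_per_segment:.3f} USD")
print(f"judgments per hour: {judgments_per_hour[0]:.0f} to {judgments_per_hour[1]:.0f}")
```

Under that assumption, a run amounts to 37,500 judgments at well under one cent each.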
Results
Run time: 3-5 hours per full run (versus days of human evaluation)
Volume: 1,500 search segments, 25 results each
Cost: around 250 USD per full run
Running since: 2025
Grounding & classification
Source type: technical build writeup
32 fields verified against source quotes.
data extraction, quality inspection, translation, product catalog, failure mode described, metric backed, named customer, production runtime claimed, tools described, workflow described, ecommerce, accuracy improvement, cost reduction, error reduction, time saved, technical build writeup, ecommerce ops, quality assurance, extract classify route, monitor detect alert