quality_assurance · travel · workflow
LLM Evaluation: Practical Tips at Booking.com — Lessons from One Year of Judge-LLM Development
Evaluating LLM-powered applications is difficult for three reasons: generative tasks often lack a single ground truth, human expert review is too slow and expensive to scale, and the LLMs themselves can hallucinate or fail to follow instructions.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Define evaluation metric
The metric is defined together with the business owner, as unambiguously as possible.
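One way to keep a metric definition unambiguous is to write it down as structured data that both the business owner and the judge LLM read. The sketch below is illustrative only: `MetricDefinition`, its fields, and the prompt layout are hypothetical names chosen for this example, not the structure used in the source writeup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDefinition:
    """A written-down evaluation metric (all fields are hypothetical)."""
    name: str
    question: str                       # what the judge must decide
    labels: tuple[str, ...]             # the only verdicts the judge may return
    criteria: tuple[str, ...] = ()      # explicit rules agreed with the owner

    def to_judge_prompt(self, user_input: str, model_output: str) -> str:
        """Render the definition into instructions for a judge LLM."""
        rules = "\n".join(f"- {c}" for c in self.criteria)
        return (
            f"Metric: {self.name}\n"
            f"Question: {self.question}\n"
            f"Criteria:\n{rules}\n"
            f"Answer with exactly one of: {', '.join(self.labels)}\n\n"
            f"User input:\n{user_input}\n\n"
            f"Model output:\n{model_output}\n"
        )

    def parse_verdict(self, raw: str) -> str:
        """Map the judge's raw reply onto an allowed label, or raise."""
        cleaned = raw.strip().lower()
        for label in self.labels:
            if cleaned == label.lower():
                return label
        raise ValueError(f"unexpected verdict: {raw!r}")
```

Restricting the judge to a closed set of labels and rejecting anything else makes instruction-following failures by the judge visible instead of silently counted.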
Tools used
GPT-4.1, Claude 4.0 Sonnet, DeepEval's G-Eval, Arize Phoenix
Outcome
The LLM-as-judge framework enables continuous, scalable monitoring of GenAI application performance in production with minimal human involvement, and an automated prompt engineering pipeline further reduces manual development effort.
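The monitoring idea can be sketched as a small loop over sampled production interactions. This is a minimal illustration under stated assumptions: the `judge` callable stands in for a real judge-LLM call, and the `"fail"` label and the 10% alert threshold are hypothetical values, not figures from the source.

```python
from typing import Callable, Iterable, Tuple

# A judge takes (user_input, model_output) and returns a verdict label.
# In practice this would wrap a call to a judge LLM; here it is an assumption.
Judge = Callable[[str, str], str]


def failure_rate(
    samples: Iterable[Tuple[str, str]],
    judge: Judge,
    fail_label: str = "fail",
) -> float:
    """Share of sampled interactions that the judge marks as failures."""
    verdicts = [judge(user_input, output) for user_input, output in samples]
    if not verdicts:
        return 0.0  # nothing sampled, nothing to report
    failures = sum(1 for v in verdicts if v == fail_label)
    return failures / len(verdicts)


def should_alert(rate: float, threshold: float = 0.10) -> bool:
    """Raise an alert when the failure rate exceeds the agreed threshold."""
    return rate > threshold
```

A scheduled job could call `failure_rate` on a fresh sample each day and notify the owning team when `should_alert` returns `True`, leaving human review for the flagged cases only.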
Results
Time saved: anywhere from one day to a full week
Grounding & classification
Source type: technical build writeup
16 fields verified against source quotes.
quality inspection · knowledge base · named customer · production runtime claimed · tools described · workflow described · travel · automation rate · technical build writeup · quality assurance · monitor detect alert