AI Agent Evaluation: Practical Tips at Booking.com
LLM agents require a more complex evaluation process than single LLMs because they use external tools, iterate through intermediate steps, and make autonomous decisions — none of which standard prompt-response evaluation captures.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · User-agent chat session
A chat session between the user and the agent is initiated, during which the user may ask the agent to perform several tasks.
Tools used
judge LLM, JSON schema, Booking AI Trip Planner
Outcome
Booking.com developed a dual evaluation framework combining black box task completion scoring via judge LLMs and glass box tool proficiency and reliability checks, enabling data-driven deployment decisions that weigh performance uplift against increased cost and latency.
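The two views described above can be illustrated with a minimal sketch. It is not Booking.com's implementation: the `Session` and `ToolCall` types, the `search_hotels` tool, the simplified schema table, and the `keyword_judge` placeholder (standing in for a real judge LLM call) are all hypothetical names chosen for illustration. The black box score looks only at the tasks and the transcript, while the glass box score inspects each tool call's arguments against an expected schema.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical, simplified stand-in for a JSON Schema:
# each tool maps its required argument names to expected Python types.
TOOL_SCHEMAS: Dict[str, Dict[str, type]] = {
    "search_hotels": {"city": str, "nights": int},
}


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class Session:
    tasks: List[str]          # tasks the user asked for during the chat
    transcript: str           # full user-agent conversation
    tool_calls: List[ToolCall] = field(default_factory=list)


def task_completion(session: Session, judge: Callable[[str, str], bool]) -> float:
    """Black box: fraction of tasks the judge marks as completed."""
    if not session.tasks:
        return 0.0
    done = sum(judge(task, session.transcript) for task in session.tasks)
    return done / len(session.tasks)


def is_valid_call(call: ToolCall) -> bool:
    """Glass box: the tool is known and every required argument has the right type."""
    schema = TOOL_SCHEMAS.get(call.name)
    if schema is None:
        return False
    return all(
        key in call.args and isinstance(call.args[key], expected)
        for key, expected in schema.items()
    )


def tool_proficiency(session: Session) -> float:
    """Glass box: fraction of tool calls that pass the schema check."""
    if not session.tool_calls:
        return 0.0
    valid = sum(is_valid_call(call) for call in session.tool_calls)
    return valid / len(session.tool_calls)


def keyword_judge(task: str, transcript: str) -> bool:
    # Placeholder only: a real judge would prompt an LLM with the task and transcript.
    return task.lower() in transcript.lower()


session = Session(
    tasks=["book hotel", "find flight"],
    transcript="Agent: I managed to book hotel in Amsterdam.",
    tool_calls=[
        ToolCall("search_hotels", {"city": "Amsterdam", "nights": 2}),
        ToolCall("search_hotels", {"city": "Amsterdam"}),  # missing "nights"
    ],
)

print(task_completion(session, keyword_judge))
print(tool_proficiency(session))
```

Keeping the two scores separate makes it possible to tell an agent that completes tasks despite malformed tool calls apart from one that calls tools correctly but still fails the user's request.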
Results
Volume: below 20%
Grounding & classification
Source type: technical build writeup
17 fields verified against source quotes.
agentic workflow, ai agent, chat transcript, named customer, production runtime claimed, source backed, tools described, workflow described, travel, accuracy improvement, technical build writeup, quality assurance, agentic task execution