quality_assurance · travel · workflow

AI Agent Evaluation: Practical Tips at Booking.com

LLM agents need a more involved evaluation process than standalone LLMs because they call external tools, iterate through intermediate steps, and make autonomous decisions, none of which standard prompt-response evaluation captures.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases — not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · User-agent chat session

A chat session between the user and the agent is initiated, during which the user may ask the agent to perform several tasks.
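As an illustration of what such a session might look like as data, the sketch below models a chat session as an ordered list of turns, with a helper that pulls out the user's requests and another that renders a plain-text transcript. The class names, role labels, and sample messages are assumptions made for this example and are not taken from the source.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    role: str      # assumed role labels: "user", "agent", or "tool"
    content: str


@dataclass
class ChatSession:
    """Hypothetical container for one user-agent chat session."""
    session_id: str
    turns: list[Turn] = field(default_factory=list)

    def add(self, role: str, content: str) -> None:
        self.turns.append(Turn(role, content))

    def user_requests(self) -> list[str]:
        # Each user turn is treated as one requested task (a simplification).
        return [t.content for t in self.turns if t.role == "user"]

    def transcript(self) -> str:
        # Plain-text rendering, e.g. for passing to an evaluator later on.
        return "\n".join(f"{t.role}: {t.content}" for t in self.turns)


def example_session() -> ChatSession:
    """Builds a made-up session in which the user asks for two tasks."""
    session = ChatSession(session_id="demo-1")
    session.add("user", "Find a hotel in Lisbon for two nights.")
    session.add("agent", "Here are three options in Lisbon.")
    session.add("user", "Now suggest things to do nearby.")
    session.add("agent", "You could visit the old town and the riverside.")
    return session
```

Keeping the whole session, rather than isolated prompt-response pairs, is what lets an evaluation look at several tasks within one conversation.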

Tools used

judge LLM, JSON schema, Booking AI Trip Planner

Outcome

Booking.com developed a dual evaluation framework that combines black-box task completion scoring by judge LLMs with glass-box checks of tool proficiency and reliability. The framework supports data-driven deployment decisions that weigh performance uplift against increased cost and latency.
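A minimal sketch of how the two views could be computed, under stated assumptions: the black-box score is taken as the fraction of tasks a judge marks complete, and the glass-box check is taken as the fraction of tool calls whose arguments match a JSON-Schema-like spec. The schema, the function names, and the `stub_judge` stand-in (used here in place of a real judge LLM call) are all hypothetical and do not come from the source.

```python
from typing import Callable

# Hypothetical tool spec in a JSON-Schema-like shape; only "required" and
# "type" are checked here, which is far less than a full JSON Schema validator.
SEARCH_HOTELS_SCHEMA = {
    "required": ["city", "nights"],
    "properties": {
        "city": {"type": "string"},
        "nights": {"type": "integer"},
    },
}

_PY_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def tool_call_is_valid(args: dict, schema: dict) -> bool:
    """Glass-box check: are required arguments present and correctly typed?"""
    if any(name not in args for name in schema.get("required", [])):
        return False
    for name, value in args.items():
        spec = schema.get("properties", {}).get(name)
        if spec is None:
            return False  # argument the tool does not declare
        if isinstance(value, bool) and spec["type"] != "boolean":
            return False  # bool is a subclass of int in Python; reject it
        if not isinstance(value, _PY_TYPES[spec["type"]]):
            return False
    return True


def tool_validity_rate(calls: list[dict], schema: dict) -> float:
    """Share of tool calls that pass the schema check."""
    if not calls:
        return 0.0
    return sum(tool_call_is_valid(c, schema) for c in calls) / len(calls)


def task_completion_rate(tasks: list[dict], judge: Callable[[dict], bool]) -> float:
    """Black-box score: share of tasks the judge marks as completed."""
    if not tasks:
        return 0.0
    return sum(bool(judge(t)) for t in tasks) / len(tasks)


def stub_judge(task: dict) -> bool:
    # Stand-in for a judge LLM: treats any non-empty final answer as complete.
    return bool(task.get("final_answer"))


sample_tasks = [
    {"goal": "book a hotel", "final_answer": "Booked Hotel A."},
    {"goal": "suggest activities", "final_answer": ""},
]
sample_calls = [
    {"city": "Paris", "nights": 2},
    {"city": "Paris", "nights": "two"},  # wrong type
    {"nights": 1},                       # missing required "city"
]
completion = task_completion_rate(sample_tasks, stub_judge)
validity = tool_validity_rate(sample_calls, SEARCH_HOTELS_SCHEMA)
```

In practice `stub_judge` would be replaced by a call to a judge model, and the two rates would be read alongside cost and latency figures rather than on their own.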

Results

Volume: below 20%

Source

https://booking.ai/ai-agent-evaluation-82e781439d97?source=rss----4d265f07defc---4


Grounding & classification

Source type: technical build writeup

17 fields verified against source quotes.

agentic workflow, ai agent, chat transcript, named customer, production runtime claimed, source backed, tools described, workflow described, travel, accuracy improvement, technical build writeup, quality assurance, agentic task execution