Evaluating Deep Agents: LangChain's learnings on test patterns for stateful AI agents
Traditional LLM evaluation treats every datapoint identically with a shared evaluator, but deep agents require bespoke test logic per datapoint because success criteria vary and involve assertions about trajectory and state beyond just the final message.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages are listed under "Tools used" below.
Stage 1 · Build bespoke test cases
Each test case has its own success criteria, requiring bespoke test logic rather than a single shared evaluator.
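As a rough illustration of per-datapoint test logic, the sketch below pairs each input with its own check function instead of routing every case through one shared evaluator. Everything in it is hypothetical: `AgentRun`, `fake_agent`, the tool names, and the `memories.md` file are stand-ins invented for this example, not part of LangSmith, LangGraph, or any stack described here. One check asserts on trajectory and state; the other asserts only on the final message.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentRun:
    """Hypothetical record of one agent run: trajectory, state, final reply."""
    tool_calls: list[str] = field(default_factory=list)
    files: dict[str, str] = field(default_factory=dict)
    final_message: str = ""


def fake_agent(prompt: str) -> AgentRun:
    """Stand-in for a real agent; returns canned runs so the sketch executes."""
    if "remember" in prompt:
        return AgentRun(
            tool_calls=["edit_file"],
            files={"memories.md": "User prefers metric units."},
            final_message="Got it, I will use metric units from now on.",
        )
    return AgentRun(
        tool_calls=["web_search"],
        final_message="Paris is the capital of France.",
    )


def check_memory_update(run: AgentRun) -> None:
    # Trajectory assertion: the agent called the (hypothetical) edit tool.
    assert "edit_file" in run.tool_calls
    # State assertion: the memory file now records the preference.
    assert "metric" in run.files.get("memories.md", "")


def check_factual_answer(run: AgentRun) -> None:
    # Final-message assertion: the answer names the right city.
    assert "Paris" in run.final_message
    # State assertion: a plain question should not write any files.
    assert not run.files


# Each datapoint carries its own success criteria.
CASES: list[tuple[str, Callable[[AgentRun], None]]] = [
    ("Please remember that I prefer metric units.", check_memory_update),
    ("What is the capital of France?", check_factual_answer),
]


def run_all_cases() -> int:
    """Run every case through its own check; return how many passed."""
    passed = 0
    for prompt, check in CASES:
        check(fake_agent(prompt))  # raises AssertionError on failure
        passed += 1
    return passed
```

In a Pytest suite, each entry in `CASES` would more naturally be its own test function, so a failure points directly at the datapoint whose criteria were not met.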
Tools used
LangSmith, LangGraph, Pytest, Vitest, Docker, Harbor, vcr, Hono
Outcome
LangChain shipped four deep agent applications and distilled five evaluation patterns: bespoke test logic, single-step evals, full agent turns, multi-turn simulation, and reproducible environment setup. The patterns are integrated with LangSmith for trace logging and result tracking.
What failed first
The naive approach to multi-turn agent tests, hardcoding a fixed sequence of inputs, breaks when the agent deviates from the expected path, because every subsequent hardcoded input becomes nonsensical. Standard single-evaluator eval pipelines also fell short, since they cannot accommodate per-datapoint success criteria.
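One possible way around the hardcoded-sequence problem, sketched here under assumptions rather than taken from the source, is to check the agent's reply after every turn and send the next input only while the conversation is still on the expected path. The names `scripted_agent`, `run_conversation`, and `STEPS` are hypothetical, and the agent is a canned stand-in so the example runs on its own.

```python
from typing import Callable

# Each step pairs a user input with a predicate the agent's reply must satisfy
# before the next input is sent.
Step = tuple[str, Callable[[str], bool]]
Agent = Callable[[list[str], str], str]


def scripted_agent(history: list[str], user_input: str) -> str:
    """Hypothetical stand-in agent with canned replies."""
    text = user_input.lower()
    if "book" in text:
        return "Which city are you flying to?"
    if "lisbon" in text:
        return "Booked a flight to Lisbon."
    return "Sorry, I did not understand."


def run_conversation(agent: Agent, steps: list[Step]) -> tuple[bool, int]:
    """Drive a multi-turn test, stopping at the first off-path reply.

    Returns (on_track, turns_completed) so a failure reports where the
    agent deviated instead of feeding it inputs that no longer make sense.
    """
    history: list[str] = []
    for index, (user_input, on_track) in enumerate(steps):
        reply = agent(history, user_input)
        history += [user_input, reply]
        if not on_track(reply):
            return False, index  # fail early; skip the remaining inputs
    return True, len(steps)


STEPS: list[Step] = [
    ("I want to book a flight", lambda reply: "city" in reply.lower()),
    ("Lisbon", lambda reply: "booked" in reply.lower()),
]
```

Calling `run_conversation(scripted_agent, STEPS)` completes both turns, while an agent that answers the first input off-topic stops the test at turn 0 rather than receiving "Lisbon" out of context.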