quality_assurance · saas · workflow

Evaluating Deep Agents: LangChain's learnings on test patterns for stateful AI agents

Traditional LLM evaluation treats every datapoint identically with a shared evaluator, but deep agents require bespoke test logic per datapoint because success criteria vary and involve assertions about trajectory and state beyond just the final message.

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Build bespoke test cases
Each test case has its own success criteria, requiring bespoke test logic rather than a single shared evaluator.
Tools used
LangSmith, LangGraph, Pytest, Vitest, Docker, Harbor, vcr, Hono
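
The bespoke test logic described in Stage 1 can be illustrated with a minimal Python sketch. It is not taken from the source: `AgentResult`, `run_stub_agent`, and both check functions are hypothetical names, and the stub agent is a deterministic stand-in so the example runs without a model. The point is only that each test case carries its own assertions over the trajectory (tool calls), the resulting state, and the final message, instead of sharing one evaluator.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentResult:
    """Hypothetical container for what one agent run produced."""
    final_message: str
    tool_calls: list[str] = field(default_factory=list)  # trajectory
    state: dict[str, str] = field(default_factory=dict)  # e.g. files or memory


def run_stub_agent(user_input: str) -> AgentResult:
    """Deterministic stand-in for a real agent (assumption, not a real API)."""
    if "remember" in user_input.lower():
        return AgentResult(
            final_message="Got it, I saved that preference.",
            tool_calls=["edit_file"],
            state={"memories.md": "User prefers short replies."},
        )
    return AgentResult(final_message="Hello! How can I help?")


def check_memory_saved(result: AgentResult) -> None:
    # Success here depends on trajectory and state, not only on the reply.
    assert "edit_file" in result.tool_calls
    assert "short replies" in result.state.get("memories.md", "")
    assert result.final_message


def check_no_tools_for_greeting(result: AgentResult) -> None:
    # A different datapoint has different criteria: no tool use, no state change.
    assert result.tool_calls == []
    assert result.state == {}


# Each datapoint is paired with its own check instead of one shared evaluator.
TEST_CASES: list[tuple[str, Callable[[AgentResult], None]]] = [
    ("Remember that I prefer short replies.", check_memory_saved),
    ("Hi there", check_no_tools_for_greeting),
]


def run_all_cases() -> int:
    """Run every case with its bespoke check; return how many passed."""
    passed = 0
    for user_input, check in TEST_CASES:
        check(run_stub_agent(user_input))
        passed += 1
    return passed
```

In a Pytest suite, the same pairs could be fed to `pytest.mark.parametrize` so that each case is reported as a separate test.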
Outcome

LangChain shipped four deep agent applications and distilled five evaluation patterns from them: bespoke test logic, single-step evals, full agent turns, multi-turn simulation, and reproducible environment setup. The patterns are integrated with LangSmith for trace logging and result tracking.

What failed first

Two naive approaches failed. Hardcoding sequential inputs for multi-turn agent tests breaks when the agent deviates from the expected path, because the subsequent hardcoded inputs no longer make sense. Standard single-evaluator eval pipelines also cannot accommodate per-datapoint success criteria.
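
One way to guard against the multi-turn failure described above is to check the agent's reply after every turn and stop as soon as it leaves the expected path, so that later scripted inputs are never sent into a conversation where they no longer apply. The sketch below is an illustration under that assumption, not the method documented in the source; `run_scripted_turns`, `stub_agent`, and `SCRIPT` are hypothetical names, and the stub agent replaces a real model.

```python
from __future__ import annotations

from typing import Callable

# An agent here is any callable taking (history, user_input) and returning a reply.
Agent = Callable[[list[str], str], str]
Turn = tuple[str, Callable[[str], bool]]  # (scripted input, check on the reply)


def stub_agent(history: list[str], user_input: str) -> str:
    """Deterministic stand-in for a real agent (assumption, not a real API)."""
    if "book" in user_input.lower():
        return "Which city are you flying to?"
    if history and "which city" in history[-1].lower():
        return f"Booked a flight to {user_input}."
    return "Sorry, I did not understand."


def run_scripted_turns(agent: Agent, turns: list[Turn]) -> tuple[int, bool]:
    """Send scripted turns one by one, stopping at the first deviation.

    Returns (number of turns that stayed on path, whether all turns passed).
    """
    history: list[str] = []
    completed = 0
    for user_input, on_expected_path in turns:
        reply = agent(history, user_input)
        if not on_expected_path(reply):
            # The agent left the expected path: the remaining scripted
            # inputs would be meaningless, so stop here.
            return completed, False
        history.extend([user_input, reply])
        completed += 1
    return completed, True


SCRIPT: list[Turn] = [
    ("Book me a flight", lambda reply: "which city" in reply.lower()),
    ("Paris", lambda reply: "paris" in reply.lower()),
]
```

Calling `run_scripted_turns(stub_agent, SCRIPT)` walks both turns. An agent that answers the first turn unexpectedly stops the run before "Paris" is ever sent.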

Source

https://blog.langchain.com/evaluating-deep-agents-our-learnings/

Grounding & classification
Source type: technical build writeup
18 fields verified against source quotes.
agentic workflow, ai agent, code diff pr, failure mode described, production runtime claimed, tools described, workflow described, software, technical build writeup, quality assurance, agentic task execution