quality_assurance · saas · workflow

Evaluating Deep Agents: LangChain's learnings on test patterns for stateful AI agents

Traditional LLM evaluation treats every datapoint identically with a shared evaluator, but deep agents require bespoke test logic per datapoint because success criteria vary and involve assertions about trajectory and state beyond just the final message.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · Build bespoke test cases

Each test case has its own success criteria, requiring bespoke test logic rather than a single shared evaluator.
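
A minimal sketch of what per-datapoint test logic could look like in Pytest-style Python. The `AgentResult` container, the `run_agent` stub, the `edit_file` tool name, and the `memories.md` file are all hypothetical stand-ins invented for illustration; a real suite would invoke the actual agent and inspect its real trajectory and state.

```python
from dataclasses import dataclass, field


@dataclass
class AgentResult:
    """Hypothetical container for what one agent run produced."""
    final_message: str
    tool_calls: list = field(default_factory=list)  # trajectory: tool names in call order
    files: dict = field(default_factory=dict)       # state: path -> contents


def run_agent(prompt: str) -> AgentResult:
    """Stand-in for a real agent invocation; returns canned results."""
    if "remember" in prompt:
        return AgentResult(
            final_message="Saved your preference.",
            tool_calls=["edit_file"],
            files={"memories.md": "User prefers short replies."},
        )
    return AgentResult(final_message="The capital of France is Paris.")


def test_memory_is_written():
    # Success here is about trajectory and state, not the wording of the reply.
    result = run_agent("Please remember that I prefer short replies.")
    assert "edit_file" in result.tool_calls
    assert "short replies" in result.files["memories.md"]


def test_simple_question_needs_no_tools():
    # A different datapoint with different criteria: final answer only, no tool use.
    result = run_agent("What is the capital of France?")
    assert result.tool_calls == []
    assert "Paris" in result.final_message
```

Each test function carries its own assertions, so one datapoint can check a file written to state while another checks only the final message.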

Tools used

LangSmith · LangGraph · Pytest · Vitest · Docker · Harbor · vcr · Hono

Outcome

LangChain shipped four deep agent applications and distilled five evaluation patterns — bespoke test logic, single-step evals, full agent turns, multi-turn simulation, and reproducible environment setup — integrated with LangSmith for trace logging and result tracking.

What failed first

The naive approach of hardcoding sequential inputs for multi-turn agent tests breaks when the agent deviates from the expected path, because the subsequent hardcoded inputs no longer make sense. Separately, standard single-evaluator eval pipelines cannot accommodate per-datapoint success criteria.
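
One way to avoid the hardcoded-input problem is to choose each follow-up from the agent's actual reply and to fail early when the agent leaves the expected path. The sketch below illustrates that idea under stated assumptions: `fake_agent`, `next_user_turn`, and `run_conversation` are hypothetical names, and the agent is a canned stub rather than a real model.

```python
def fake_agent(history: list[str]) -> str:
    """Stand-in agent: asks for a city first, then gives a forecast."""
    if len(history) == 1:
        return "Which city do you mean?"
    return f"Forecast for {history[-1]}: sunny."


def next_user_turn(agent_reply: str):
    """Choose the follow-up from what the agent actually said.

    Returns None when the conversation has reached its goal.
    """
    if "which city" in agent_reply.lower():
        return "Lisbon"
    if "forecast" in agent_reply.lower():
        return None
    # The agent left the expected path: stop instead of sending a nonsensical input.
    raise AssertionError(f"Unexpected agent reply: {agent_reply!r}")


def run_conversation(first_message: str, max_turns: int = 5) -> list[str]:
    """Alternate agent and simulated-user turns until the goal or the turn limit."""
    history = [first_message]
    for _ in range(max_turns):
        reply = fake_agent(history)
        history.append(reply)
        user_msg = next_user_turn(reply)
        if user_msg is None:
            return history
        history.append(user_msg)
    raise AssertionError("Conversation did not finish within max_turns")
```

Calling `run_conversation("What's the weather?")` returns the full message history, which a test can then check turn by turn.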

Source

https://blog.langchain.com/evaluating-deep-agents-our-learnings/


Grounding & classification

Source type: technical build writeup

18 fields verified against source quotes.

agentic workflow · ai agent · code diff pr · failure mode described · production runtime claimed · tools described · workflow described · software · technical build writeup · quality assurance · agentic task execution