quality_assurance · saas · workflow

LangChain improves its coding agent by 13.7 points on Terminal Bench 2.0 through harness engineering

LangChain's coding agent initially scored 52.8% on Terminal Bench 2.0, placing it just outside the Top 30. The failure modes identified were reasoning errors, failure to follow task instructions, missing verification, and doom loops of repeated failed approaches.

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack.
Stage 1 · Run agent, collect traces
Every agent action is stored in LangSmith during benchmark runs.
Tools used
deepagents-cli · LangSmith · Terminal Bench 2.0 · Harbor · Daytona · gpt-5.2-codex · PreCompletionChecklistMiddleware · LoopDetectionMiddleware
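
To make the trace-collection stage concrete, here is a minimal sketch of recording each agent action as a structured event. It is plain Python and does not use the LangSmith API; the `TraceEvent` schema, the `TraceRecorder` class, and the action names are all hypothetical and chosen only for illustration.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class TraceEvent:
    """One recorded agent action (hypothetical schema, for illustration only)."""
    run_id: str
    step: int
    action: str   # e.g. "shell", "edit_file", "finish"
    payload: str  # command text, diff, or final answer
    timestamp: float = field(default_factory=time.time)


class TraceRecorder:
    """Collects events in memory and serializes them as JSON lines."""

    def __init__(self, run_id: str) -> None:
        self.run_id = run_id
        self.events: list[TraceEvent] = []

    def record(self, action: str, payload: str) -> TraceEvent:
        # The step index is simply the position of the event in the run.
        event = TraceEvent(self.run_id, len(self.events), action, payload)
        self.events.append(event)
        return event

    def to_jsonl(self) -> str:
        # One JSON object per line, so traces can be scanned or grepped later.
        return "\n".join(json.dumps(asdict(e)) for e in self.events)


recorder = TraceRecorder(run_id="demo-run")
recorder.record("shell", "pytest -q")
recorder.record("finish", "done")
```

Storing every step in a uniform shape is what makes later automated analysis possible, since failure patterns can then be searched for across many runs rather than read one trace at a time.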
Outcome

By changing only the harness and leaving the underlying model untouched, LangChain improved its coding agent from 52.8% to 66.5% on Terminal Bench 2.0, a gain of 13.7 points that moved it from outside the Top 30 into the Top 5. Automated trace analysis saved hours in the improvement process.

What failed first

The most common failure pattern: the agent wrote a solution, re-read its own code, judged that it looked fine, and stopped without actually verifying it. Agents also got stuck in doom loops, making small variations on the same broken approach.
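
The two failure patterns above suggest two harness-level guards, sketched below in plain Python. This is not the implementation of any LangChain middleware. `LoopDetector`, `may_finish`, the action-kind labels, and every threshold are hypothetical, chosen only to illustrate the idea.

```python
from collections import deque
from difflib import SequenceMatcher


class LoopDetector:
    """Flags a run when a new action closely resembles several recent ones.

    Hypothetical helper: window size and thresholds are illustrative guesses.
    """

    def __init__(self, window: int = 6, similarity: float = 0.9,
                 max_similar: int = 3) -> None:
        self.recent: deque = deque(maxlen=window)
        self.similarity = similarity
        self.max_similar = max_similar

    def observe(self, action: str) -> bool:
        """Record an action; return True if it looks like a doom loop."""
        similar = sum(
            1 for prev in self.recent
            if SequenceMatcher(None, prev, action).ratio() >= self.similarity
        )
        self.recent.append(action)
        # Count the new action itself along with its near-duplicates.
        return similar + 1 >= self.max_similar


def may_finish(action_kinds: list) -> bool:
    """Allow completion only if a "verify" step ran after the last "edit".

    Re-reading code (a "read" step) does not count as verification.
    """
    last_edit = max(
        (i for i, kind in enumerate(action_kinds) if kind == "edit"),
        default=-1,
    )
    return any(kind == "verify" for kind in action_kinds[last_edit + 1:])


detector = LoopDetector()
flags = [detector.observe(f"python solve.py --n {i}") for i in (1, 2, 3)]
```

In this sketch the third near-identical command trips the detector, and `may_finish(["edit", "read"])` returns `False` because nothing was run to check the edit. A harness could respond to either signal by injecting a reminder into the agent's context instead of letting it continue or stop.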

Results
Time saved: hours
Volume: 52.8
Cost replaced: over 2x more tokens/time
Source

https://blog.langchain.com/improving-deep-agents-with-harness-engineering/

Grounding & classification
Source type: technical build writeup
27 fields verified against source quotes, 4 dropped as unverifiable.
agentic workflow · code generation · multi agent workflow · code diff pr · failure mode described · named customer · tools described · workflow described · software · accuracy improvement · time saved · technical build writeup · quality assurance · agentic task execution