LangChain improves coding agent by 13.7 points on Terminal Bench 2.0 through harness engineering

LangChain's coding agent scored 52.8% on Terminal Bench 2.0, placing it just outside the Top 30. Identified failure modes included reasoning errors, failure to follow task instructions, missing verification, and doom loops of repeated failed approaches.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · Run agent, collect traces

Every agent action is stored in LangSmith during benchmark runs.
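
As a rough illustration of this stage, the sketch below logs each agent action as a structured record and tallies failure categories afterward. It uses only the Python standard library and does not call the LangSmith SDK; `TraceEvent`, `TraceStore`, the field names, and the `missing_verification` tag are hypothetical names chosen for this example, not taken from the source.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TraceEvent:
    """One agent action captured during a benchmark run (hypothetical schema)."""
    task_id: str
    step: int
    action: str                        # e.g. "shell", "edit_file", "finish"
    detail: str                        # command text, file path, or final message
    failure_tag: Optional[str] = None  # assigned later, during trace analysis


@dataclass
class TraceStore:
    """In-memory stand-in for a tracing backend (assumption, not LangSmith)."""
    events: List[TraceEvent] = field(default_factory=list)

    def record(self, task_id: str, action: str, detail: str) -> TraceEvent:
        # The step index is the number of events already stored for this task.
        step = sum(1 for e in self.events if e.task_id == task_id)
        event = TraceEvent(task_id, step, action, detail)
        self.events.append(event)
        return event

    def for_task(self, task_id: str) -> List[TraceEvent]:
        return [e for e in self.events if e.task_id == task_id]

    def failure_counts(self) -> Counter:
        # Count only events that were tagged with a failure category.
        return Counter(e.failure_tag for e in self.events if e.failure_tag)


# Usage: record two actions, then tag the premature finish during analysis.
store = TraceStore()
store.record("task-1", "edit_file", "solution.py")
last = store.record("task-1", "finish", "code looks fine")
last.failure_tag = "missing_verification"
print(store.failure_counts())
```

Keeping every action as a flat record is what makes later analysis cheap: failure categories can be counted across all runs without re-executing the agent.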

Tools used

deepagents-cli, LangSmith, Terminal Bench 2.0, Harbor, Daytona, gpt-5.2-codex, PreCompletionChecklistMiddleware, LoopDetectionMiddleware

Outcome

By changing only the harness, without modifying the underlying model, LangChain improved its coding agent from 52.8% to 66.5% on Terminal Bench 2.0, moving from outside the Top 30 to the Top 5. Automated trace analysis saved hours of time in the improvement process.

What failed first

The most common failure pattern was that the agent wrote a solution, re-read its own code, confirmed it looked okay, and stopped without proper verification. Agents also got stuck in doom loops making small variations to the same broken approach.
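
The two failure patterns above suggest two harness-level guards: one that refuses completion until a verification command has run after the last edit, and one that notices repeated near-identical actions. The sketch below is a minimal illustration under stated assumptions. It is not the implementation of the PreCompletionChecklistMiddleware or LoopDetectionMiddleware named in the source; `LoopDetector`, `may_finish`, the `edit ` action prefix, and the list of verification commands are all hypothetical.

```python
from collections import deque


class LoopDetector:
    """Flags when the agent keeps issuing near-identical actions (hypothetical)."""

    def __init__(self, window=6, max_repeats=3):
        self.recent = deque(maxlen=window)  # sliding window of normalized actions
        self.max_repeats = max_repeats

    @staticmethod
    def _normalize(action):
        # Collapse whitespace and case so trivial variations compare equal.
        return " ".join(action.lower().split())

    def observe(self, action):
        """Record an action; return True if it repeats too often in the window."""
        key = self._normalize(action)
        self.recent.append(key)
        return self.recent.count(key) >= self.max_repeats


def may_finish(history, verify_prefixes=("pytest", "make test", "./run_tests")):
    """Allow completion only if a verification command ran after the last edit.

    `history` is a list of action strings; edits are assumed to start with "edit ".
    """
    last_edit = max(
        (i for i, a in enumerate(history) if a.startswith("edit ")), default=-1
    )
    return any(a.startswith(verify_prefixes) for a in history[last_edit + 1:])


# Usage: re-reading code does not count as verification; running tests does.
print(may_finish(["edit solution.py", "cat solution.py"]))
print(may_finish(["edit solution.py", "pytest -q"]))

detector = LoopDetector()
flags = [detector.observe(a) for a in ["python fix.py", "Python  fix.py", "python fix.py"]]
print(flags)
```

Both guards live in the harness rather than in the model, which matches the approach described here: the agent's behavior changes without any change to the underlying model.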

Results

Time saved: hours of time

Volume: 52.8

Cost replaced: over 2x more tokens/time

Source

https://blog.langchain.com/improving-deep-agents-with-harness-engineering/

Grounding & classification

Source type: technical build writeup

27 fields verified against source quotes, 4 dropped as unverifiable.

agentic workflow, code generation, multi agent workflow, code diff pr, failure mode described, named customer, tools described, workflow described, software, accuracy improvement, time saved, technical build writeup, quality assurance, agentic task execution