LangChain improves coding agent by 13.7 points on Terminal Bench 2.0 through harness engineering
LangChain's coding agent scored 52.8% on Terminal Bench 2.0, placing it just outside the Top 30. The identified failure modes included reasoning errors, not following task instructions, missing verification, and doom loops of repeated failed approaches.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Run agent, collect traces
Every agent action is stored in LangSmith during benchmark runs.
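To make the idea of storing every action concrete, here is a minimal sketch of a trace recorder in plain Python. It is not LangSmith's API: `TraceEvent`, `TraceRecorder`, the action names, and the run ID are all hypothetical, chosen only to show what a per-action record might contain.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class TraceEvent:
    """One agent action within a benchmark run (hypothetical schema)."""
    run_id: str
    step: int
    action: str
    payload: dict[str, Any]
    timestamp: float = field(default_factory=time.time)


class TraceRecorder:
    """Collects every action of a single run in order (illustrative only)."""

    def __init__(self, run_id: str) -> None:
        self.run_id = run_id
        self.events: list[TraceEvent] = []

    def record(self, action: str, **payload: Any) -> TraceEvent:
        # The step index is simply the position of the event in the run.
        event = TraceEvent(self.run_id, len(self.events), action, payload)
        self.events.append(event)
        return event

    def to_jsonl(self) -> str:
        # One JSON object per line, so runs can be analyzed in bulk later.
        return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in self.events)


# Example: a run that writes a file and then executes a test command.
recorder = TraceRecorder("task-001")
recorder.record("write_file", path="solution.py")
recorder.record("run_command", command="pytest -q", exit_code=1)
```

Keeping each action as a structured record rather than free-form log text is what makes later automated analysis of the traces practical.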
By changing only the harness, without modifying the underlying model, LangChain improved its coding agent from 52.8% to 66.5% on Terminal Bench 2.0, moving from outside the Top 30 to the Top 5. Automated trace analysis saved hours in the improvement process.
What failed first
The most common failure pattern was that the agent wrote a solution, re-read its own code, confirmed it looked okay, and stopped without running any real verification. Agents also got stuck in doom loops, making small variations to the same broken approach.
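Both patterns can be flagged from recorded actions with simple heuristics. The sketch below is one possible approach, not LangChain's actual method. It assumes each action is a dict with an `action` key plus optional `path` or `command` keys. The event schema, the `VERIFY_PREFIXES` list, and the `max_edits` threshold are all hypothetical.

```python
from collections import Counter

# Commands treated as "real" verification (hypothetical list).
VERIFY_PREFIXES = ("pytest", "python -m pytest", "make test")


def missing_verification(events: list[dict]) -> bool:
    """True if no verification command ran after the last file write."""
    last_write = max(
        (i for i, e in enumerate(events) if e["action"] == "write_file"),
        default=None,
    )
    if last_write is None:
        return False  # Nothing was written, so there is nothing to verify.
    return not any(
        e["action"] == "run_command"
        and e.get("command", "").startswith(VERIFY_PREFIXES)
        for e in events[last_write + 1:]
    )


def doom_loop_files(events: list[dict], max_edits: int = 3) -> list[str]:
    """Files written more than max_edits times, a rough doom-loop signal."""
    edits = Counter(e["path"] for e in events if e["action"] == "write_file")
    return sorted(path for path, count in edits.items() if count > max_edits)


# A run that writes code, re-reads it, and stops without testing.
unverified_run = [
    {"action": "write_file", "path": "solution.py"},
    {"action": "read_file", "path": "solution.py"},
    {"action": "finish"},
]

# A run that keeps rewriting the same file after failing tests.
looping_run = []
for _ in range(5):
    looping_run.append({"action": "write_file", "path": "solution.py"})
    looping_run.append({"action": "run_command", "command": "pytest -q"})

print(missing_verification(unverified_run))  # True
print(doom_loop_files(looping_run))          # ['solution.py']
```

An edit count is a crude proxy. A stricter check might also require that the intervening test runs failed, which this sketch does not inspect.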