quality_assurance · saas · workflow

Cursor's continuous improvement system for its AI coding agent harness

Building a reliable AI coding agent harness requires accurately measuring quality beyond benchmarks, catching tool-call degradations at scale, and customizing behavior per model — all while system complexity grows with each new model and capability added.

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Vision-driven hypothesis
Harness improvements begin with a vision-driven opinion about the ideal agent experience, from which hypotheses are formed.
Tools used
CursorBench, Automation, Cloud Agents, GenerateImage, WebSearch
Outcome

Over a focused sprint, Cursor drove unexpected tool call errors down by an order of magnitude and established a continuous automated loop for detecting, investigating, and fixing harness degradations.
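As an illustration of the detection step in such a loop, here is a minimal sketch that compares per-tool error rates in a recent window against a baseline window and flags tools that have clearly worsened. It is not Cursor's implementation: the `ToolCall` record, the tool names, and the `min_calls`, `ratio`, and `floor` thresholds are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str  # hypothetical tool name, e.g. "edit_file"
    ok: bool   # False when the call ended in an unexpected error


def error_rates(calls):
    """Return {tool: (error_rate, total_calls)} for a batch of calls."""
    totals, errors = Counter(), Counter()
    for call in calls:
        totals[call.tool] += 1
        if not call.ok:
            errors[call.tool] += 1
    return {tool: (errors[tool] / n, n) for tool, n in totals.items()}


def find_degradations(baseline, current, min_calls=50, ratio=2.0, floor=0.01):
    """Flag tools whose current error rate is well above the baseline.

    All thresholds are illustrative assumptions, not values from the source.
    """
    base = error_rates(baseline)
    flagged = []
    for tool, (rate, n) in error_rates(current).items():
        if n < min_calls:
            continue  # too few calls in this window to judge
        base_rate = base.get(tool, (0.0, 0))[0]
        # Require both a relative jump and an absolute minimum rate.
        if rate > max(base_rate * ratio, floor):
            flagged.append((tool, base_rate, rate))
    # Worst current error rate first, so investigation starts there.
    return sorted(flagged, key=lambda item: item[2], reverse=True)
```

In a setup like this, the flagged list would feed the investigation step. The `floor` guard keeps tools with near-zero baselines from being flagged on a handful of errors.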

What failed first

Early static context-engineering guardrails were deprecated as model capabilities improved, and an experiment using a more expensive model for context summarization showed negligible quality improvement.

Results
Volume: down by an order of magnitude
Running since: late 2024
Source

https://cursor.com/blog/continually-improving-agent-harness


Grounding & classification
Source type: technical build writeup
30 fields verified against source quotes.
agentic workflow, ai agent, anomaly detection, code generation, multi agent workflow, summarization, code diff pr, knowledge base, failure mode described, human review described, metric backed, named customer, production runtime claimed, tools described, workflow described, software, error reduction, technical build writeup, incident management, quality assurance, agentic task execution, monitor detect alert