
Trae AI achieves #1 on SWE-bench Verified with 70.6% score via multi-agent patch generation and selection

Simple LLM-based patch selection degraded in performance as the candidate sampling space grew, preventing effective use of the test-time scaling law for software issue resolution.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · Issue description triggers generation

An issue description is provided as input to initiate candidate patch generation.
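As a rough illustration of this stage, the sketch below fans one issue description out to several patch generators and collects every sampled candidate. The names `CandidatePatch`, `Generator`, and `generate_candidates` are hypothetical, and the generators are stand-in callables; the source does not describe how the underlying models are actually invoked.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class CandidatePatch:
    generator: str  # label of the (hypothetical) generator that produced the patch
    diff: str       # patch content, e.g. a unified diff


# Assumption: a generator is any callable that maps an issue description to a patch.
Generator = Callable[[str], str]


def generate_candidates(
    issue: str,
    generators: Dict[str, Generator],
    samples_per_generator: int,
) -> List[CandidatePatch]:
    """Sample each generator several times for the same issue description."""
    candidates: List[CandidatePatch] = []
    for name, generate in generators.items():
        for _ in range(samples_per_generator):
            candidates.append(CandidatePatch(generator=name, diff=generate(issue)))
    return candidates
```

Increasing `samples_per_generator` widens the candidate pool, which is the sampling space that the later selection step has to cope with.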

Tools used

Tree-sitter, Agentless, Claude 3.7 Sonnet (partner), Gemini 2.5 Pro (partner), o4 mini (partner), OpenAI o1

Outcome

Trae's multi-agent Selector approach raised the overall SWE-bench Verified success rate to 70.6%, achieving the #1 position on the leaderboard when evaluated with Claude 3.7.

What failed first

LLM-as-a-Selector, which used OpenAI o1 to pick among candidate patches after regression filtering, peaked at small sampling sizes and then performed worse at larger ones, undermining the benefit of generating more candidates.
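The baseline described above can be sketched as a two-step pipeline: discard candidates that fail regression tests, then ask a single selector to pick one survivor. This is a minimal sketch under stated assumptions, not Trae's implementation. `passes_regression` and `selector` are hypothetical callables standing in for a test harness and an LLM call.

```python
from typing import Callable, List, Optional, Sequence

# Assumption: patches are represented as plain diff strings.
Patch = str


def filter_by_regression(
    patches: Sequence[Patch],
    passes_regression: Callable[[Patch], bool],
) -> List[Patch]:
    """Keep only patches that do not break the existing regression tests."""
    return [patch for patch in patches if passes_regression(patch)]


def select_patch(
    issue: str,
    patches: Sequence[Patch],
    passes_regression: Callable[[Patch], bool],
    selector: Callable[[str, Sequence[Patch]], int],
) -> Optional[Patch]:
    """Filter candidates, then let one selector choose the index of the final patch."""
    survivors = filter_by_regression(patches, passes_regression)
    if not survivors:
        return None  # nothing passed regression filtering
    index = selector(issue, survivors)
    if not 0 <= index < len(survivors):
        raise ValueError(f"selector returned out-of-range index {index}")
    return survivors[index]
```

In this shape, the selector sees every surviving candidate at once, so a larger sampling size means a longer list for one selection call to judge. That is the setting in which the source reports performance peaking and then declining.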

Results

Volume: 70.6%

Cost replaced: 60.6% to 62.6%

Source

https://www.trae.ai/blog/product_update_0528


Grounding & classification

Source type: technical build writeup

23 fields verified against source quotes, 5 dropped as unverifiable.

agentic workflow · ai agent · code generation · multi agent workflow · code diff pr · builder submitted · failure mode described · metric backed · workflow described · software · accuracy improvement · technical build writeup · quality assurance · agentic task execution