Trae AI reaches #1 on SWE-bench Verified with a 70.6% score via multi-agent patch generation and selection

Simple LLM-based patch selection degraded in performance as the candidate sampling space grew, preventing effective use of the test-time scaling law for software issue resolution.

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Issue description triggers generation
An issue description is provided as input to initiate candidate patch generation.
Tools used
Tree-sitter, Agentless, Claude 3.7 Sonnet (partner), Gemini 2.5 Pro (partner), o4 mini (partner), OpenAI o1
Outcome

Trae's multi-agent Selector approach raised the overall SWE-bench Verified success rate to 70.6%, achieving the #1 position on the leaderboard when evaluated with Claude 3.7.

What failed first

LLM-as-a-Selector, which used OpenAI o1 to pick among candidate patches after regression filtering, peaked at small sampling sizes and then performed worse at larger ones, undermining the benefit of generating more candidates.
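The filter-then-select flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Trae's implementation: the `Candidate` shape, the function names, the fallback when every candidate fails regression, and the `shortest_diff_selector` stand-in (used here in place of a real LLM call) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Candidate:
    """A candidate patch plus its regression result (hypothetical shape)."""
    patch_id: str
    diff: str
    passes_regression: bool


# A selector receives the issue text and the surviving candidates and
# returns exactly one of them. In an LLM-as-a-Selector setup this would
# wrap a model call; here it is just a plain function type.
Selector = Callable[[str, Sequence[Candidate]], Candidate]


def regression_filter(candidates: Sequence[Candidate]) -> List[Candidate]:
    """Keep candidates that pass regression tests.

    Assumption: if none pass, fall back to the full pool so the
    selector still has something to choose from.
    """
    survivors = [c for c in candidates if c.passes_regression]
    return survivors or list(candidates)


def select_patch(issue: str, candidates: Sequence[Candidate],
                 selector: Selector) -> Candidate:
    """Filter by regression result, then delegate the final pick."""
    if not candidates:
        raise ValueError("no candidate patches to select from")
    survivors = regression_filter(candidates)
    if len(survivors) == 1:
        return survivors[0]  # nothing left to decide
    return selector(issue, survivors)


def shortest_diff_selector(issue: str,
                           candidates: Sequence[Candidate]) -> Candidate:
    """Placeholder for an LLM call: picks the smallest diff."""
    return min(candidates, key=lambda c: len(c.diff))


pool = [
    Candidate("a", "x" * 10, passes_regression=False),
    Candidate("b", "x" * 30, passes_regression=True),
    Candidate("c", "x" * 20, passes_regression=True),
]
chosen = select_patch("example issue text", pool, shortest_diff_selector)
```

Keeping the selector as a swappable function makes the weak point visible: the filter narrows the pool, but the final choice is only as good as the selector, and a larger pool gives a weak selector more wrong patches to pick from.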

Results
Volume: 70.6%
Cost replaced: 60.6% to 62.6%
Source

https://www.trae.ai/blog/product_update_0528

Grounding & classification
Source type: technical build writeup
23 fields verified against source quotes, 5 dropped as unverifiable.
agentic workflow, ai agent, code generation, multi agent workflow, code diff pr, builder submitted, failure mode described, metric backed, workflow described, software, accuracy improvement, technical build writeup, quality assurance, agentic task execution