quality_assurance · saas · workflow
Weights & Biases builds o1-based AI programming agent achieving 64.6% on SWE-Bench-Verified
Building a reliable autonomous AI programming agent required two things: addressing o1's tendency to misorder time-sequenced events, and iterating extensively over hundreds of evals to achieve consistent agent behavior.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · GitHub issue received
Each agent instance is initialized for a single GitHub issue, with an associated Docker image and held-out unit tests.
Tools used
o1, gpt4o, Weave, Phaseshift, Eval Studio
Outcome
The o1-based agent resolves 64.6% of SWE-Bench-Verified issues, tops the leaderboard, and significantly outperforms OpenAI's own published o1 result.
What failed first
o1 exhibited a time-ordering failure mode: after a sequence of edits and test runs, it would claim a test still failed without having re-run the test following the most recent edit.
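The failure mode above can be illustrated with a small guard that checks whether a test result is still current before the agent is allowed to cite it. This is a minimal sketch, not the method Weights & Biases used: the `Event` record, the `"edit"`/`"test"` event kinds, and the `latest_test_is_stale` helper are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Event:
    """One entry in a hypothetical agent action history."""
    step: int                      # position in the history (monotonically increasing)
    kind: str                      # "edit" or "test" (assumed event kinds)
    target: str                    # file path for an edit, test name for a test run
    passed: Optional[bool] = None  # only meaningful for "test" events


def latest_test_is_stale(history: List[Event], test_name: str) -> bool:
    """Return True if the most recent run of `test_name` predates the most
    recent edit, or if the test was never run at all."""
    test_steps = [e.step for e in history if e.kind == "test" and e.target == test_name]
    if not test_steps:
        return True  # no result exists, so nothing can be claimed about the test
    edit_steps = [e.step for e in history if e.kind == "edit"]
    if not edit_steps:
        return False  # no edits were made, so the existing result is still current
    return max(test_steps) < max(edit_steps)


history = [
    Event(step=1, kind="edit", target="src/parser.py"),
    Event(step=2, kind="test", target="test_parse", passed=False),
    Event(step=3, kind="edit", target="src/parser.py"),
]

# The failure at step 2 predates the edit at step 3, so claiming
# "test_parse still fails" here would be the time-ordering error.
stale_before_rerun = latest_test_is_stale(history, "test_parse")

history.append(Event(step=4, kind="test", target="test_parse", passed=True))
stale_after_rerun = latest_test_is_stale(history, "test_parse")
```

In this sketch, `stale_before_rerun` is `True` and `stale_after_rerun` is `False`. A harness could use such a check to force a re-run before accepting any claim about a test's status.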
Results
Volume: 64.6%
Grounding & classification
Source type: technical build writeup
22 fields verified against source quotes, 1 dropped as unverifiable.
agentic workflow, ai agent, code generation, multi agent workflow, code diff pr, failure mode described, metric backed, named customer, tools described, workflow described, software, accuracy improvement, technical build writeup, quality assurance, agentic task execution