
Weights & Biases builds o1-based AI programming agent achieving 64.6% on SWE-Bench-Verified

Building a reliable autonomous AI programming agent required two things: addressing o1's tendency to misorder time-sequenced events, and iterating across hundreds of evals until the agent behaved consistently.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear under "Tools used" below.

Stage 1 · GitHub issue received

Each agent instance is initialized for a single GitHub issue, with an associated Docker image and a set of held-out unit tests.
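A minimal sketch of what one such task record might look like. The source does not describe the actual data model, so the class name `IssueTask`, its fields, and the `agent_view` helper are all hypothetical. The sketch also assumes that "held-out" means the tests are kept away from the agent and used only for scoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IssueTask:
    """One agent run: a GitHub issue plus the environment used to evaluate it.

    All names here are illustrative assumptions, not the source's schema.
    """
    issue_id: str                          # hypothetical identifier for the issue
    problem_statement: str                 # issue text given to the agent
    docker_image: str                      # image the agent works inside
    held_out_tests: tuple[str, ...] = ()   # used for scoring only (assumption)

    def agent_view(self) -> dict:
        """Return only the fields the agent may see; held-out tests are excluded."""
        return {
            "issue_id": self.issue_id,
            "problem_statement": self.problem_statement,
            "docker_image": self.docker_image,
        }


# Example with placeholder values.
task = IssueTask(
    issue_id="example-issue-1",
    problem_statement="Describe the bug here.",
    docker_image="example/image:latest",
    held_out_tests=("tests/test_example.py::test_case",),
)
view = task.agent_view()
```

Keeping the held-out tests out of `agent_view` makes the separation between what the agent sees and what the evaluator checks explicit in one place.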

Tools used

o1 · GPT-4o · Weave · Phaseshift · Eval Studio

Outcome

The o1-based agent resolves 64.6% of SWE-Bench-Verified issues, tops the leaderboard, and significantly outperforms OpenAI's own published o1 result.

What failed first

o1 exhibited a time-ordering failure mode: after a sequence of edits and test runs, it would claim a test still failed without having re-run the test following the most recent edit.
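The failure mode above can be framed as a stale-observation check over the agent's action log. The sketch below is illustrative only: the source does not say how the problem was detected or fixed, and the `Event` type, its fields, and `latest_test_result_is_stale` are hypothetical names. It assumes each action carries a non-negative, increasing step number.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass(frozen=True)
class Event:
    """One logged agent action (hypothetical schema)."""
    step: int                              # non-negative, increasing order index
    kind: Literal["edit", "test_run"]
    test_passed: Optional[bool] = None     # set only for test runs


def latest_test_result_is_stale(events: list[Event]) -> bool:
    """Return True if any edit happened after the most recent test run.

    In that case the last observed test result predates the current code,
    so a claim such as "the test still fails" is not supported by the log.
    """
    last_edit = max((e.step for e in events if e.kind == "edit"), default=-1)
    last_test = max((e.step for e in events if e.kind == "test_run"), default=-1)
    return last_edit > last_test


# Edit, failing test run, then another edit with no re-run.
log = [
    Event(0, "edit"),
    Event(1, "test_run", test_passed=False),
    Event(2, "edit"),
]
stale_before_rerun = latest_test_result_is_stale(log)

# Re-running the test after the last edit makes the result current again.
log_after_rerun = log + [Event(3, "test_run", test_passed=True)]
stale_after_rerun = latest_test_result_is_stale(log_after_rerun)
```

A check of this shape could gate any statement about test status on a test run that follows the latest edit. Whether the original agent uses anything similar is not stated in the source.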

Results

Volume: 64.6%

Source

https://medium.com/@shawnup/the-best-ai-programmer-from-weights-biases-04cf8127afd8


Grounding & classification

Source type: technical build writeup

22 fields verified against source quotes, 1 dropped as unverifiable.

agentic workflow · ai agent · code generation · multi agent workflow · code diff pr · failure mode described · metric backed · named customer · tools described · workflow described · software · accuracy improvement · technical build writeup · quality assurance · agentic task execution