quality_assurance · saas · workflow

Anthropic: Infrastructure resource configuration shifts agentic coding benchmark scores by up to 6 percentage points

Infrastructure configuration alone can produce benchmark score differences that exceed the margins between frontier models on agentic coding evals. Anthropic's Kubernetes setup enforced each task's resource spec as both floor and ceiling, so transient spikes triggered OOM kills and infra error rates reached as high as 6% of tasks.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · Agent receives coding task

An AI model is given a full environment where it writes programs, runs tests, installs dependencies, and iterates over multiple turns.

Tools used

Terminal-Bench 2.0, Google Kubernetes Engine, SWE-bench, Claude

Outcome

Across six resource configurations, ranging from strict enforcement to uncapped allocation, infra error rates fell from 5.8% to 0.5% and success scores rose by 6 percentage points. Anthropic recommends that evals specify two separate parameters per task: a guaranteed allocation and a hard kill threshold.
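The recommendation to keep the guaranteed allocation and the hard kill threshold as separate per-task parameters can be sketched as follows. This is a minimal illustration, not Anthropic's implementation: the `TaskResources` class, its field names, and the example values are all hypothetical. The only real interface assumed is the Kubernetes container `resources` field, where `requests` is the guaranteed amount and `limits` is the enforced ceiling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResources:
    """Hypothetical per-task resource spec; all names here are illustrative."""

    guaranteed_memory_mib: int      # amount the scheduler reserves for the task
    kill_memory_mib: int            # container is OOM-killed above this amount
    guaranteed_cpu_millicores: int  # CPU reserved for the task
    cpu_limit_millicores: int       # CPU is throttled above this amount

    def __post_init__(self) -> None:
        # A ceiling below the floor is never a valid specification.
        if self.kill_memory_mib < self.guaranteed_memory_mib:
            raise ValueError("memory kill threshold must be >= guaranteed memory")
        if self.cpu_limit_millicores < self.guaranteed_cpu_millicores:
            raise ValueError("CPU limit must be >= guaranteed CPU")

    def to_k8s_resources(self) -> dict:
        """Render the spec as a Kubernetes container `resources` mapping."""
        return {
            "requests": {
                "memory": f"{self.guaranteed_memory_mib}Mi",
                "cpu": f"{self.guaranteed_cpu_millicores}m",
            },
            "limits": {
                "memory": f"{self.kill_memory_mib}Mi",
                "cpu": f"{self.cpu_limit_millicores}m",
            },
        }


# Example values are assumptions chosen only to show floor != ceiling.
spec = TaskResources(
    guaranteed_memory_mib=2048,
    kill_memory_mib=6144,
    guaranteed_cpu_millicores=1000,
    cpu_limit_millicores=2000,
)
resources = spec.to_k8s_resources()
```

Keeping the two values as distinct fields makes the headroom an explicit, reviewable choice for each task instead of a side effect of the cluster setup.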

What failed first

Setting the guaranteed resource allocation equal to the hard kill threshold left zero headroom for transient memory spikes, causing spurious OOM kills for containers that would otherwise have succeeded.
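A toy model of this failure mode, assuming a hypothetical memory trace in MiB and a hypothetical helper `first_oom_kill`: when the kill threshold equals the guaranteed allocation, a brief spike ends the task, while the same trace survives once the threshold sits above the floor. The numbers are invented for illustration and are not taken from the source.

```python
from typing import Optional, Sequence


def first_oom_kill(memory_trace_mib: Sequence[int], kill_threshold_mib: int) -> Optional[int]:
    """Return the index of the first sample above the kill threshold, or None."""
    for index, used in enumerate(memory_trace_mib):
        if used > kill_threshold_mib:
            return index
    return None


# Hypothetical trace: steady usage below 2048 MiB with one transient spike.
trace = [900, 1400, 2300, 1500, 1200]
guaranteed_mib = 2048

# Floor == ceiling: the spike at index 2 kills the container.
strict = first_oom_kill(trace, kill_threshold_mib=guaranteed_mib)

# Ceiling above the floor (3x here, an assumed value): the task runs to completion.
with_headroom = first_oom_kill(trace, kill_threshold_mib=3 * guaranteed_mib)

print(strict, with_headroom)  # prints "2 None"
```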

Results

Volume: 6%

Source

https://www.anthropic.com/engineering/infrastructure-noise


Grounding & classification

Source type: technical build writeup

25 fields verified against source quotes.

agentic workflow, code generation, code diff pr, failure mode described, metric backed, production runtime claimed, tools described, workflow described, software, accuracy improvement, error reduction, technical build writeup, quality assurance, monitor detect alert