Anthropic: Infrastructure resource configuration shifts agentic coding benchmark scores by up to 6 percentage points
Infrastructure configuration alone can shift scores on agentic coding evals by more than the margin that separates frontier models. Anthropic's Kubernetes setup enforced each task's resource spec as both floor and ceiling, so transient spikes triggered OOM kills and infra errors hit as many as 5.8% of tasks.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Agent receives coding task
An AI model is given a full environment where it writes programs, runs tests, installs dependencies, and iterates over multiple turns.
Across six resource configurations, ranging from strict to uncapped allocation, infra error rates fell from 5.8% to 0.5% and success scores rose by 6 percentage points. Anthropic recommends that evals specify two separate parameters per task: a guaranteed allocation and a hard kill threshold.
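One way to picture the recommendation is a per-task spec that carries the guaranteed allocation and the hard kill threshold as two distinct fields. The sketch below is illustrative only: the `TaskResources` class, its field names, and the memory figures are assumptions, not taken from Anthropic's harness. The `requests` and `limits` keys follow the standard Kubernetes container `resources` layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResources:
    """Hypothetical per-task resource spec with a separate floor and ceiling."""

    guaranteed_mem_mib: int  # memory reserved for the task (the floor)
    kill_threshold_mib: int  # memory above which the container is killed (the ceiling)

    def __post_init__(self) -> None:
        if self.guaranteed_mem_mib <= 0:
            raise ValueError("guaranteed_mem_mib must be positive")
        if self.kill_threshold_mib < self.guaranteed_mem_mib:
            raise ValueError("kill threshold cannot be below the guaranteed allocation")

    @property
    def headroom_mib(self) -> int:
        """Room available for transient spikes before the kill threshold."""
        return self.kill_threshold_mib - self.guaranteed_mem_mib

    def to_k8s_resources(self) -> dict:
        """Render as a Kubernetes-style container `resources` mapping."""
        return {
            "requests": {"memory": f"{self.guaranteed_mem_mib}Mi"},
            "limits": {"memory": f"{self.kill_threshold_mib}Mi"},
        }


# Illustrative values only: a strict spec (floor == ceiling) and one with headroom.
strict = TaskResources(guaranteed_mem_mib=2048, kill_threshold_mib=2048)
relaxed = TaskResources(guaranteed_mem_mib=2048, kill_threshold_mib=6144)

print(strict.headroom_mib, relaxed.headroom_mib)
print(relaxed.to_k8s_resources())
```

Keeping the two values as separate fields makes the headroom explicit and reportable alongside the score, instead of leaving it implicit in whatever the enforcement layer happens to do with a single number.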
What failed first
Setting the guaranteed resource allocation equal to the hard kill threshold left zero headroom for transient memory spikes, causing spurious OOM kills for containers that would otherwise have succeeded.
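The failure mode can be illustrated with a toy model, assuming a container is killed the first time a memory sample exceeds its kill threshold. This is a simplification of real OOM behavior. The `first_oom_index` helper, the usage trace, and the 3x ceiling are all made up for illustration and do not reproduce Anthropic's measurements.

```python
from typing import Optional, Sequence


def first_oom_index(usage_mib: Sequence[int], kill_threshold_mib: int) -> Optional[int]:
    """Return the index of the first sample above the kill threshold, or None."""
    for i, used in enumerate(usage_mib):
        if used > kill_threshold_mib:
            return i
    return None


# Made-up trace in MiB: steady usage below 2048 with one brief spike at index 3.
trace = [1500, 1800, 1750, 2300, 1700, 1600]
guaranteed = 2048

# Ceiling equal to the floor: the spike at index 3 kills the container.
strict_kill = first_oom_index(trace, kill_threshold_mib=guaranteed)

# Ceiling above the floor (3x is an arbitrary illustrative choice): the task survives.
headroom_kill = first_oom_index(trace, kill_threshold_mib=guaranteed * 3)

print(strict_kill, headroom_kill)
```

In this toy trace the task's steady-state usage fits inside the guaranteed allocation throughout, so the kill under the strict setting reflects the missing headroom, not the difficulty of the task.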