Stripe builds a benchmark to evaluate AI agents on real Stripe integrations
There was no established way to measure whether AI agents could autonomously complete long-horizon, end-to-end Stripe integrations. The gap between LLM coding capability and the ability to manage a full software engineering project, which requires planning, persistent state management, and failure recovery, had not been quantified.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Task environment provided to agent
The agent receives a full coding environment with code, databases, scripts, and test Stripe API keys representing a typical integration starting repository.
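As a rough illustration of the environment described above, the sketch below models the starting state as a small record and checks it before an agent run begins. The `TaskEnvironment` type, its field names, and `preflight_problems` are hypothetical and are not taken from Stripe's benchmark. The only Stripe-specific fact relied on is that test-mode secret and restricted keys begin with `sk_test_` and `rk_test_`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskEnvironment:
    """Hypothetical description of what the agent is handed at the start."""
    repo_path: str                  # starting repository with the integration code
    database_url: str               # database the application reads and writes
    setup_scripts: tuple[str, ...]  # scripts for seeding data, running tests, etc.
    stripe_api_key: str             # must be a test-mode key, never a live one


def preflight_problems(env: TaskEnvironment) -> list[str]:
    """Return a list of reasons the environment is not safe or ready to use."""
    problems = []
    if not env.repo_path:
        problems.append("no starting repository")
    if not env.database_url:
        problems.append("no database configured")
    # Test-mode secret and restricted keys carry these prefixes.
    if not env.stripe_api_key.startswith(("sk_test_", "rk_test_")):
        problems.append("Stripe key is not a test-mode key")
    return problems
```

An empty list means the run can start; anything else should stop the run before the agent touches the API.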
Tools used
goose, Claude Opus 4.5, GPT-5.2
Outcome
Claude Opus 4.5 achieved a 92% average score across four full-stack tasks, and GPT-5.2 achieved 73% across two gym problem sets. Best-performing runs averaged 63 turns. Agents navigated UIs, debugged live issues, and handled underdocumented API behavior.
What failed first
Agents mishandled ambiguous situations by treating invalid API error responses as successful completions, and were occasionally unable to recover from browser interaction failures that a human engineer could have resolved trivially.
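One way to guard against the first failure mode is to classify every API response explicitly before a step is marked complete. The sketch below is a minimal illustration under stated assumptions, not the benchmark's grading logic: `StepResult` and `classify_response` are hypothetical names, and the function operates on an already-parsed JSON body rather than calling any Stripe library. It relies only on the documented convention that Stripe error responses carry a top-level `error` object and a non-2xx HTTP status.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class StepResult:
    ok: bool
    reason: str


def classify_response(status_code: int, body: Mapping[str, Any]) -> StepResult:
    """Decide whether an API call succeeded; never default to success."""
    error = body.get("error")
    if error is not None:
        # An error payload always means failure, whatever the status code says.
        if isinstance(error, Mapping):
            kind = error.get("type", "unknown_error")
            message = error.get("message", "no message")
            return StepResult(False, f"{kind}: {message}")
        return StepResult(False, f"error: {error}")
    if not 200 <= status_code < 300:
        return StepResult(False, f"HTTP {status_code} without an error body")
    return StepResult(True, "success")
```

An agent loop would only advance to the next step when `ok` is true, and would otherwise feed `reason` back into its planning.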