Stripe builds a benchmark to evaluate AI agents on real Stripe integrations

There was no established way to measure whether AI agents could autonomously complete long-horizon, end-to-end Stripe integrations. The gap between LLM coding capability and the ability to manage a full software engineering project, which requires planning, persistent state management, and failure recovery, had not been quantified.

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Task environment provided to agent
The agent receives a full coding environment with code, databases, scripts, and test Stripe API keys representing a typical integration starting repository.
Tools used
goose, Claude Opus 4.5, GPT-5.2
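
Because the environment ships with test Stripe API keys, a harness may want to confirm that no live key reaches the agent. The sketch below is an illustration, not part of Stripe's benchmark: the function name `require_test_key` is hypothetical, and the check relies only on Stripe's key prefixes (`sk_test_` for secret keys, `rk_test_` for restricted keys).

```python
# Hypothetical guard: refuse to start an agent run unless the key is a test-mode key.
TEST_KEY_PREFIXES = ("sk_test_", "rk_test_")


def require_test_key(key: str) -> str:
    """Return the key unchanged if it is a test-mode key, else raise."""
    if not key.startswith(TEST_KEY_PREFIXES):
        # Avoid echoing the key itself into logs or error messages.
        raise ValueError("expected a Stripe test-mode key (sk_test_ or rk_test_ prefix)")
    return key
```

A harness could call this once while building the sandbox, so a misconfigured environment fails before the agent makes any API call.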
Outcome

Claude Opus 4.5 achieved a 92% average score across four full-stack tasks, and GPT-5.2 achieved 73% across two gym problem sets. Best-performing runs averaged 63 turns. Agents navigated UIs, debugged live issues, and handled underdocumented API behavior.

What failed first

Agents mishandled ambiguous situations by treating invalid API error responses as successful completions, and were occasionally unable to recover from browser interaction failures that a human engineer could have resolved trivially.
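
One way to guard against the first failure mode is to check every API response explicitly before declaring a step done. The following is a minimal sketch under assumptions, not the benchmark's grader: `call_succeeded` and `describe_failure` are hypothetical helper names, and the only Stripe-specific fact relied on is that failed requests return a non-2xx status with a top-level `error` object in the JSON body.

```python
from typing import Any, Mapping


def call_succeeded(status_code: int, body: Mapping[str, Any]) -> bool:
    """Treat a call as successful only if the status is 2xx and no error object is present."""
    if not 200 <= status_code < 300:
        return False
    # Defensive check: an error payload should never count as completion.
    return "error" not in body


def describe_failure(body: Mapping[str, Any]) -> str:
    """Summarize the error object so the agent can act on it instead of moving on."""
    err = body.get("error") or {}
    return f"{err.get('type', 'unknown_error')}: {err.get('message', 'no message')}"


ok = call_succeeded(200, {"id": "cus_123", "object": "customer"})
bad_body = {"error": {"type": "invalid_request_error", "message": "Missing required param"}}
failed = not call_succeeded(400, bad_body)
summary = describe_failure(bad_body)
```

Surfacing the error type and message gives the agent something concrete to retry or repair, rather than recording the step as complete.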

Results
Volume: 73%
Cost replaced: 92%
Source

https://stripe.com/blog/can-ai-agents-build-real-stripe-integrations

Grounding & classification
Source type: technical build writeup
19 fields verified against source quotes, 3 dropped as unverifiable.
Tags: agentic workflow, ai agent, code generation, code diff pr, failure mode described, metric backed, vendor confirmed, financial services, software, accuracy improvement, technical build writeup, back office ops, agentic task execution