
Stripe builds a benchmark to evaluate AI agents on real Stripe integrations

There was no established way to measure whether AI agents could autonomously complete long-horizon, end-to-end Stripe integrations. The gap between an LLM's coding capability and its ability to manage a full software engineering project, which requires planning, persistent state management, and failure recovery, had not been quantified.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · Task environment provided to agent

The agent receives a full coding environment with code, databases, scripts, and test Stripe API keys representing a typical integration starting repository.
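Because the environment ships with test Stripe API keys, a harness can check the key mode before an agent run starts. The following is a minimal sketch, not something described in the source: the function name `require_test_key` and the environment variable `STRIPE_SECRET_KEY` are assumptions chosen for illustration. It relies only on the fact that Stripe test-mode secret and restricted keys begin with `sk_test_` and `rk_test_`.

```python
import os


def require_test_key(env_var: str = "STRIPE_SECRET_KEY") -> str:
    """Return the API key from the environment, refusing anything that is not a test key.

    The variable name is a hypothetical default; adjust it to the repository's convention.
    """
    key = os.environ.get(env_var, "")
    # Stripe test-mode secret keys start with "sk_test_", restricted test keys with "rk_test_".
    if not key.startswith(("sk_test_", "rk_test_")):
        raise RuntimeError(f"{env_var} is missing or is not a Stripe test-mode key")
    return key
```

Calling `require_test_key()` once at startup means a misconfigured environment is rejected before the agent makes its first API call.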

Tools used

goose, Claude Opus 4.5, GPT-5.2

Outcome

Claude Opus 4.5 achieved a 92% average score across four full-stack tasks, and GPT-5.2 achieved 73% across two gym problem sets. Best-performing runs averaged 63 turns. Agents navigated UIs, debugged live issues, and handled underdocumented API behavior.

What failed first

Agents mishandled ambiguous situations by treating invalid API error responses as successful completions, and were occasionally unable to recover from browser interaction failures that a human engineer could have resolved trivially.
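The first failure mode suggests an explicit success check on every API response. The sketch below is illustrative only and is not how the benchmark or the agents worked according to the source. The helper name `call_succeeded` is hypothetical. It relies on the documented Stripe convention that failed requests return a 4xx or 5xx status and a JSON body containing an `error` object.

```python
from typing import Any, Mapping


def call_succeeded(status_code: int, body: Mapping[str, Any]) -> bool:
    """Return True only for a 2xx response whose body carries no error object."""
    if not 200 <= status_code < 300:
        return False
    # Defensive second check: an "error" key means the step did not complete.
    return "error" not in body
```

For example, `call_succeeded(402, {"error": {"type": "card_error"}})` is false, so a step that received this response would be retried or reported as failed rather than marked complete.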

Results

Volume: 73%

Cost replaced: 92%

Source

https://stripe.com/blog/can-ai-agents-build-real-stripe-integrations


Grounding & classification

Source type: technical build writeup

19 fields verified against source quotes, 3 dropped as unverifiable.

agentic workflow, ai agent, code generation, code diff pr, failure mode described, metric backed, vendor confirmed, financial services, software, accuracy improvement, technical build writeup, back office ops, agentic task execution