
DABstep: Adyen and Hugging Face benchmark multi-step reasoning for data analysis agents

Real-world data analysis requires multi-step reasoning, domain knowledge, and iterative code execution, but benchmarks that properly evaluate AI agents on such tasks are lacking, and that gap hinders progress in the field.

How it works

Common implementation structure

This section describes how this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · Task posed to agent

A question poses a data analysis challenge to the AI agent and includes a difficulty level and answer-format guidelines.
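As a rough illustration of this stage, the sketch below bundles a question, a difficulty level, and answer-format guidelines into one record and renders it as a prompt. It is a minimal sketch, not the benchmark's actual schema: the `Task` class, its field names, the `build_prompt` helper, the difficulty label, and the sample question are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; field names are assumptions, not the benchmark's schema."""
    question: str
    difficulty: str          # assumed label such as "easy" or "hard"
    answer_guidelines: str   # how the final answer should be formatted


def build_prompt(task: Task) -> str:
    """Render a task as a single prompt string to hand to an agent."""
    return (
        f"Question: {task.question}\n"
        f"Difficulty: {task.difficulty}\n"
        f"Answer format: {task.answer_guidelines}"
    )


# Made-up example task, for illustration only.
example = Task(
    question="Which card scheme has the highest average fee?",
    difficulty="easy",
    answer_guidelines="Answer with the scheme name only.",
)
prompt = build_prompt(example)
print(prompt)
```

Keeping the answer-format guidelines in the prompt itself means the agent's final answer can be compared against a reference without extra parsing rules on the evaluation side.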

Tools used

smolagents, o3-mini, DeepSeek R1, Claude Sonnet, DeepSeek V3

Outcome

DABstep was released with over 450 real-world tasks from Adyen's workloads; current best-performing AI agents achieve only 16% accuracy, revealing a significant gap between current AI capability and human-level data analysis.

What failed first

Existing benchmarks were inadequate: DS-1000 tasks are single-shot without real datasets; DS Bench is Excel-based and uses GPT-4 as evaluator, introducing bias; benchmarks like GAIA, MATH, and SimpleQA can be answered with single-shot code generation.

Results

Volume: 16%

Cost replaced: 12%

Source

https://medium.com/adyen/data-agent-benchmark-for-multi-step-reasoning-dabstep-70e913c339dc


Grounding & classification

Source type: technical build writeup

29 fields verified against source quotes.

agentic workflow, code generation, data extraction, knowledge base, failure mode described, metric backed, named customer, tools described, workflow described, financial services, software, accuracy improvement, employee productivity, technical build writeup, back office ops, finance ops, agentic task execution