finance_ops · workflow
DABstep: Adyen and Hugging Face benchmark multi-step reasoning for data analysis agents
Real-world data analysis requires multi-step reasoning, domain knowledge, and iterative code execution, but benchmarks that properly evaluate AI agents on such tasks are lacking, which hinders progress in the field.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Task posed to agent
Each task poses a data analysis question to the AI agent, together with a difficulty level and answer format guidelines.
Tools used
smolagents, o3-mini, DeepSeek R1, Claude Sonnet, DeepSeek V3
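The task structure described in Stage 1 can be sketched as a small data type plus a prompt renderer. This is a minimal illustration, not the benchmark's actual schema: the class name `AnalysisTask`, its field names, the difficulty label, and the sample question are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalysisTask:
    """Hypothetical container for one task; field names are assumptions."""
    task_id: str
    question: str            # the data analysis question posed to the agent
    difficulty: str          # illustrative label such as "easy" or "hard"
    answer_guidelines: str   # how the final answer should be formatted


def build_prompt(task: AnalysisTask) -> str:
    """Render the task as the text an agent would receive."""
    return (
        f"Question: {task.question}\n"
        f"Difficulty: {task.difficulty}\n"
        f"Answer format: {task.answer_guidelines}"
    )


# Made-up example task, for illustration only.
example = AnalysisTask(
    task_id="demo-1",
    question="Which card scheme had the highest average fee in the sample data?",
    difficulty="easy",
    answer_guidelines="Reply with the scheme name only.",
)
prompt = build_prompt(example)
```

Keeping the answer format guidelines in the prompt itself lets the agent produce an answer that can be checked without manual interpretation.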
Outcome
DABstep was released with over 450 real-world tasks from Adyen's workloads; current best-performing AI agents achieve only 16% accuracy, revealing a significant gap between current AI capability and human-level data analysis.
What failed first
Existing benchmarks were inadequate: DS-1000 tasks are single-shot without real datasets; DS Bench is Excel-based and uses GPT-4 as evaluator, introducing bias; benchmarks like GAIA, MATH, and SimpleQA can be answered with single-shot code generation.
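One way to avoid the evaluator bias noted above is to score answers deterministically instead of asking a model to judge them. The sketch below assumes normalized exact-match scoring; the source does not describe how DABstep actually scores answers, and the function names `normalize` and `accuracy` are hypothetical.

```python
def normalize(answer: str) -> str:
    """Lowercase, trim, and collapse internal whitespace before comparing."""
    return " ".join(answer.strip().lower().split())


def accuracy(predictions: dict[str, str], references: dict[str, str]) -> float:
    """Fraction of reference tasks whose prediction matches after normalization.

    A task with no prediction is counted as incorrect.
    """
    if not references:
        return 0.0
    correct = 0
    for task_id, reference in references.items():
        predicted = predictions.get(task_id)
        if predicted is not None and normalize(predicted) == normalize(reference):
            correct += 1
    return correct / len(references)


# Made-up answers, for illustration only.
references = {"t1": "Visa", "t2": "42", "t3": "NL"}
predictions = {"t1": "  visa ", "t2": "41"}
score = accuracy(predictions, references)
```

Because the comparison is a plain string check, the same predictions always produce the same score, which is the property an LLM-based evaluator cannot guarantee.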
Results
Volume: 16%
Cost replaced: 12%
Grounding & classification
Source type: technical build writeup
29 fields verified against source quotes.
agentic workflow, code generation, data extraction, knowledge base, failure mode described, metric backed, named customer, tools described, workflow described, financial services, software, accuracy improvement, employee productivity, technical build writeup, back office ops, finance ops, agentic task execution