DABstep: Data Agent Benchmark for Multi-step Reasoning

Journal: arXiv

Published Date: Jun 30, 2025

Abstract

We introduce DABstep, a novel benchmark for evaluating AI agents on realistic multi-step data analysis tasks. DABstep comprises over 450 real-world challenges derived from a financial analytics platform, requiring models to combine code-based data processing with contextual reasoning over heterogeneous documentation. Each task demands an iterative, multi-step problem-solving approach, testing capabilities in data manipulation, cross-referencing multiple sources, and precise result reporting. The benchmark provides a factoid-style answer format with automatic correctness checks for objective scoring at scale. We evaluate leading LLM-based agents, revealing a substantial performance gap: even the best agent achieves only 14.55% accuracy on the hardest tasks. We detail the benchmark's design, dataset composition, task formulation, and evaluation protocol, report baseline results, and analyze failure modes. DABstep is released with a public leaderboard and toolkit to accelerate research in autonomous data analysis.

Authors

Alex Egg
Martin Iglesias Goyanes
Friso Kingma
Andreu Mora
Leandro von Werra
Thomas Wolf

External Resources

View on arXiv (http://arxiv.org/abs/2506.23719v1)
