From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review
Journal: arXiv
Published Date: Apr 28, 2025
Abstract
Large language models and autonomous AI agents have evolved rapidly,
resulting in a diverse array of evaluation benchmarks, frameworks, and
collaboration protocols. However, the landscape remains fragmented, with no
unified taxonomy or comprehensive survey. We therefore present a side-by-side
comparison of benchmarks developed between 2019 and 2025 that evaluate these
models and agents across multiple domains. In addition, we propose a taxonomy
of approximately 60 benchmarks that cover general and academic knowledge
reasoning, mathematical problem-solving, code generation and software
engineering, factual grounding and retrieval, domain-specific evaluations,
multimodal and embodied tasks, task orchestration, and interactive assessments.
Furthermore, we review AI-agent frameworks introduced between 2023 and 2025
that integrate large language models with modular toolkits to enable autonomous
decision-making and multi-step reasoning. Moreover, we present real-world
applications of autonomous AI agents in materials science, biomedical research,
academic ideation, software engineering, synthetic data generation, chemical
reasoning, mathematical problem-solving, geographic information systems,
multimedia, healthcare, and finance. We then survey key agent-to-agent
collaboration protocols, namely the Agent Communication Protocol (ACP), the
Model Context Protocol (MCP), and the Agent-to-Agent Protocol (A2A). Finally,
we discuss recommendations for future research, focusing on advanced reasoning
strategies, failure modes in multi-agent LLM systems, automated scientific
discovery, dynamic tool integration via reinforcement learning, integrated
search capabilities, and security vulnerabilities in agent protocols.