FamilyTool: A Multi-hop Personalized Tool Use Benchmark
Journal: arXiv
Published Date: Apr 9, 2025
Abstract
The integration of tool learning with Large Language Models (LLMs) has
expanded their capabilities in handling complex tasks by leveraging external
tools. However, existing benchmarks for tool learning inadequately address
critical real-world personalized scenarios, particularly those requiring
multi-hop reasoning and inductive knowledge adaptation in dynamic environments.
To bridge this gap, we introduce FamilyTool, a novel benchmark grounded in a
family-based knowledge graph (KG) that simulates personalized, multi-hop tool
use scenarios. FamilyTool challenges LLMs with queries spanning 1 to 3
relational hops (e.g., inferring familial connections and preferences) and
incorporates an inductive KG setting in which models must adapt to unseen user
preferences and relationships without re-training; reliance on re-training is a
common limitation of prior approaches that compromises generalization. We
further propose KGETool, a
simple KG-augmented evaluation pipeline to systematically assess LLMs' tool use
ability in these settings. Experiments reveal significant performance gaps in
state-of-the-art LLMs, with accuracy dropping sharply as hop complexity
increases and inductive scenarios exposing severe generalization deficits.
These findings underscore the limitations of current LLMs in handling
personalized, evolving real-world contexts and highlight the urgent need for
advancements in tool-learning frameworks. FamilyTool serves as a critical
resource for evaluating and advancing LLM agents' reasoning, adaptability, and
scalability in complex, dynamic environments. Code and dataset are available on
GitHub.
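
To make the multi-hop setting described above concrete, the following is a minimal sketch of how a query might be resolved over a small family knowledge graph before a tool call is filled in. It is an illustration only, not the FamilyTool dataset format or the KGETool pipeline: the triples, the relation names, the tool name `search_restaurants`, and the helpers `build_index`, `follow_path`, and `build_tool_call` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy KG as (head, relation, tail) triples.
# These entities and relations are invented for illustration.
TRIPLES = [
    ("alice", "mother", "carol"),
    ("carol", "husband", "dave"),
    ("carol", "favorite_cuisine", "italian"),
    ("dave", "favorite_cuisine", "thai"),
]


def build_index(triples):
    """Map each (head, relation) pair to the list of its tails."""
    index = defaultdict(list)
    for head, relation, tail in triples:
        index[(head, relation)].append(tail)
    return index


def follow_path(index, start, relations):
    """Follow a chain of relations from `start`; one relation is one hop."""
    frontier = [start]
    for relation in relations:
        frontier = [
            tail
            for node in frontier
            for tail in index.get((node, relation), [])
        ]
    return frontier


def build_tool_call(index, user, relations, tool_name, arg_name):
    """Resolve an argument through the KG and wrap it in a tool call.

    Returns None when the relation path cannot be resolved.
    """
    values = follow_path(index, user, relations)
    if not values:
        return None
    return {"tool": tool_name, "arguments": {arg_name: values[0]}}


index = build_index(TRIPLES)

# 1 hop: who is alice's mother?
one_hop = follow_path(index, "alice", ["mother"])

# 3 hops: "find a restaurant my mother's husband would like".
three_hop_call = build_tool_call(
    index,
    "alice",
    ["mother", "husband", "favorite_cuisine"],
    tool_name="search_restaurants",  # hypothetical tool
    arg_name="cuisine",
)

# Inductive-style update: a new fact arrives at query time and is used
# immediately, with no re-training step involved.
index[("alice", "brother")].append("erin")
index[("erin", "favorite_cuisine")].append("korean")
unseen_call = build_tool_call(
    index, "alice", ["brother", "favorite_cuisine"],
    tool_name="search_restaurants", arg_name="cuisine",
)
```

In this sketch each additional relation in the path is one more hop, so an error at any hop propagates into the final tool argument, which is one plausible reading of why accuracy would fall as hop count grows. The last part mimics the inductive setting only loosely: facts added after the index was built are picked up at lookup time.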