BioHopR: A Benchmark for Multi-Hop, Multi-Answer Reasoning in Biomedical Domain
Journal: arXiv
Published Date: May 28, 2025
Abstract
Biomedical reasoning often requires traversing interconnected relationships
across entities such as drugs, diseases, and proteins. Despite the increasing
prominence of large language models (LLMs), existing benchmarks lack the
ability to evaluate multi-hop reasoning in the biomedical domain, particularly
for queries involving one-to-many and many-to-many relationships. This gap
leaves the critical challenges of biomedical multi-hop reasoning underexplored.
To address this, we introduce BioHopR, a novel benchmark designed to evaluate
multi-hop, multi-answer reasoning in structured biomedical knowledge graphs.
Built from the comprehensive PrimeKG, BioHopR includes 1-hop and 2-hop
reasoning tasks that reflect real-world biomedical complexities.
Evaluations of state-of-the-art models reveal that o3-mini, a proprietary
reasoning-focused model, achieves 37.93% precision on 1-hop tasks and 14.57% on
2-hop tasks, outperforming other proprietary models such as GPT-4o and
open-source models, including the biomedical HuatuoGPT-o1-70B and the
general-purpose Llama-3.3-70B. However, all
models exhibit significant declines in multi-hop performance, underscoring the
challenges of resolving implicit reasoning steps in the biomedical domain. By
addressing the lack of benchmarks for multi-hop reasoning in the biomedical domain,
BioHopR sets a new standard for evaluating reasoning capabilities and
highlights critical gaps between proprietary and open-source models while
paving the way for future advancements in biomedical LLMs.
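
To make the task setup concrete, the following is a minimal sketch of what 1-hop and 2-hop multi-answer queries over a knowledge graph might look like, together with one possible set-based precision score. Everything in it is an assumption made for illustration: the triples are invented placeholders rather than PrimeKG data, the names build_index, one_hop, two_hop, and precision are hypothetical, and the scoring rule (fraction of predicted answers found in the gold set) is not necessarily the metric the benchmark itself uses.

```python
from collections import defaultdict

# Toy (head, relation, tail) triples. Invented placeholders, NOT PrimeKG data.
TRIPLES = [
    ("drug_A", "treats", "disease_X"),
    ("drug_A", "treats", "disease_Y"),
    ("drug_B", "treats", "disease_Y"),
    ("disease_X", "associated_with", "protein_P"),
    ("disease_Y", "associated_with", "protein_P"),
    ("disease_Y", "associated_with", "protein_Q"),
]


def build_index(triples):
    """Map each (head, relation) pair to the set of all its tails."""
    index = defaultdict(set)
    for head, relation, tail in triples:
        index[(head, relation)].add(tail)
    return index


def one_hop(index, head, relation):
    """Return every tail reachable in one step (a one-to-many answer set)."""
    return set(index.get((head, relation), set()))


def two_hop(index, head, rel1, rel2):
    """Follow rel1, then rel2, and take the union over all intermediate entities."""
    answers = set()
    for intermediate in one_hop(index, head, rel1):
        answers |= one_hop(index, intermediate, rel2)
    return answers


def precision(predicted, gold):
    """Assumed metric: share of predicted answers that appear in the gold set."""
    predicted = set(predicted)
    if not predicted:
        return 0.0
    return len(predicted & gold) / len(predicted)


INDEX = build_index(TRIPLES)
gold_1hop = one_hop(INDEX, "drug_A", "treats")
gold_2hop = two_hop(INDEX, "drug_A", "treats", "associated_with")

# A hypothetical model answer containing one correct and one incorrect entity.
score_2hop = precision(["protein_P", "protein_R"], gold_2hop)
```

In this toy graph the 2-hop query never names the intermediate diseases, so a model must resolve them implicitly before it can list the proteins, which is the kind of hidden reasoning step the abstract refers to.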