MetaMuse: A Multi-Agent AI System for Biomedical Metadata Curation and Harmonization
Journal: bioRxiv
Published Date: Apr 15, 2026
Abstract
Inconsistent and unstructured metadata in public biomedical repositories, such as the Gene Expression Omnibus (GEO), severely limits data discoverability and research reproducibility. To address this, we introduce MetaMuse, a modular, multi-agent artificial intelligence framework designed to autonomously extract, validate, and standardize unstructured biomedical metadata. MetaMuse operates through a three-stage architecture built on large language model (LLM) agents. First, specialized CuratorAgents contextually extract candidate values for specific target metadata fields. Second, a centralized ArbitratorAgent enforces cross-field logical consistency to prevent contradictory annotations. Finally, a NormalizerAgent uses a domain-specific semantic search model (SapBERT) to map these free-text candidates to formal ontology terms. We evaluated MetaMuse on a gold-standard dataset of manually curated GEO samples, achieving over 95% curation accuracy across key target metadata fields, and demonstrated robust scalability on a broader dataset of 400 samples. Notably, MetaMuse avoids hallucinated annotations by defaulting to conservative false negatives when evidence is ambiguous, thereby preserving strict data integrity. By providing a fully auditable and context-aware curation pipeline, MetaMuse offers a scalable solution for enriching public data repositories and accelerating reproducible, data-driven scientific discovery.
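To make the three-stage flow concrete, here is a minimal Python sketch of a curate, arbitrate, normalize pipeline. It is an illustration under stated assumptions, not the MetaMuse implementation. The function names (`curate`, `arbitrate`, `normalize`, `run_pipeline`), the toy ontology with its placeholder `TOY:` identifiers, the single incompatibility rule, and the 0.8 threshold are all hypothetical. A regular expression stands in for the LLM-based CuratorAgents, and string similarity from the standard library stands in for SapBERT embedding search.

```python
import re
from difflib import SequenceMatcher
from typing import Dict, Optional

# Hypothetical toy ontology with placeholder identifiers (not real ontology IDs).
ONTOLOGY = {
    "tissue": {"liver": "TOY:0001", "lung": "TOY:0002", "ovary": "TOY:0003"},
    "sex": {"female": "TOY:0101", "male": "TOY:0102"},
}

# Hypothetical cross-field rule: these two annotations cannot both hold.
INCOMPATIBLE = [(("sex", "male"), ("tissue", "ovary"))]


def curate(field: str, raw: str) -> Optional[str]:
    """Stage 1 stand-in for a CuratorAgent: extract 'field: value' from text.

    The paper uses LLM agents here; a regex is only a placeholder.
    """
    match = re.search(rf"\b{re.escape(field)}\s*:\s*([^;]+)", raw, flags=re.IGNORECASE)
    return match.group(1).strip() if match else None


def arbitrate(candidates: Dict[str, Optional[str]]) -> Dict[str, Optional[str]]:
    """Stage 2 stand-in for the ArbitratorAgent: blank out contradictory fields.

    Dropping both values is an assumed resolution policy, chosen to stay conservative.
    """
    checked = dict(candidates)
    for (f1, v1), (f2, v2) in INCOMPATIBLE:
        a, b = candidates.get(f1), candidates.get(f2)
        if a and b and a.lower() == v1 and b.lower() == v2:
            checked[f1] = checked[f2] = None
    return checked


def normalize(field: str, value: Optional[str], threshold: float = 0.8) -> Optional[str]:
    """Stage 3 stand-in for the NormalizerAgent: map free text to an ontology ID.

    Returns None (a conservative false negative) when no label is similar enough.
    """
    if value is None:
        return None
    best_label, best_score = None, 0.0
    for label in ONTOLOGY.get(field, {}):
        score = SequenceMatcher(None, value.lower(), label).ratio()
        if score > best_score:
            best_label, best_score = label, score
    if best_label is None or best_score < threshold:
        return None
    return ONTOLOGY[field][best_label]


def run_pipeline(raw: str, fields=("tissue", "sex")) -> Dict[str, Optional[str]]:
    """Run the three stages in order for each target field."""
    candidates = {f: curate(f, raw) for f in fields}
    candidates = arbitrate(candidates)
    return {f: normalize(f, v) for f, v in candidates.items()}


if __name__ == "__main__":
    print(run_pipeline("tissue: Liver; sex: female"))
    print(run_pipeline("tissue: ovary; sex: male"))
    print(run_pipeline("tissue: unknown sample"))
```

The key design choice mirrored from the abstract is abstention: both the arbitration and normalization steps return `None` rather than guess when values conflict or no ontology label matches closely enough.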