h5adify: neuro-symbolic metadata harmonization enables scalable AnnData integration with local large language models
Journal: bioRxiv
Published Date: Mar 3, 2026
Abstract
Background: The rapid growth of public single-cell and spatial transcriptomics repositories has shifted the main bottleneck for atlas-scale integration from data generation to metadata heterogeneity. Even when datasets are released in the AnnData H5AD format, inconsistent column naming, partial annotations, and mixed gene identifier conventions frequently prevent reproducible merging, downstream benchmarking, and reuse in foundation model training. Automated approaches that resolve semantic inconsistency while preserving biological validity are therefore essential for scalable data reuse.
Results: We present h5adify, a neuro-symbolic toolkit that combines deterministic biological inference with locally deployed large language models to transform heterogeneous AnnData objects into schema-normalized, integration-ready representations. The framework performs metadata field discovery, gene identifier harmonization, optional paper-aware extraction, and consensus resolution with explicit uncertainty logging. Benchmarking four open-weight model families deployed through Ollama (Gemma, Llama, Mistral, and Qwen) demonstrates that small local models achieve high semantic accuracy in metadata resolution with low hallucination rates and modest computational requirements. In controlled simulations introducing annotation noise into single-cell and Visium-like datasets, harmonization improves integration benchmarking and reduces spurious batch effects. Application to sex-annotated glioblastoma datasets recovers biologically coherent microenvironmental patterns and cell type-specific genomic differences not explained by differential expression alone.
Conclusions: Together, h5adify provides a reproducible framework for evaluating LLM-assisted biocuration and enables scalable, privacy-preserving metadata harmonization for modern single-cell atlases and foundation model pipelines.
These results demonstrate that modular neuro-symbolic integration of deterministic biological inference and small local language models can effectively resolve semantic heterogeneity while remaining computationally accessible.
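To make the idea of deterministic metadata field discovery with uncertainty logging more concrete, the following is a minimal sketch in plain Python. It is not the h5adify API: the alias table `CANONICAL_ALIASES` and the functions `normalize_name` and `resolve_columns` are hypothetical names introduced only for illustration, and the alias lists are assumptions rather than the toolkit's actual schema. The sketch maps raw column names to canonical fields by rule and records anything it cannot resolve, which is the kind of case a hybrid design could then pass to a local language model.

```python
import re

# Hypothetical canonical schema: canonical field -> accepted aliases.
# These alias sets are illustrative assumptions, not h5adify's schema.
CANONICAL_ALIASES = {
    "sex": {"sex", "gender", "donor_sex"},
    "cell_type": {"cell_type", "celltype", "cell_annotation"},
    "batch": {"batch", "sample_id", "library_id"},
}


def normalize_name(name):
    """Lowercase a column name and collapse non-alphanumeric runs to '_'."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


def resolve_columns(columns):
    """Map raw column names to canonical fields by exact alias lookup.

    Returns (mapping, log). `mapping` holds raw name -> canonical field.
    `log` lists (raw name, reason) for every column left unresolved, so
    that uncertain cases are recorded rather than silently dropped.
    """
    mapping, log = {}, []
    claimed = {}  # canonical field -> first raw column that claimed it
    for col in columns:
        key = normalize_name(col)
        hits = [c for c, aliases in CANONICAL_ALIASES.items() if key in aliases]
        if len(hits) != 1:
            log.append((col, "unresolved"))
            continue
        canon = hits[0]
        if canon in claimed:
            # Two raw columns compete for one canonical field: keep the
            # first and log the second instead of guessing.
            log.append((col, f"conflict with {claimed[canon]!r} for {canon!r}"))
            continue
        claimed[canon] = col
        mapping[col] = canon
    return mapping, log


if __name__ == "__main__":
    raw = ["Gender", "CellType", "Sample ID", "donor_sex", "n_counts"]
    mapping, log = resolve_columns(raw)
    print(mapping)
    print(log)
```

In this sketch, the rule-based pass handles the unambiguous columns, while conflicts and unknown names end up in the log for a second resolution step. With an AnnData object, the resulting mapping could be applied with `adata.obs.rename(columns=mapping)`, assuming the columns live in `adata.obs`.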