h5adify: neuro-symbolic metadata harmonization enables scalable AnnData integration with local large language models

Journal: bioRxiv

Published Date: Mar 3, 2026

Abstract

Background: The rapid growth of public single-cell and spatial transcriptomics repositories has shifted the main bottleneck for atlas-scale integration from data generation to metadata heterogeneity. Even when datasets are released in the AnnData H5AD format, inconsistent column naming, partial annotations, and mixed gene identifier conventions frequently prevent reproducible merging, downstream benchmarking, and reuse in foundation model training. Automated approaches that resolve semantic inconsistency while preserving biological validity are therefore essential for scalable data reuse.

Results: We present h5adify, a neuro-symbolic toolkit that combines deterministic biological inference with locally deployed large language models to transform heterogeneous AnnData objects into schema-normalized, integration-ready representations. The framework performs metadata field discovery, gene identifier harmonization, optional paper-aware extraction, and consensus resolution with explicit uncertainty logging. Benchmarking four open-weight model families deployed through Ollama (Gemma, Llama, Mistral, and Qwen) demonstrates that small local models achieve high semantic accuracy in metadata resolution with low hallucination rates and modest computational requirements. In controlled simulations introducing annotation noise into single-cell and Visium-like datasets, harmonization improves integration benchmarking and reduces spurious batch effects. Application to sex-annotated glioblastoma datasets recovers biologically coherent microenvironmental patterns and cell type-specific genomic differences not explained by differential expression alone.

Conclusions: Together, h5adify provides a reproducible framework for evaluating LLM-assisted biocuration and enables scalable, privacy-preserving metadata harmonization for modern single-cell atlases and foundation model pipelines. These results demonstrate that modular neuro-symbolic integration of deterministic biological inference and small local language models can effectively resolve semantic heterogeneity while remaining computationally accessible.
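To make the idea of metadata field discovery with uncertainty logging concrete, the following is a minimal sketch of the deterministic half of such a pipeline. It is not h5adify's API: the schema fields, the synonym table `CANONICAL_SYNONYMS`, and the functions `normalize` and `harmonize_columns` are all hypothetical names chosen for illustration. The sketch maps inconsistently named metadata columns onto canonical fields and records any column it cannot resolve, which is where a local language model or a curator could step in.

```python
import re

# Hypothetical canonical schema and synonym table; not taken from h5adify.
CANONICAL_SYNONYMS = {
    "sex": {"sex", "gender", "patient_sex", "donor_sex"},
    "cell_type": {"cell_type", "celltype", "cell_annotation", "annotation"},
    "batch": {"batch", "sample_id", "sample", "library_id"},
}


def normalize(name: str) -> str:
    """Lower-case a column name and collapse separators to underscores."""
    return re.sub(r"[^a-z0-9]+", "_", name.strip().lower()).strip("_")


def harmonize_columns(columns):
    """Map raw column names to canonical fields; collect unresolved names."""
    mapping, unresolved = {}, []
    for raw in columns:
        key = normalize(raw)
        match = next(
            (field for field, names in CANONICAL_SYNONYMS.items() if key in names),
            None,
        )
        if match is None:
            unresolved.append(raw)  # left for an LLM or manual review
        else:
            mapping[raw] = match
    return mapping, unresolved


# Example: column names as they might appear in the metadata of different datasets.
mapping, unresolved = harmonize_columns(
    ["Gender", "Cell-Type", "sample ID", "n_counts"]
)
print(mapping)
print(unresolved)
```

Keeping the unresolved names in a separate list, rather than guessing, mirrors the abstract's emphasis on explicit uncertainty logging: only the ambiguous cases need to be passed to a more expensive resolver.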

Authors

Rincon de la Rosa L.; Mouazer A.; Navidi M.; Degroodt E.; Künzle T.; Geny S.; Idbaih A.; Verrault M.; Labreche K.; Hernandez-Verdin I.; Alentorn A.

