Knowledge-guided Contextual Gene Set Analysis Using Large Language Models
Journal: arXiv
Published Date: Jun 4, 2025
Abstract
Gene set analysis (GSA) is a foundational approach for interpreting genomic
data of diseases by linking genes to biological processes. However,
conventional GSA methods overlook the clinical context of the analysis, often
generating long lists of enriched pathways with redundant, nonspecific, or
irrelevant results. Interpreting these lists requires extensive, ad hoc manual
effort, reducing both reliability and reproducibility. To address this
limitation, we introduce cGSA, a novel AI-driven framework that enhances GSA by
incorporating context-aware pathway prioritization. cGSA integrates gene
cluster detection, enrichment analysis, and large language models to identify
pathways that are not only statistically significant but also biologically
meaningful. Benchmarking on 102 manually curated gene sets across 19 diseases
and ten disease-related biological mechanisms shows that cGSA outperforms
baseline methods by over 30%, with expert validation confirming its increased
precision and interpretability. Two independent case studies in melanoma and
breast cancer further demonstrate its potential to uncover context-specific
insights and support targeted hypothesis generation.
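
To make the enrichment-analysis step concrete, the following is a minimal sketch of a standard over-representation test (a one-sided hypergeometric test) that ranks pathways by p-value. It is not taken from cGSA: the function names, the gene identifiers, and the toy pathway dictionary are all hypothetical, and the abstract does not state which enrichment statistic the framework uses. The cluster-detection and language-model prioritization steps are not shown.

```python
from math import comb


def hypergeom_enrichment_p(universe_size, pathway_size, query_size, overlap):
    """Return P(X >= overlap) for a hypergeometric draw.

    universe_size: number of genes in the background
    pathway_size:  number of background genes annotated to the pathway
    query_size:    number of genes in the query gene set
    overlap:       number of query genes annotated to the pathway
    """
    max_overlap = min(pathway_size, query_size)
    total = comb(universe_size, query_size)
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, query_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


def rank_pathways(query_genes, pathways, universe_size):
    """Rank pathways by enrichment p-value, smallest first.

    `pathways` maps a pathway name to its member genes (hypothetical input
    format chosen for this sketch). All genes are assumed to belong to the
    background universe.
    """
    query = set(query_genes)
    results = []
    for name, members in pathways.items():
        members = set(members)
        overlap = len(query & members)
        p = hypergeom_enrichment_p(universe_size, len(members), len(query), overlap)
        results.append((name, overlap, p))
    return sorted(results, key=lambda r: r[2])


if __name__ == "__main__":
    # Placeholder gene and pathway names, for illustration only.
    toy_pathways = {
        "pathway_A": ["g1", "g2", "g3"],
        "pathway_B": ["g7", "g8", "g9"],
    }
    for name, overlap, p in rank_pathways(["g1", "g2"], toy_pathways, universe_size=10):
        print(name, overlap, p)
```

A ranking like this is purely statistical. The limitation described above is that such a list carries no notion of the disease or mechanism under study, which is the gap a context-aware prioritization step is meant to fill.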