Encoding of pretrained large language models mirrors the genetic architectures of human psychological traits

Journal: medRxiv
Published Date:

Abstract

Recent advances in large language models (LLMs) have prompted a rush to use them as universal translators for biomedical terms. However, the black-box nature of LLMs has forced researchers to rely on artificially designed benchmarks without understanding what exactly LLMs encode. We demonstrate that pretrained LLMs can already explain up to 51% of the genetic correlation between items from a psychometrically validated neuroticism questionnaire, without any fine-tuning. For psychiatric diagnoses, we found that disorder names aligned better with genetic relationships than diagnostic descriptions did. Our results indicate that pretrained LLMs have encodings that mirror genetic architectures. These findings highlight LLMs’ potential for validating phenotypes, refining taxonomies, and integrating textual and genetic data in mental health research.
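
One way to make "explaining genetic correlation" concrete is to compare pairwise similarities between text embeddings of the items with the corresponding pairwise genetic correlations. The sketch below is illustrative only and is not the authors' pipeline: the abstract does not specify the similarity measure or the variance-explained statistic, so cosine similarity and squared Pearson correlation over unique item pairs are assumptions here. The function names, array shapes, and toy data are hypothetical, and the arrays are random numbers rather than real embeddings or genetic estimates.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of an (items x dims) array."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    return unit @ unit.T


def variance_explained(similarity, genetic_corr):
    """Squared Pearson correlation over unique item pairs (upper triangle).

    Assumption: R^2 between embedding similarity and genetic correlation is
    used as the "variance explained"; the paper may use a different statistic.
    """
    n_items = similarity.shape[0]
    upper = np.triu_indices(n_items, k=1)  # skip diagonal and duplicate pairs
    r = np.corrcoef(similarity[upper], genetic_corr[upper])[0, 1]
    return r ** 2


# Toy data only: random stand-ins, NOT real LLM embeddings or genetic estimates.
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(10, 64))  # 10 hypothetical items, 64 dims
toy_similarity = cosine_similarity_matrix(toy_embeddings)

# Synthetic "genetic correlation" matrix: the similarity plus symmetric noise.
noise = rng.normal(scale=0.05, size=toy_similarity.shape)
toy_genetic_corr = toy_similarity + (noise + noise.T) / 2
np.fill_diagonal(toy_genetic_corr, 1.0)

toy_r2 = variance_explained(toy_similarity, toy_genetic_corr)
print(f"Toy variance explained: {toy_r2:.2f}")
```

Restricting the comparison to the upper triangle keeps each item pair from being counted twice and excludes the trivial self-correlations on the diagonal, which would otherwise inflate the agreement.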

Authors

  • Bohan Xu; Nick Obradovich; Wenjie Zheng; Robert Loughnan; Lucy Shao; Masaya Misaki; Wesley K. Thompson; Martin Paulus; Chun Chieh Fan