On knowing a gene: A distributional hypothesis of gene function.

Journal: Cell Systems
PMID:

Abstract

Just as words can have multiple meanings that depend on sentence context, genes can have various functions that depend on the surrounding biological system. Efforts to capture this pleiotropic nature of gene function are limited by ontologies, which annotate gene functions without considering biological context. We contend that the gene function problem in genetics may be informed by recent technological leaps in natural language processing, in which representations of word semantics can be automatically learned from diverse language contexts. In contrast to efforts in the 1990s to model semantics as "is-a" relationships, modern distributional semantics represents words as vectors in a learned semantic space and fuels current advances in transformer-based models such as large language models and generative pre-trained transformers. A similar shift toward thinking of gene functions as distributions over cellular contexts may enable a comparable breakthrough in data-driven learning from large biological datasets to inform gene function.

Authors

  • Jason J Kwon
    Dana-Farber Cancer Institute and Harvard Medical School, Department of Medical Oncology, Boston, MA 02215, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
  • Joshua Pan
    Dana-Farber Cancer Institute and Harvard Medical School, Department of Medical Oncology, Boston, MA 02215, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
  • Guadalupe Gonzalez
    Department of Surgery and Cancer, Faculty of Medicine, Imperial College London, London, SW7 2AZ, UK.
  • William C Hahn
    Dana-Farber Cancer Institute, Boston, MA, USA.
  • Marinka Zitnik
    Department of Computer Science, Stanford University.