Simulation of adaptive immune receptors and repertoires with complex immune information to guide the development and benchmarking of AIRR machine learning.

Journal: Nucleic acids research
PMID:

Abstract

Machine learning (ML) has shown great potential in the adaptive immune receptor repertoire (AIRR) field. However, there is a lack of large-scale ground-truth experimental AIRR data suitable for AIRR-ML-based disease diagnostics and therapeutics discovery. Simulated ground-truth AIRR data are required to complement the development and benchmarking of robust and interpretable AIRR-ML methods where experimental data are currently inaccessible or insufficient. The challenge in making simulated data useful is incorporating key features observed in experimental repertoires. These features, such as antigen- or disease-associated immune information, make AIRR-ML problems challenging. Here, we introduce LIgO, a software suite that simulates AIRR data for the development and benchmarking of AIRR-ML methods. LIgO incorporates different types of immune information on both the receptor and the repertoire level and preserves a native-like generation probability distribution. Additionally, LIgO assists users in determining the computational feasibility of their simulations. We show two examples where LIgO supports the development and validation of AIRR-ML methods: (i) how individuals carrying out-of-distribution immune information impact receptor-level prediction performance and (ii) how immune information co-occurring in the same AIRs impacts the performance of conventional receptor-level encoding and repertoire-level classification approaches. LIgO guides the advancement and assessment of interpretable AIRR-ML methods.

Authors

  • Maria Chernigovskaya
    Department of Immunology, Oslo University Hospital Rikshospitalet and University of Oslo, Norway.
  • Milena Pavlovic
    UiO: RealArt Convergence Environment, University of Oslo, Oslo, Norway.
  • Chakravarthi Kanduri
    Centre for Bioinformatics, Department of Informatics, University of Oslo, Oslo 0373, Norway.
  • Sofie Gielis
    Adrem Data Lab, Department of Computer Science, University of Antwerp, Antwerp, Belgium; Antwerp Unit for Data Analysis and Computation in Immunology and Sequencing (AUDACIS), University of Antwerp, Antwerp, Belgium; Biomedical Informatics Research Network Antwerp (Biomina), University of Antwerp, Antwerp, Belgium.
  • Philippe A Robert
    Department of Immunology, Oslo University Hospital Rikshospitalet and University of Oslo, Norway.
  • Lonneke Scheffer
    Department of Informatics, University of Oslo, Oslo, Norway.
  • Andrei Slabodkin
    Department of Immunology, Oslo University Hospital Rikshospitalet and University of Oslo, Norway.
  • Ingrid Hobæk Haff
    Department of Mathematics, University of Oslo, Oslo, 0851, Norway.
  • Pieter Meysman
    Adrem Data Lab, Department of Computer Science, University of Antwerp, Antwerp, Belgium; Antwerp Unit for Data Analysis and Computation in Immunology and Sequencing (AUDACIS), University of Antwerp, Antwerp, Belgium; Biomedical Informatics Research Network Antwerp (Biomina), University of Antwerp, Antwerp, Belgium.
  • Gur Yaari
    Bioengineering, Faculty of Engineering, Bar Ilan University, Ramat Gan, Israel.
  • Geir Kjetil Sandve
    UiO: RealArt Convergence Environment, University of Oslo, Oslo, Norway.
  • Victor Greiff
    Department of Immunology, Oslo University Hospital, Oslo, Norway.