Enhancing TCR-Peptide Interaction Prediction with Pretrained Language Models and Molecular Representations
Journal: arXiv
Published Date: Apr 22, 2025
Abstract
Understanding the binding specificity between T-cell receptors (TCRs) and
peptide-major histocompatibility complexes (pMHCs) is central to immunotherapy
and vaccine development. However, current predictive models struggle with
generalization, especially in data-scarce settings and when faced with novel
epitopes. We present LANTERN (Large lAnguage model-powered TCR-Enhanced
Recognition Network), a deep learning framework that combines large-scale
protein language models with chemical representations of peptides. By encoding
TCR β-chain sequences using ESM-1b and transforming peptide sequences
into SMILES strings processed by MolFormer, LANTERN captures rich biological
and chemical features critical for TCR-peptide recognition. Through extensive
benchmarking against existing models such as ChemBERTa, TITAN, and NetTCR,
LANTERN demonstrates superior performance, particularly in zero-shot and
few-shot learning scenarios. The model also benefits from a robust negative
sampling strategy, and embedding analysis shows significantly improved
clustering. These results highlight the potential of LANTERN to advance TCR-pMHC
binding prediction and support the development of personalized immunotherapies.
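
To make the peptide-to-SMILES step concrete, here is a minimal pure-Python sketch of how an amino-acid sequence could be written out as a SMILES string before being passed to a chemical language model. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not say which tool performs the conversion. The names `SIDE_CHAINS` and `peptide_to_smiles` are hypothetical. The sketch covers only the 20 standard residues, uses neutral forms with free termini, and omits stereochemistry.

```python
# Hypothetical sketch: linear peptide -> SMILES (no stereochemistry).
# Assumption: this is NOT the conversion used by LANTERN; the source does
# not name the tool. Only the 20 standard amino acids are handled.

# Side-chain fragments attached to the alpha carbon (neutral forms).
SIDE_CHAINS = {
    "A": "C",
    "V": "C(C)C",
    "L": "CC(C)C",
    "I": "C(C)CC",
    "M": "CCSC",
    "F": "Cc1ccccc1",
    "W": "Cc1c[nH]c2ccccc12",
    "Y": "Cc1ccc(O)cc1",
    "S": "CO",
    "T": "C(C)O",
    "C": "CS",
    "N": "CC(N)=O",
    "Q": "CCC(N)=O",
    "D": "CC(=O)O",
    "E": "CCC(=O)O",
    "K": "CCCCN",
    "R": "CCCNC(N)=N",
    "H": "Cc1cnc[nH]1",
}


def residue_fragment(aa: str) -> str:
    """Return the backbone-plus-side-chain fragment for one residue."""
    if aa == "G":
        # Glycine has no side chain.
        return "NCC(=O)"
    if aa == "P":
        # Proline's side chain closes a ring onto the backbone nitrogen.
        return "N1CCCC1C(=O)"
    if aa not in SIDE_CHAINS:
        raise ValueError(f"unsupported residue: {aa!r}")
    return f"NC({SIDE_CHAINS[aa]})C(=O)"


def peptide_to_smiles(sequence: str) -> str:
    """Join residue fragments with peptide bonds; cap the C-terminus with OH."""
    if not sequence:
        raise ValueError("empty peptide sequence")
    return "".join(residue_fragment(aa) for aa in sequence.upper()) + "O"


if __name__ == "__main__":
    # Glycyl-alanine as a tiny worked example.
    print(peptide_to_smiles("GA"))  # NCC(=O)NC(C)C(=O)O
```

Writing the peptide at the atomic level exposes chemical detail (rings, heteroatoms, charged groups) that a one-letter amino-acid code hides, which is the motivation the abstract gives for pairing a molecular encoder with a protein language model. In practice a cheminformatics toolkit would normally be used to produce canonical, stereo-aware SMILES.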