AbSet: A Standardized Data Set of Antibody Structures for Machine Learning Applications.

Journal: Journal of chemical information and modeling
Published Date:

Abstract

Machine learning (ML) algorithms trained on data sets of antibody sequences and/or structures have played a fundamental role in the development of therapeutic antibodies. However, structural data sets remain limited, especially those that include antibody-antigen complexes. Additionally, many of the available structures are not standardized, and antibody-specific databases often do not provide molecular descriptors that could enhance ML models. To address this gap, we introduce AbSet, a curated data set comprising over 800,000 antibody structures and corresponding molecular descriptors, including both experimentally determined and in silico-generated antibody-antigen complexes. We systematically retrieved antibody structures from the Protein Data Bank (PDB), applied rigorous standardization protocols, and expanded the data set through large-scale protein-protein docking to generate structural variants of antibody-antigen interactions. Each docking model was classified as high, medium, acceptable, or incorrect quality based on its structural similarity to the reference experimental complex. This classification enables both the construction of a decoy set of confirmed non-binders and the generation of high-confidence augmented structural data for ML applications. AbSet is publicly available via the Zenodo repository, with accompanying scripts hosted on GitHub (https://github.com/SFBBGroup/AbSet.git).
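
The four-level quality classification described above can be illustrated with a minimal sketch. The abstract does not state which similarity metric or cutoffs were used, so the example assumes a DockQ-style score in [0, 1] and the cutoffs commonly used to map DockQ onto CAPRI-style classes (0.23, 0.49, 0.80). The function names, the model identifiers, and the choice to treat "incorrect" models as decoys are hypothetical and are not taken from the AbSet scripts.

```python
from typing import Dict, List, Tuple

# Assumed cutoffs: the usual DockQ-to-CAPRI mapping, checked from best to worst.
# AbSet's actual metric and thresholds are not given in the abstract.
DOCKQ_CUTOFFS: List[Tuple[float, str]] = [
    (0.80, "high"),
    (0.49, "medium"),
    (0.23, "acceptable"),
]


def classify_model(dockq: float) -> str:
    """Map a DockQ-style similarity score in [0, 1] to a quality class."""
    if not 0.0 <= dockq <= 1.0:
        raise ValueError(f"score out of range: {dockq}")
    for cutoff, label in DOCKQ_CUTOFFS:
        if dockq >= cutoff:
            return label
    return "incorrect"


def split_models(scores: Dict[str, float]) -> Tuple[List[str], List[str]]:
    """Separate docking models into (augmentation set, decoy set).

    Assumption for illustration: models labelled "incorrect" go to the
    decoy set, and every other class is kept as augmented structural data.
    """
    augmented: List[str] = []
    decoys: List[str] = []
    for model_id, score in scores.items():
        if classify_model(score) == "incorrect":
            decoys.append(model_id)
        else:
            augmented.append(model_id)
    return augmented, decoys


# Hypothetical model identifiers and scores, for illustration only.
example_scores = {
    "complexA_pose001": 0.86,
    "complexA_pose002": 0.55,
    "complexA_pose003": 0.30,
    "complexA_pose004": 0.05,
}
augmented_ids, decoy_ids = split_models(example_scores)
```

Keeping the cutoffs in a single ordered list makes it straightforward to substitute whichever criteria the data set actually applies without changing the classification logic.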

Authors

  • Diego S Almeida
    Laboratory of Structural and Functional Biology Applied to Biopharmaceuticals, Fundação Oswaldo Cruz, Fiocruz Ceará, Eusébio 61773-270, Brazil.
  • Matheus V Almeida
    Laboratory of Structural and Functional Biology Applied to Biopharmaceuticals, Fundação Oswaldo Cruz, Fiocruz Ceará, Eusébio 61773-270, Brazil.
  • Jean V Sampaio
    Laboratory of Structural and Functional Biology Applied to Biopharmaceuticals, Fundação Oswaldo Cruz, Fiocruz Ceará, Eusébio 61773-270, Brazil.
  • Eduardo M Gaieta
    Laboratory of Structural and Functional Biology Applied to Biopharmaceuticals, Fundação Oswaldo Cruz, Fiocruz Ceará, Eusébio 61773-270, Brazil.
  • Andrielly H S Costa
    Laboratory of Structural and Functional Biology Applied to Biopharmaceuticals, Fundação Oswaldo Cruz, Fiocruz Ceará, Eusébio 61773-270, Brazil.
  • Francisco F A Rabelo
    Universidade Federal do Ceará, Fortaleza 60020-181, Brazil.
  • César L Cavalcante
    Universidade Federal do Ceará, Fortaleza 60020-181, Brazil.
  • Geraldo R Sartori
    Laboratory of Structural and Functional Biology Applied to Biopharmaceuticals, Fundação Oswaldo Cruz, Fiocruz Ceará, Eusébio 61773-270, Brazil.
  • João H M Silva
    Laboratory of Structural and Functional Biology Applied to Biopharmaceuticals, Fundação Oswaldo Cruz, Fiocruz Ceará, Eusébio 61773-270, Brazil.