UniSite: The First Cross-Structure Dataset and Learning Framework for End-to-End Ligand Binding Site Detection
Journal:
arXiv
Published Date:
Jun 3, 2025
Abstract
The detection of ligand binding sites for proteins is a fundamental step in
Structure-Based Drug Design. Despite notable advances in recent years, existing
methods, datasets, and evaluation metrics are confronted with several key
challenges: (1) current datasets and methods are centered on individual
protein-ligand complexes and neglect that diverse binding sites may exist
across multiple complexes of the same protein, introducing significant
statistical bias; (2) ligand binding site detection is typically modeled as a
discontinuous workflow, in which binary segmentation is followed by a separate
clustering step; (3) traditional evaluation metrics do not adequately reflect the
actual performance of different binding site prediction methods. To address
these issues, we first introduce UniSite-DS, the first UniProt (Unique
Protein)-centric ligand binding site dataset, which contains 4.81 times more
multi-site data and 2.08 times more overall data than the most widely used
previous datasets. We then propose UniSite, the first end-to-end ligand
binding site detection framework supervised by set prediction loss with
bijective matching. In addition, we introduce Average Precision based on
Intersection over Union (IoU) as a more accurate evaluation metric for ligand
binding site prediction. Extensive experiments on UniSite-DS and several
representative benchmark datasets demonstrate that IoU-based Average Precision
provides a more accurate reflection of prediction quality, and that UniSite
outperforms current state-of-the-art methods in ligand binding site detection.
The dataset and code will be made publicly available at
https://github.com/quanlin-wu/unisite.
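
To make the IoU-based Average Precision idea concrete, the following is a minimal sketch for a single protein, not the authors' implementation. It assumes each binding site is represented as a set of residue indices, that every prediction carries a confidence score, and that a prediction counts as a true positive when its IoU with a still-unmatched ground-truth site reaches a threshold (0.5 here, an illustrative choice). The function names `iou` and `average_precision`, the greedy confidence-ordered matching, and the non-interpolated AP formula are all assumptions borrowed from common object-detection practice; the abstract does not specify the exact protocol.

```python
from typing import List, Set, Tuple


def iou(a: Set[int], b: Set[int]) -> float:
    """Intersection over Union of two residue-index sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_precision(
    preds: List[Tuple[float, Set[int]]],
    truths: List[Set[int]],
    iou_threshold: float = 0.5,  # illustrative threshold, not from the paper
) -> float:
    """Non-interpolated AP for one protein (hypothetical helper).

    preds  : (confidence, predicted residue set) pairs
    truths : ground-truth binding sites as residue sets
    """
    if not truths:
        return 0.0

    # Visit predictions from most to least confident.
    ranked = sorted(preds, key=lambda p: p[0], reverse=True)
    matched: Set[int] = set()  # indices of ground-truth sites already claimed
    hits: List[bool] = []

    for _, site in ranked:
        # Find the best-overlapping ground-truth site that is still unmatched.
        best_j, best_iou = -1, 0.0
        for j, truth in enumerate(truths):
            if j in matched:
                continue
            value = iou(site, truth)
            if value > best_iou:
                best_j, best_iou = j, value
        if best_j >= 0 and best_iou >= iou_threshold:
            matched.add(best_j)  # each ground-truth site is matched at most once
            hits.append(True)
        else:
            hits.append(False)

    # Sum precision at the rank of each true positive, normalise by #truths.
    ap, true_positives = 0.0, 0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            true_positives += 1
            ap += true_positives / rank
    return ap / len(truths)


# Toy example: two true sites, three predictions (one spurious).
example_truths = [{1, 2, 3, 4}, {10, 11, 12}]
example_preds = [(0.9, {1, 2, 3}), (0.8, {20, 21}), (0.6, {10, 11, 12, 13})]
example_ap = average_precision(example_preds, example_truths)  # (1/1 + 2/3) / 2
```

In this toy case the first and third predictions each overlap a distinct true site with IoU 0.75, while the second overlaps nothing, so the score penalises the spurious pocket ranked above a correct one. Unlike a centre-distance criterion, a prediction that covers only a small part of a site, or that merges two sites, would fall below the IoU threshold and count as a miss.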