FoldVision: A compute-efficient atom-level 3D protein encoder

Journal: bioRxiv
Published Date:

Abstract

Protein function emerges from three-dimensional structure, yet many large-scale protein prediction pipelines still rely solely on linear sequence embeddings. Although multiple structure-aware protein networks have been proposed, they often omit atom-level details and struggle to capture the detailed chemistry of binding sites. Here, we introduce FoldVision, a compute-efficient 3D convolutional neural network that voxelizes every heavy atom, learns rotation-robust representations, and is pre-trained on over 500,000 AlphaFold-2 structures, which is more than two orders of magnitude less data than used to train modern protein language models. Despite its compact size of 123 million parameters, FoldVision outperforms or matches state-of-the-art protein encoders on four benchmarks that require fine structural resolution: enzyme-substrate classification, transporter-substrate classification, drug-kinase inhibition, and drug-target inhibition prediction. A simple ensemble with a sequence-based model consistently improves performance across all benchmarks beyond any individual model. This indicates that FoldVision learns structural signals that are complementary to those extracted by sequence-based models. This study demonstrates that full-atom protein 3D CNNs are both tractable and superior to protein language models alone for structure-dependent tasks.
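To make the idea of voxelizing every heavy atom concrete, here is a minimal sketch in Python with NumPy. It is not FoldVision's implementation. The abstract does not specify the grid resolution, the channel layout, or how rotation robustness is achieved, so the function name `voxelize_heavy_atoms`, the `ELEMENT_CHANNELS` mapping, the 32-voxel grid, and the 1 Å voxel size are all illustrative assumptions.

```python
import numpy as np

# Hypothetical channel layout (an assumption, not taken from the paper):
# one channel per common heavy-atom element.
ELEMENT_CHANNELS = {"C": 0, "N": 1, "O": 2, "S": 3}


def voxelize_heavy_atoms(coords, elements, grid_size=32, voxel_size=1.0):
    """Map heavy atoms onto a (channels, D, H, W) count grid.

    coords   : (n_atoms, 3) Cartesian coordinates in angstroms
    elements : sequence of element symbols, one per atom
    grid_size, voxel_size : assumed values chosen for illustration only
    """
    coords = np.asarray(coords, dtype=np.float32)
    grid = np.zeros(
        (len(ELEMENT_CHANNELS), grid_size, grid_size, grid_size),
        dtype=np.float32,
    )

    # Keep heavy atoms only: hydrogens and elements without a channel are dropped.
    keep = np.array([e in ELEMENT_CHANNELS for e in elements], dtype=bool)
    if not keep.any():
        return grid
    xyz = coords[keep]
    channels = np.array(
        [ELEMENT_CHANNELS[e] for e, k in zip(elements, keep) if k]
    )

    # Centre the structure on the mean heavy-atom position, then bin into voxels.
    xyz = xyz - xyz.mean(axis=0)
    idx = np.floor(xyz / voxel_size).astype(int) + grid_size // 2

    # Discard atoms that fall outside the grid.
    inside = np.all((idx >= 0) & (idx < grid_size), axis=1)
    idx, channels = idx[inside], channels[inside]

    # Accumulate atom counts; np.add.at handles several atoms in one voxel.
    np.add.at(grid, (channels, idx[:, 0], idx[:, 1], idx[:, 2]), 1.0)
    return grid
```

A grid of this shape could be passed to a 3D convolutional network. One common way to encourage rotation-robust representations is to apply random rotations to `coords` before voxelization during training, though the abstract does not say whether FoldVision uses this approach.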

Authors

  • Kroll, A.
  • Yadav, S.
  • Lercher, M.

Categories