BindSpace decodes transcription factor binding signals by large-scale sequence embedding.

Journal: Nature Methods
Published Date:

Abstract

The decoding of transcription factor (TF) binding signals in genomic DNA is a fundamental problem. Here we present a prediction model called BindSpace that learns to embed DNA sequences and TF labels into the same space. By training on binding data from hundreds of TFs and embedding over 1 million DNA sequences, BindSpace achieves state-of-the-art multiclass binding prediction performance, both in vitro and in vivo, and can distinguish between signals of closely related TFs.
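
To illustrate what embedding DNA sequences and TF labels "into the same space" can mean in practice, the following is a minimal sketch and not the authors' implementation. It assumes that a sequence is represented as the average of its k-mer vectors and that TF labels are ranked by cosine similarity to that average; the abstract states neither choice. All names (`kmers`, `embed_sequence`, `rank_labels`, `TF_A`, `TF_B`) and the hand-written toy vectors are hypothetical. In a trained model the vectors would be learned from binding data.

```python
import math


def kmers(seq, k):
    """Return all overlapping k-mers of a DNA sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def embed_sequence(seq, kmer_vectors, k, dim):
    """Average the vectors of the k-mers in seq that have an embedding.

    Assumption: averaging is one simple way to pool k-mer vectors;
    it returns the zero vector if no k-mer is in the vocabulary.
    """
    total = [0.0] * dim
    count = 0
    for km in kmers(seq, k):
        vec = kmer_vectors.get(km)
        if vec is None:
            continue
        total = [t + v for t, v in zip(total, vec)]
        count += 1
    if count == 0:
        return total
    return [t / count for t in total]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_labels(seq, kmer_vectors, label_vectors, k, dim):
    """Score every TF label against the sequence embedding, best first."""
    seq_vec = embed_sequence(seq, kmer_vectors, k, dim)
    scores = [(label, cosine(seq_vec, vec)) for label, vec in label_vectors.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy vectors in a shared 2-dimensional space.
KMER_VECTORS = {"ACG": [1.0, 0.0], "CGT": [1.0, 0.0], "TTT": [0.0, 1.0]}
LABEL_VECTORS = {"TF_A": [1.0, 0.0], "TF_B": [0.0, 1.0]}

ranking = rank_labels("ACGT", KMER_VECTORS, LABEL_VECTORS, k=3, dim=2)
print(ranking)
```

Because sequences and labels live in one vector space, multiclass prediction reduces to a nearest-label lookup, and closely related TFs can in principle be told apart by how their label vectors differ.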

Authors

  • Han Yuan
    Computational and Systems Biology Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA.
  • Meghana Kshirsagar
    Computational and Systems Biology Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA.
  • Lee Zamparo
    Computational Biology Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA.
  • Yuheng Lu
    Computational and Systems Biology Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA.
  • Christina S Leslie
    Computational Biology Program, Memorial Sloan Kettering Cancer Center, New York, NY, USA.