DDBJ Data Analysis Challenge: a machine learning competition to predict Arabidopsis chromatin feature annotations from DNA sequences.

Journal: Genes & genetic systems
PMID:

Abstract

Recently, the prospect of applying machine learning tools to automate the annotation of large-scale sequence data from next-generation sequencers has attracted the interest of researchers. However, many experimental life scientists find it difficult to recruit research collaborators with knowledge of machine learning techniques. One solution to this problem is to utilise the power of crowdsourcing. In this report, we describe how we investigated the potential of crowdsourced modelling for a life science task by conducting a machine learning competition, the DNA Data Bank of Japan (DDBJ) Data Analysis Challenge. In the challenge, participants submitted competing models that predicted chromatin feature annotations from DNA sequences. The challenge engaged 38 participants, with a cumulative total of 360 model submissions. The top model achieved an area under the curve (AUC) score of 0.95. Over the course of the competition, the overall performance of the submitted models improved by 0.30 in AUC relative to the first submitted model. Furthermore, the first- and second-ranked models utilised external data, such as genomic location and gene annotation information, together with specific domain knowledge. Incorporating this domain knowledge improved the AUC scores by approximately 5%-9%. This report suggests that machine learning competitions will lead to the development of highly accurate machine learning models for use by experimental scientists unfamiliar with the complexities of data science.
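
For readers unfamiliar with the metric, the following is a minimal sketch of how an AUC score can be computed, assuming the AUC reported here is the area under the ROC curve. It uses the rank interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The function name `roc_auc` and the toy labels and scores are hypothetical illustrations, not the challenge's evaluation code or data.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = sequence carries the chromatin feature, 0 = it does not.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the top score of 0.95 means positives were ranked above negatives in about 95% of such pairs under this interpretation.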

Authors

  • Eli Kaminuma
    Center for Information Biology, National Institute of Genetics.
  • Yukino Baba
    Graduate School of Informatics, Kyoto University.
  • Masahiro Mochizuki
    IMSBIO Co., Ltd.
  • Hirotaka Matsumoto
    Laboratory for Bioinformatics Research, RIKEN Center for Biosystems Dynamics Research.
  • Haruka Ozaki
    Laboratory for Bioinformatics Research, RIKEN Center for Biosystems Dynamics Research.
  • Toshitsugu Okayama
    BITS Co., Ltd.
  • Takuya Kato
    Graduate School of Information Science and Technology, The University of Tokyo.
  • Shinya Oki
    Graduate School of Medical Sciences, Kyushu University.
  • Takatomo Fujisawa
    Center for Information Biology, National Institute of Genetics, Research Organization of Information and Systems, 1111 Yata, Mishima, Shizuoka, 411-8540, Japan.
  • Yasukazu Nakamura
    Center for Information Biology, National Institute of Genetics.
  • Masanori Arita
    RIKEN Center for Sustainable Resource Science, Yokohama, Kanagawa, Japan.
  • Osamu Ogasawara
    Center for Information Biology, National Institute of Genetics.
  • Hisashi Kashima
    Graduate School of Informatics, Kyoto University.
  • Toshihisa Takagi
    Center for Information Biology, National Institute of Genetics.