A Collection of Benchmark Data Sets for Knowledge Graph-based Similarity in the Biomedical Domain.

Journal: Database : the journal of biological databases and curation
Published Date:

Abstract

The ability to compare entities within a knowledge graph is a cornerstone technique for several applications, ranging from the integration of heterogeneous data to machine learning. It is of particular importance in the biomedical domain, where semantic similarity can be applied to the prediction of protein-protein interactions, associations between diseases and genes, and the cellular localization of proteins, among others. In recent years, several knowledge graph-based semantic similarity measures have been developed, but building a gold standard data set to support their evaluation is non-trivial. We present a collection of 21 benchmark data sets that aim to circumvent the difficulties in building benchmarks for large biomedical knowledge graphs by exploiting proxies for biomedical entity similarity. These data sets include data from two successful biomedical ontologies, the Gene Ontology and the Human Phenotype Ontology, and explore proxy similarities calculated from protein sequence similarity, protein family similarity, protein-protein interactions and phenotype-based gene similarity. The data sets vary in size and cover four different species at different levels of annotation completion. For each data set, we also provide semantic similarity computations with state-of-the-art representative measures. Database URL: https://github.com/liseda-lab/kgsim-benchmark.
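
To make the evaluation idea in the abstract concrete, the following is a minimal sketch of how a semantic similarity measure might be compared against a proxy similarity over a set of entity pairs. It is an illustration under stated assumptions, not the authors' method: the protein identifiers, annotation terms and proxy scores are invented toy values, Jaccard overlap of annotation sets stands in for a real knowledge graph-based measure, and Pearson correlation is assumed as the agreement statistic.

```python
from math import sqrt


def jaccard(a, b):
    """Jaccard overlap of two annotation sets.

    A simple stand-in for a knowledge graph-based semantic similarity
    measure; it ignores the ontology hierarchy entirely.
    """
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: entity -> set of ontology annotations.
# The identifiers are placeholders, not real ontology terms.
annotations = {
    "P1": {"term_a", "term_b", "term_c"},
    "P2": {"term_a", "term_b"},
    "P3": {"term_c", "term_d"},
    "P4": {"term_e"},
}

# Hypothetical entity pairs with an invented proxy similarity
# (for example, a normalized sequence similarity score).
pairs = [
    ("P1", "P2", 0.9),
    ("P1", "P3", 0.4),
    ("P2", "P3", 0.1),
    ("P1", "P4", 0.0),
]

semantic = [jaccard(annotations[a], annotations[b]) for a, b, _ in pairs]
proxy = [score for _, _, score in pairs]

# Agreement between the semantic measure and the proxy.
r = pearson(semantic, proxy)
print(f"Pearson correlation: {r:.3f}")
```

A higher correlation would suggest that the semantic measure tracks the proxy more closely. In practice the pair lists and precomputed similarity values would come from the benchmark files in the repository, whose exact layout is not described here.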

Authors

  • Carlota Cardoso
    Departamento de Informática, LASIGE, Faculdade de Ciências da Universidade de Lisboa, 1749-016 Lisboa, Portugal.
  • Rita T Sousa
    Departamento de Informática, LASIGE, Faculdade de Ciências da Universidade de Lisboa, 1749-016 Lisboa, Portugal.
  • Sebastian Köhler
    School of ITEE, The University of Queensland, St. Lucia, QLD 4072, Australia, Garvan Institute of Medical Research, Darlinghurst, Sydney, NSW 2010, Australia, Institute for Medical Genetics and Human Genetics, Charité-Universitätsmedizin Berlin, 13353 Berlin, Germany, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK, National Institute of Informatics, Hitotsubashi, Tokyo, Japan, Mouse Informatics Group, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK, LASIGE, Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, 1749-016 Lisboa, Portugal, Genetic Services of Western Australia, King Edward Memorial Hospital, WA 6008, Australia, School of Paediatrics and Child Health, University of Western Australia, WA 6008, Australia, Institute for Immunology and Infectious Diseases, Murdoch University, WA 6150, Australia, Office of Population Health, Public Health and Clinical Services Division, Western Australian Department of Health, WA 6004, Australia, Academic Department of Medical Genetics, Sydney Children's Hospitals Network (Westmead), NSW 2145, Australia, Discipline of Genetic Medicine, Sydney Medical School, The University of Sydney, NSW 2006, Australia, Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany, Institute for Bioinformatics, Department of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany and Berlin Brandenburg Center for Regenerative Therapies, 13353 Berlin, Germany.
  • Catia Pesquita
    Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, Portugal.