Supervised Learning for Detection of Duplicates in Genomic Sequence Databases.

Journal: PLOS ONE

Published Date: Aug 4, 2016

Abstract

MOTIVATION: First identified as an issue in 1996, duplication in biological databases introduces redundancy and can even lead to inconsistency when contradictory information appears. The volume of data makes purely manual de-duplication impractical, and existing automatic systems cannot detect duplicates as precisely as experts can. Supervised learning has the potential to address these problems by building automatic systems that learn from expert curation to detect duplicates precisely and efficiently. While machine learning is a mature approach in other duplicate detection contexts, it has seen only preliminary application in genomic sequence databases.

Authors

Qingyu Chen
Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA.

Justin Zobel
Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia.

Xiuzhen Zhang
School of Science, RMIT University, Melbourne, Australia.

Karin Verspoor
Department of Computing and Information Systems, School of Engineering, University of Melbourne, Melbourne, Australia.

Keywords

Animals; Base Sequence; Caenorhabditis elegans; Computational Biology; Databases, Nucleic Acid; Escherichia coli; Supervised Machine Learning; Zea mays; Zebrafish

External Resources

PubMed ID: 27489953
