Advancing FAIR Data Management through AI-Assisted Curation of Morphological Data Matrices
Journal: bioRxiv
Published Date: Jan 1, 2025
Abstract
Curation of biological and paleontological datasets is a labor-intensive process that requires standardization and validation to ensure data integrity. Manual curation in particular is prone to human error, including typographical mistakes, inconsistent formatting, and incomplete metadata, all of which hinder reproducibility and compliance with the Findability, Accessibility, Interoperability, and Reusability (FAIR) principles. Artificial Intelligence (AI) can improve research efficiency by automating data validation, improving accuracy, and streamlining curation workflows. This study explores the integration of an AI-assisted curation tool developed for MorphoBank, an open-access repository established to enhance the standardization and usability of morphological character datasets. Specifically, this work presents an AI tool designed to extract, structure, and standardize morphological character data from published literature into the NEXUS file format, which is widely used for phylogenetic analyses. The tool leverages machine learning techniques, including Large Language Models (LLMs), to automate the extraction of character names and character states from text in various formats, reducing manual data entry errors and improving data completeness. The system enables efficient conversion of matrix-only files into complete, machine- and human-readable datasets that include key character metadata. By automating these tasks, the tool significantly accelerates dataset curation while improving accuracy and standardization. This approach increases the FAIRness of the data and offers a scalable framework for extending AI-assisted curation to other biological datasets. Our findings demonstrate the value of AI for curating scientific data and for advancing data reuse in paleontology, systematics, and evolutionary biology.
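To make the conversion of a matrix-only file into a complete dataset concrete, the following is a minimal sketch of the final assembly step only. It assumes the character names and states have already been extracted (by an LLM or by hand) and merges them with a score matrix into a NEXUS file containing a CHARSTATELABELS block. It is not the MorphoBank tool's implementation: the `Character` class, the `quote` and `build_nexus` functions, and the example taxa and characters are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Character:
    """One morphological character: a name plus its ordered state labels."""
    name: str
    states: list[str]


def quote(token: str) -> str:
    """Single-quote a NEXUS token, doubling any embedded single quotes."""
    return "'" + token.replace("'", "''") + "'"


def build_nexus(matrix: dict[str, str], characters: list[Character]) -> str:
    """Merge a taxon-by-character score matrix with character metadata.

    Simplifying assumption: every score is a single symbol, so a row's
    length equals the number of characters (no polymorphic cells).
    """
    nchar = len(characters)
    for taxon, row in matrix.items():
        if len(row) != nchar:
            raise ValueError(f"{taxon}: expected {nchar} scores, got {len(row)}")

    # One CHARSTATELABELS entry per character: index, name, then its states.
    labels = ",\n".join(
        f"    {i} {quote(c.name)} / " + " ".join(quote(s) for s in c.states)
        for i, c in enumerate(characters, start=1)
    )
    rows = "\n".join(f"    {quote(t)} {row}" for t, row in matrix.items())
    return (
        "#NEXUS\n"
        "BEGIN TAXA;\n"
        f"  DIMENSIONS NTAX={len(matrix)};\n"
        "  TAXLABELS " + " ".join(quote(t) for t in matrix) + ";\n"
        "END;\n"
        "BEGIN CHARACTERS;\n"
        f"  DIMENSIONS NCHAR={nchar};\n"
        '  FORMAT DATATYPE=STANDARD MISSING=? GAP=- SYMBOLS="0 1 2 3 4 5 6 7 8 9";\n'
        "  CHARSTATELABELS\n" + labels + ";\n"
        "  MATRIX\n" + rows + "\n  ;\n"
        "END;\n"
    )


if __name__ == "__main__":
    # Invented example data: two taxa scored for two characters.
    characters = [
        Character("Tail", ["absent", "present"]),
        Character("Digit count", ["five", "four"]),
    ]
    print(build_nexus({"Taxon_A": "01", "Taxon_B": "1?"}, characters))
```

The length check reflects the kind of validation the abstract mentions: a mismatch between the number of scored columns and the number of extracted characters is caught before the file is written.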