Enhancing Scientific Reproducibility Through Automated BioCompute Object Creation Using Retrieval-Augmented Generation from Publications
Journal: arXiv
Published Date: Sep 23, 2024
Abstract
The exponential growth in computational power and accessibility has
transformed the complexity and scale of bioinformatics research, necessitating
standardized documentation for transparency, reproducibility, and regulatory
compliance. The IEEE BioCompute Object (BCO) standard addresses this need but
faces adoption challenges due to the overhead of creating compliant
documentation, especially for legacy research. This paper presents a novel
approach to automate the creation of BCOs from scientific papers using
Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs). We
describe the development of the BCO assistant tool that leverages RAG to
extract relevant information from source papers and associated code
repositories, addressing key challenges such as LLM hallucination and
long-context understanding. The implementation incorporates optimized retrieval
processes, including a two-pass retrieval with re-ranking, and employs
carefully engineered prompts for each BCO domain. We discuss the tool's
architecture, extensibility, and evaluation methods, including automated and
manual assessment approaches. The BCO assistant demonstrates the potential to
significantly reduce the time and effort required for retroactive documentation
of bioinformatics research while maintaining compliance with the standard. This
approach opens avenues for AI-assisted scientific documentation and knowledge
extraction from publications, thereby enhancing scientific reproducibility. The
BCO assistant tool and documentation are available at
https://biocompute-objects.github.io/bco-rag/.
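
To make the "two-pass retrieval with re-ranking" idea concrete, the following is a minimal, dependency-free sketch and not the BCO assistant's actual implementation. All function names (`first_pass`, `rerank`, `two_pass_retrieve`) and both scoring functions are assumptions chosen for illustration: the first pass uses bag-of-words cosine similarity as a stand-in for embedding search, and the second pass uses query-term coverage as a stand-in for a learned re-ranker.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def first_pass(query, chunks, k):
    """Pass 1 (hypothetical): cheap similarity over all chunks, keep the top k."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(c))), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


def rerank(query, candidates, n):
    """Pass 2 (hypothetical): re-score only the surviving candidates.

    The score here is the fraction of distinct query terms a chunk covers.
    A real system would more likely use a cross-encoder or an LLM-based scorer.
    """
    q_terms = set(tokenize(query))

    def coverage(chunk):
        if not q_terms:
            return 0.0
        return len(q_terms & set(tokenize(chunk))) / len(q_terms)

    return sorted(candidates, key=coverage, reverse=True)[:n]


def two_pass_retrieve(query, chunks, k=3, n=1):
    """Retrieve k candidates cheaply, then return the n best after re-ranking."""
    return rerank(query, first_pass(query, chunks, k), n)
```

The design point is that the first pass can afford to be broad and inexpensive because the second pass only ever scores `k` candidates, so a slower but more discriminating scorer can be used there before the selected chunks are passed to the LLM prompt for a given BCO domain.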