ScholarCopilot: Training Large Language Models for Academic Writing with Accurate Citations
Journal: arXiv
Published Date: Apr 1, 2025
Abstract
Academic writing requires both coherent text generation and precise citation
of relevant literature. Although recent Retrieval-Augmented Generation (RAG)
systems have significantly improved factual accuracy in general-purpose text
generation, their ability to support professional academic writing remains
limited. In this work, we introduce ScholarCopilot, a unified framework
designed to enhance existing large language models for generating professional
academic articles with accurate and contextually relevant citations.
ScholarCopilot dynamically determines when to retrieve scholarly references by
generating a retrieval token [RET], which is then used to query a citation
database. The retrieved references are fed into the model to augment the
generation process. We jointly optimize both the generation and citation tasks
within a single framework to improve efficiency. Our model is built upon
Qwen2.5-7B and trained on 500K papers from arXiv. It achieves a top-1
retrieval accuracy of 40.1% on our evaluation dataset, outperforming baselines
such as E5-Mistral-7B-Instruct (15.0%) and BM25 (9.8%). On a dataset of 1,000
academic writing samples, ScholarCopilot scores 16.2/25 in generation quality
-- measured across relevance, coherence, academic rigor, completeness, and
innovation -- significantly surpassing all existing models, including much
larger ones like the Retrieval-Augmented Qwen2.5-72B-Instruct. Human studies
further demonstrate that ScholarCopilot, despite being a 7B model,
significantly outperforms ChatGPT, achieving 100% preference in citation
quality and over 70% in overall usefulness.
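
The retrieve-on-[RET] mechanism described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a scripted token stream in place of a language model, a toy word-overlap score in place of a learned retrieval representation, and the context so far as the query, standing in for the [RET] token itself. All names (Reference, retrieve, generate_with_citations) and the two database entries are hypothetical.

```python
from dataclasses import dataclass

RET = "[RET]"  # retrieval token named in the abstract


@dataclass
class Reference:
    key: str   # hypothetical citation key
    text: str  # toy stand-in for a title or abstract


def embed(text):
    """Toy 'embedding': the set of lowercase words (assumption, not a learned vector)."""
    return set(text.lower().split())


def retrieve(query, database):
    """Return the reference with the highest Jaccard word overlap with the query."""
    q = embed(query)

    def score(ref):
        r = embed(ref.text)
        union = q | r
        return len(q & r) / len(union) if union else 0.0

    return max(database, key=score)


def generate_with_citations(scripted_tokens, database):
    """Walk a scripted token stream; whenever [RET] appears, query the
    database with the context so far and splice the citation into the text."""
    context, cited = [], []
    for tok in scripted_tokens:
        if tok == RET:
            ref = retrieve(" ".join(context), database)
            cited.append(ref.key)
            context.append(f"[{ref.key}]")  # retrieved reference joins the context
        else:
            context.append(tok)
    return " ".join(context), cited


# Toy data, invented for illustration only.
database = [
    Reference("ref_rag", "retrieval augmented generation for knowledge intensive tasks"),
    Reference("ref_bm25", "probabilistic relevance framework bm25 ranking"),
]
tokens = ["Retrieval", "augmented", "generation", "improves", "factual", "accuracy", RET, "."]
text, cited = generate_with_citations(tokens, database)
print(text)
print(cited)
```

In this sketch the decision of when to cite comes from the token stream itself, which mirrors the abstract's point that generation and retrieval are handled in one pass rather than by a separate retrieve-then-generate pipeline.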