Clinical text annotation - what factors are associated with the cost of time?

Journal: AMIA Annual Symposium Proceedings (AMIA Symposium)

Abstract

Building high-quality annotated clinical corpora is necessary for developing statistical Natural Language Processing (NLP) models to unlock information embedded in clinical text, but it is also time consuming and expensive. Consequently, it is important to identify factors that may affect annotation time, such as the syntactic complexity of the text to be annotated and the vagaries of individual user behavior. However, limited work has been done to understand the annotation of clinical text. In this study, we aimed to investigate how factors inherent to the text affect annotation time for a named entity recognition (NER) task. We recruited 9 users to annotate a clinical corpus and recorded the annotation time for each sample. We then defined a set of factors that we hypothesized might affect annotation time and fit them as predictors in a linear regression model of annotation time. The linear regression model achieved an R² of 0.611 and revealed eight time-associated factors, including characteristics of sentences, individual users, and annotation order, with implications for the practice of annotation and the development of cost models for active learning research.
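To illustrate the kind of analysis the abstract describes, the sketch below fits an ordinary linear regression that predicts per-sample annotation time from text-, user-, and order-related factors and reports R². It is a minimal illustration only, not the authors' code: the feature names, the synthetic data, and the coefficients are assumptions made for the example, not the eight factors identified in the study.

```python
# Minimal sketch: regressing annotation time on hypothetical factors and
# reporting R^2, analogous in spirit to the model described in the abstract.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 500

# Hypothetical factors: sentence length, number of entities, annotator id,
# and the order in which each sample was annotated.
df = pd.DataFrame({
    "sentence_length": rng.integers(5, 60, n),
    "num_entities": rng.integers(0, 8, n),
    "annotator": rng.integers(0, 9, n),      # 9 users, as in the study
    "annotation_order": np.arange(n),
})

# Synthetic target: annotation time in seconds (illustrative coefficients only).
df["time_sec"] = (
    2.0 * df["num_entities"]
    + 0.5 * df["sentence_length"]
    - 0.01 * df["annotation_order"]          # a practice/ordering effect
    + rng.normal(0, 5, n)
)

# One-hot encode the annotator so individual user behavior enters the model.
X = pd.get_dummies(df.drop(columns="time_sec"),
                   columns=["annotator"], drop_first=True)
y = df["time_sec"]

model = LinearRegression().fit(X, y)
print(f"R^2 = {r2_score(y, model.predict(X)):.3f}")
```

Inspecting the fitted coefficients (e.g., `model.coef_`) is what would surface which factors are associated with annotation time in such a setup.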

Authors

  • Qiang Wei
    School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA.
  • Amy Franklin
    School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA.
  • Trevor Cohen
University of Washington, Seattle, WA, USA.
  • Hua Xu
    Department of Urology, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China.