Data Quality in Clinical Coding: A Critical Analysis and Preliminary Study

Journal: medRxiv
Published Date:

Abstract

Clinical coding is a vital yet complex component of healthcare practice. While automated coding systems have advanced significantly, they still rely on imperfect training data, which affects the quality of their predictions. A key issue contributing to this problem, often overlooked in current research, is the presence of errors and undercoding in widely used clinical coding datasets. In this work, we uncover substantial undercoding and annotation errors in commonly used datasets and present the first empirical study on their impact on the performance of automated clinical coding algorithms. We develop a three-stage pipeline combining a large language model (LLM)-based coding evidence extractor, a multiclass classifier trained on silver-labeled evidence from the MIMIC-IV [14] dataset, and a verification step using LLMs to assess the validity of each code. This approach reveals that approximately 80% of clinical notes in the MDACE [3] dataset and 86% of notes in CodiEsp [25] are likely to be undercoded or contain errors. Furthermore, correcting the errors leads to a relative improvement of 4% in precision and 7% in recall for the current state-of-the-art clinical coding model, PLM-ICD [9]. These findings make it clear that not only the algorithm, but also dataset integrity, plays a critical role in automated clinical coding.

CCS Concepts: Computing methodologies → Natural language processing; Applied computing → Health informatics.
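
As a rough illustration of how the three stages described in the abstract could fit together, the following minimal Python sketch chains an evidence extractor, a code classifier, and a verifier, and reports codes that the note supports but the gold annotation lacks. It is not the authors' implementation: the function `find_missing_codes`, the `toy_*` stand-ins, and the two-entry lexicon are hypothetical, and simple keyword rules take the place of the LLM-based extractor, the trained multiclass classifier, and the LLM verifier.

```python
from typing import Callable, Iterable, Set


def find_missing_codes(
    note: str,
    gold_codes: Set[str],
    extract_evidence: Callable[[str], Iterable[str]],
    classify: Callable[[str], str],
    verify: Callable[[str, str, str], bool],
) -> Set[str]:
    """Return codes supported by the note but absent from its gold annotation."""
    missing: Set[str] = set()
    for span in extract_evidence(note):      # stage 1: coding evidence spans
        code = classify(span)                # stage 2: evidence span -> code
        # stage 3: keep only codes that are new and pass verification
        if code not in gold_codes and verify(note, span, code):
            missing.add(code)
    return missing


# Hypothetical toy stand-ins (assumptions for illustration only).
TOY_LEXICON = {"type 2 diabetes": "E11.9", "hypertension": "I10"}


def toy_extract(note: str) -> Iterable[str]:
    """Keyword lookup standing in for the LLM-based evidence extractor."""
    lowered = note.lower()
    return [phrase for phrase in TOY_LEXICON if phrase in lowered]


def toy_classify(span: str) -> str:
    """Dictionary lookup standing in for the trained multiclass classifier."""
    return TOY_LEXICON[span]


def toy_verify(note: str, span: str, code: str) -> bool:
    """Trivial check standing in for the LLM-based verification step."""
    return span in note.lower()


note = "Patient has type 2 diabetes and hypertension."
missing = find_missing_codes(
    note, {"E11.9"}, toy_extract, toy_classify, toy_verify
)
print(missing)  # {'I10'}
```

In this toy case the note would be flagged as potentially undercoded, because the gold annotation omits a code for which the text contains evidence. Passing the stages as callables keeps the sketch independent of any particular model or API.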

Authors

  • Supriya Khadka; Xiaorui Jiang; Vasile Palade