Class imbalance in out-of-distribution datasets: Improving the robustness of the TextCNN for the classification of rare cancer types.

Journal: Journal of biomedical informatics

Published Date: Nov 22, 2021

Abstract

In the last decade, the widespread adoption of electronic health record documentation has created huge opportunities for information mining. Natural language processing (NLP) techniques using machine and deep learning are becoming increasingly widespread for information extraction tasks from unstructured clinical notes. Disparities in performance when deploying machine learning models in the real world have recently received considerable attention. In the clinical NLP domain, the robustness of convolutional neural networks (CNNs) for classifying cancer pathology reports under natural distribution shifts remains understudied. In this research, we aim to quantify and improve the performance of the CNN for text classification on out-of-distribution (OOD) datasets resulting from the natural evolution of clinical text in pathology reports. We identified class imbalance due to different prevalence of cancer types as one of the sources of performance drop and analyzed the impact of previous methods for addressing class imbalance when deploying models in real-world domains. Our results show that our novel class-specialized ensemble technique outperforms other methods for the classification of rare cancer types in terms of macro F1 scores. We also found that traditional ensemble methods perform better in top classes, leading to higher micro F1 scores. Based on our findings, we formulate a series of recommendations for other ML practitioners on how to build robust models with extremely imbalanced datasets in biomedical NLP applications.
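
The distinction the abstract draws between macro and micro F1 can be illustrated with a minimal sketch. It assumes single-label classification and uses invented toy labels; the helper `f1_scores` and the data are hypothetical and are not the paper's code or results. Micro F1 pools all decisions and is dominated by frequent classes, while macro F1 averages per-class scores so a missed rare class counts as much as a common one.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy labels, invented for illustration: 95 "common" reports and 5 "rare" ones.
y_true = ["common"] * 95 + ["rare"] * 5
# A classifier that always predicts the majority class.
y_pred = ["common"] * 100

micro, macro = f1_scores(y_true, y_pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

In this toy case the majority-class predictor keeps a high micro F1 while its macro F1 falls to roughly half, because the rare class contributes a per-class F1 of zero.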

Authors

Kevin De Angeli

Oak Ridge National Lab, Oak Ridge, TN, USA.
Shang Gao

Department of Orthopedics, Orthopedic Center of Chinese PLA, Southwest Hospital, Third Military Medical University, Chongqing, 400038, P.R.China.
Ioana Danciu

Oak Ridge National Laboratory, 1 Bethel Valley Rd, Oak Ridge, TN 37830, USA; Department of Biomedical Informatics, Vanderbilt University, 2525 West End Avenue, Nashville, TN 37203, USA.
Eric B Durbin

University of Kentucky, Lexington, KY.
Xiao-Cheng Wu

Department of Epidemiology, Louisiana State University New Orleans School of Public Health, New Orleans, LA 70112, United States.
Antoinette Stroup

New Jersey State Cancer Registry, Rutgers Cancer Institute of New Jersey, New Brunswick, NJ, 08901, United States of America. Electronic address: nan.stroup@rutgers.edu.
Jennifer Doherty

Utah Cancer Registry, University of Utah School of Medicine, Salt Lake City, UT 84132, United States of America. Electronic address: Jen.Doherty@hci.utah.edu.
Stephen Schwartz

Fred Hutchinson Cancer Research Center, Epidemiology Program, Seattle, WA 98109, USA.
Charles Wiggins

University of New Mexico, Albuquerque, NM 87131, USA.
Mark Damesyn

California Department of Public Health, Sacramento, CA 95814, USA.
Linda Coyle

Information Management Services Inc, Calverton, Maryland, USA.
Lynne Penberthy

Surveillance Research Program, Division of Cancer Control and Population Sciences, National Cancer Institute, Bethesda, Maryland, USA.
Georgia D Tourassi
Hong-Jun Yoon

Keywords

Electronic Health Records; Humans; Machine Learning; Natural Language Processing; Neoplasms; Neural Networks, Computer

External Resources

PubMed ID: 34823030
