PathologyBERT - Pre-trained Vs. A New Transformer Language Model for Pathology Domain.

Journal: AMIA ... Annual Symposium proceedings. AMIA Symposium
Published Date:

Abstract

Pathology text mining is a challenging task given the reporting variability and constant new findings in cancer sub-type definitions. However, successful text mining of a large pathology database can play a critical role in advancing 'big data' cancer research such as similarity-based treatment selection, case identification, prognostication, surveillance, clinical trial screening, risk stratification, and many others. While there is growing interest in developing language models for more specific clinical domains, no pathology-specific language space exists to support rapid data-mining development in the pathology space. In the literature, a few approaches have fine-tuned general transformer models on specialized corpora while maintaining the original tokenizer, but in fields requiring specialized terminology, these models often fail to perform adequately. We propose PathologyBERT, a pre-trained masked language model that was trained on 347,173 histopathology specimen reports and publicly released in the Hugging Face repository. Our comprehensive experiments demonstrate that pre-training a transformer model on pathology corpora yields performance improvements on Natural Language Understanding (NLU) and Breast Cancer Diagnosis Classification when compared to nonspecific language models.

Authors

  • Thiago Santos
    Emory University, Department of Computer Science, Atlanta, Georgia, USA.
  • Amara Tariq
    Department of Biomedical Informatics, Emory School of Medicine, Atlanta, Georgia. Electronic address: amara.tariq2@emory.edu.
  • Susmita Das
    Indian Institute of Technology (IIT), Centre of Excellence in Artificial Intelligence, Kharagpur, West Bengal, India.
  • Kavyasree Vayalpati
    Arizona State University, School of Computing and Augmented Intelligence, Tempe, Arizona, USA.
  • Geoffrey H Smith
    Emory University, Department of Pathology, Atlanta, Georgia, USA.
  • Hari Trivedi
    Department of Radiology, Emory University, Atlanta, Georgia, USA.
  • Imon Banerjee
    Mayo Clinic, Department of Radiology, Scottsdale, AZ, USA.