Classifying Tumor Reportability Status From Unstructured Electronic Pathology Reports Using Language Models in a Population-Based Cancer Registry Setting.

Journal: JCO Clinical Cancer Informatics
PMID:

Abstract

PURPOSE: Population-based cancer registries (PBCRs) collect data on all new cancer diagnoses in a defined population. The data are sourced from pathology reports, which PBCRs process using manual and rule-based solutions. This study presents a state-of-the-art natural language processing (NLP) pipeline built by fine-tuning pretrained language models (LMs). The pipeline is deployed at the British Columbia Cancer Registry (BCCR) to detect reportable tumors from a population-based feed of electronic pathology reports.

Authors

  • Lovedeep Gondara
    British Columbia Cancer Agency, Vancouver, BC, Canada.
  • Jonathan Simkin
    British Columbia Cancer Registry, Provincial Health Services Authority, Vancouver, Canada.
  • Gregory Arbour
    Data Science Institute, University of British Columbia, Vancouver, Canada.
  • Shebnum Devji
    British Columbia Cancer Registry, Provincial Health Services Authority, Vancouver, Canada.
  • Raymond Ng
    Data Science Institute, University of British Columbia, Vancouver, Canada.