Using a Longformer Large Language Model for Segmenting Unstructured Cancer Pathology Reports.

Journal: JCO Clinical Cancer Informatics
Published Date:

Abstract

PURPOSE: Many Natural Language Processing (NLP) methods achieve greater performance when the input text is preprocessed to remove extraneous or unnecessary text. A technique known as text segmentation can facilitate this step by isolating key sections from a document. Given that transformer models, such as Bidirectional Encoder Representations from Transformers (BERT), have demonstrated state-of-the-art performance on many NLP tasks, it is desirable to leverage such models for segmentation. However, transformer models are typically limited to only 512 input tokens and are not well suited for lengthy documents such as cancer pathology reports. The Longformer is a modified transformer model designed to process longer documents while retaining the positive characteristics of standard transformers. This study presents a Longformer model fine-tuned for cancer pathology report segmentation.
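The scaling difference the abstract alludes to can be illustrated with a toy sketch (not the authors' implementation): standard self-attention lets every token attend to every other token, so its cost grows quadratically with document length, whereas Longformer-style local attention restricts each token to a fixed window of neighbors, so cost grows linearly. The window size and sequence lengths below are arbitrary illustrative values.

```python
def sliding_window_mask(seq_len, window):
    """Build a boolean attention mask in which token i may attend to
    token j only when |i - j| <= window. This is a simplified view of
    Longformer's local (sliding-window) attention, for illustration only;
    the real model also adds global attention on selected tokens."""
    return [[abs(i - j) <= window for j in range(seq_len)]
            for i in range(seq_len)]

# With a window spanning the whole sequence, the mask is equivalent to
# full self-attention: every token pair is allowed (quadratic cost).
full = sum(sum(row) for row in sliding_window_mask(512, 512))

# With a small fixed window, each token attends to at most
# 2 * window + 1 positions, so the allowed pairs grow linearly
# in sequence length rather than quadratically.
local = sum(sum(row) for row in sliding_window_mask(512, 64))
```

Counting the `True` entries in each mask makes the trade-off concrete: `full` equals 512 × 512, while `local` is bounded by 512 × 129, which is why a windowed-attention model can afford inputs far beyond the 512-token limit of a standard BERT-style transformer.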

Authors

  • Damien Fung
    Department of Computer Science, University of British Columbia, Vancouver, Canada.
  • Gregory Arbour
    Data Science Institute, University of British Columbia, Vancouver, Canada.
  • Krisha Malik
    Faculty of Health Sciences, University of Waterloo, Waterloo, Canada.
  • Kaitlin Muzio
    Faculty of Health Sciences, University of Waterloo, Waterloo, Canada.
  • Raymond Ng
    Data Science Institute, University of British Columbia, Vancouver, Canada.