Using a Longformer Large Language Model for Segmenting Unstructured Cancer Pathology Reports.

Journal: JCO Clinical Cancer Informatics
Published Date:

Abstract

PURPOSE: Many Natural Language Processing (NLP) methods achieve greater performance when the input text is preprocessed to remove extraneous or unnecessary text. A technique known as text segmentation can facilitate this step by isolating key sections from a document. Given that transformer models, such as Bidirectional Encoder Representations from Transformers (BERT), have demonstrated state-of-the-art performance on many NLP tasks, it is desirable to leverage such models for segmentation. However, transformer models are typically limited to only 512 input tokens and are not well suited for lengthy documents such as cancer pathology reports. The Longformer is a modified transformer model designed to process longer documents while retaining the positive characteristics of standard transformers. This study presents a Longformer model fine-tuned for cancer pathology report segmentation.
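The scaling difference the abstract alludes to can be illustrated with a toy sketch (not the authors' implementation): standard self-attention lets every token attend to every other token, so its cost grows quadratically with document length, whereas Longformer-style local attention restricts each token to a fixed window of neighbors, so cost grows linearly. The window size and sequence lengths below are arbitrary illustrative values.

```python
def sliding_window_mask(seq_len, window):
    """Build a boolean attention mask in which token i may attend to
    token j only when |i - j| <= window. This is a simplified view of
    Longformer's local (sliding-window) attention, for illustration only;
    the real model also adds global attention on selected tokens."""
    return [[abs(i - j) <= window for j in range(seq_len)]
            for i in range(seq_len)]

# With a window spanning the whole sequence, the mask is equivalent to
# full self-attention: every token pair is allowed (quadratic cost).
full = sum(sum(row) for row in sliding_window_mask(512, 512))

# With a small fixed window, each token attends to at most
# 2 * window + 1 positions, so the allowed pairs grow linearly
# in sequence length rather than quadratically.
local = sum(sum(row) for row in sliding_window_mask(512, 64))
```

Counting the `True` entries in each mask makes the trade-off concrete: `full` equals 512 × 512, while `local` is bounded by 512 × 129, which is why a windowed-attention model can afford inputs far beyond the 512-token limit of a standard BERT-style transformer.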

Authors

  • Damien Fung
    Department of Computer Science, University of British Columbia, Vancouver, Canada.
  • Gregory Arbour
    Data Science Institute, University of British Columbia, Vancouver, Canada.
  • Krisha Malik
    Faculty of Health Sciences, University of Waterloo, Waterloo, Canada.
  • Kaitlin Muzio
    Faculty of Health Sciences, University of Waterloo, Waterloo, Canada.
  • Raymond Ng
    Data Science Institute, University of British Columbia, Vancouver, Canada.