A Cytology Dataset for Early Detection of Oral Squamous Cell Carcinoma
Journal: arXiv
Published Date: Jun 11, 2025
Abstract
Oral squamous cell carcinoma (OSCC) is a major global health burden,
particularly in several regions across Asia, Africa, and South America, where
it accounts for a significant proportion of cancer cases. Early detection
dramatically improves outcomes, with stage I cancers achieving up to 90 percent
survival. However, traditional diagnosis based on histopathology has limited
accessibility in low-resource settings because it is invasive,
resource-intensive, and reliant on expert pathologists. In contrast, oral
brush-biopsy cytology offers a minimally invasive, lower-cost alternative,
provided that the remaining challenges, inter-observer variability and the
unavailability of expert pathologists, can be addressed using artificial
intelligence. Developing and validating robust AI solutions requires access
to large, labeled, multi-source datasets to train high-capacity models that
generalize across domain shifts. We introduce the first large and multicenter
oral cytology dataset, comprising annotated slides stained with
Papanicolaou (PAP) and May-Grunwald-Giemsa (MGG) protocols, collected from ten
tertiary medical centers in India. The dataset, labeled and annotated by
expert pathologists for cellular anomaly classification and detection, is
designed to advance AI-driven diagnostic methods. By filling the gap in
publicly available oral cytology datasets, this resource aims to enhance
automated detection, reduce diagnostic errors, and improve early OSCC diagnosis
in resource-constrained settings, ultimately contributing to reduced mortality
and better patient outcomes worldwide.