Transformer-based feature extraction approach for hematopoietic cancer subtype classification.

Journal: Computers in biology and medicine
Published Date:

Abstract

Accurate classification of hematopoietic cancer subtypes remains challenging due to the multipotent nature of hematopoietic cells and the absence of definitive genetic markers. To address this, we propose a Transformer-based Autoencoder that learns compact and biologically informative embeddings from gene expression data. Specifically, our method employs multi-head self-attention in the encoder to learn complex nonlinear interactions among genes, with a reconstruction decoder that enforces biological feature retention. We benchmarked our approach against four widely used feature extraction methods (Principal Component Analysis, Non-negative Matrix Factorization, Autoencoder, and Variational Autoencoder) using transcriptomic data from five hematopoietic cancer subtypes in The Cancer Genome Atlas, totaling 2452 samples. Data were split 60:20:20 into training, validation, and test sets with stratification, and feature-extractor hyperparameters were chosen on the validation set. Each method produced 100-dimensional feature vectors, which were subsequently evaluated using eight multi-class classifiers: Light Gradient Boosting Machine, Extreme Gradient Boosting, Logistic Regression, Random Forest, Decision Tree, Support Vector Machine, and Neural Networks. On the independent test set, the Transformer-based Autoencoder embeddings combined with Light Gradient Boosting Machine achieved an F1-score of 0.969, accuracy of 0.986, precision of 0.975, recall of 0.964, specificity of 0.996, G-mean of 0.980, and balanced accuracy of 0.954. For context, we additionally included a supervised tabular Transformer (FT-Transformer) as a reference; while strong, it is not directly comparable to our unsupervised feature extractor. To enhance interpretability and clinical relevance, we applied Shapley Additive exPlanations to identify the twenty most influential genes contributing to subtype discrimination. This analysis revealed key biomarkers related to endoplasmic reticulum function, antigen processing, and ribonucleic acid regulation. These findings demonstrate that transformer-based unsupervised feature extraction substantially improves predictive accuracy and yields valuable biological insights for complex hematologic malignancies. Overall, the study supports attention-driven representation learning for tabular biomedical data and motivates future work in generative/self-supervised representations for gene expression.
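
The abstract reports several multi-class metrics without stating how they were averaged. As a minimal sketch, assuming macro-averaging over classes and a G-mean defined as the square root of recall times specificity, the metrics could be derived from a confusion matrix as shown below. Both assumptions are illustrative rather than confirmed by the abstract, although the reported values are consistent with them: sqrt(0.964 x 0.996) is approximately 0.980, and the harmonic mean of 0.975 and 0.964 is approximately 0.969. The function name `macro_metrics` and the example matrix are hypothetical and are not data from the study.

```python
from math import sqrt


def macro_metrics(cm):
    """Macro-averaged metrics from a square confusion matrix.

    cm[i][j] is the number of samples of true class i predicted as class j.
    Assumption: macro-averaging and G-mean = sqrt(recall * specificity);
    the abstract does not state which definitions the authors used.
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    precisions, recalls, specificities = [], [], []
    for c in range(k):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                        # true class c, predicted otherwise
        fp = sum(cm[r][c] for r in range(k)) - tp   # predicted c, true class differs
        tn = total - tp - fn - fp
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
        specificities.append(tn / (tn + fp) if tn + fp else 0.0)
    precision = sum(precisions) / k
    recall = sum(recalls) / k
    specificity = sum(specificities) / k
    denom = precision + recall
    return {
        "accuracy": sum(cm[c][c] for c in range(k)) / total,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": 2 * precision * recall / denom if denom else 0.0,
        "g_mean": sqrt(recall * specificity),
    }


# Hypothetical 3-class confusion matrix (illustrative only, not study data).
cm = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
]
metrics = macro_metrics(cm)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Specificity tends to sit close to 1 in a multi-class setting because each class's true negatives include every correctly handled sample from all other classes, which is one reason it is usually read alongside recall or G-mean rather than on its own.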

Authors

  • Kwang Ho Park
    Database and Bioinformatics Laboratory, School of Electrical and Computer Engineering, Chungbuk National University, Cheongju, Korea.
  • Younghee Lee
    Department of Biomedical Informatics, University of Utah School of Medicine, Salt Lake City, UT, United States.
  • Wei Ding
    Division of Stem Cell and Tissue Engineering, Regenerative Medicine Research Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, P.R. China.
  • Kwang Sun Ryu
    Cancer Data Center, National Cancer Control Institute, National Cancer Center, Goyang-si, Gyeonggi-do, Republic of Korea.
  • Keun Ho Ryu
    Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam.
