Transformer-based feature extraction approach for hematopoietic cancer subtype classification.

Journal: Computers in biology and medicine

Published Date: Jan 19, 2026

Abstract

Accurate classification of hematopoietic cancer subtypes remains challenging due to the multipotent nature of hematopoietic cells and the absence of definitive genetic markers. To address this, we propose a Transformer-based Autoencoder that captures compact and biologically informative embeddings from gene expression data. Specifically, our method employs multi-head self-attention in the encoder to learn complex nonlinear interactions among genes, with a reconstruction decoder that enforces biological feature retention. We benchmarked our approach against four widely used feature extraction methods (Principal Component Analysis, Non-negative Matrix Factorization, Autoencoder, and Variational Autoencoder) using transcriptomic data from five hematopoietic cancer subtypes in The Cancer Genome Atlas, totaling 2452 samples. Data were split 60:20:20 into training, validation, and test sets with stratification, and feature-extractor hyperparameters were chosen on the validation set. Each method produced 100-dimensional feature vectors, subsequently evaluated using eight multi-class classifiers: Light Gradient Boosting Machine, Extreme Gradient Boosting, Logistic Regression, Random Forest, Decision Tree, Support Vector Machine, and Neural Networks. On the independent test set, the Transformer-based Autoencoder embeddings combined with Light Gradient Boosting Machine achieved F1-score: 0.969, accuracy: 0.986, precision: 0.975, recall: 0.964, specificity: 0.996, G-mean: 0.980, and balanced accuracy: 0.954. For context, we additionally included a supervised tabular Transformer (FT-Transformer) as a reference; while strong, it is not directly comparable to our unsupervised feature extractor. To enhance interpretability and clinical relevance, we applied Shapley Additive exPlanations to identify the twenty most influential genes contributing to subtype discrimination. This analysis revealed key biomarkers related to endoplasmic reticulum function, antigen processing, and ribonucleic acid regulation. These findings demonstrate that transformer-based unsupervised feature extraction substantially improves predictive accuracy and yields valuable biological insights for complex hematologic malignancies. Overall, the study supports attention-driven representation learning for tabular biomedical data and motivates future work in generative/self-supervised representations for gene expression.
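The abstract reports specificity and G-mean for a five-class problem without stating how they were computed. The sketch below shows one common convention, and it is an assumption rather than the authors' stated procedure: each class is scored one-vs-rest, recall and specificity are macro-averaged, and G-mean is the square root of their product. The function names and the toy labels "A", "B", and "C" are hypothetical and are not taken from the study. Under this convention the reported values are mutually consistent, since sqrt(0.964 × 0.996) rounds to 0.980.

```python
import math


def one_vs_rest_counts(y_true, y_pred, label):
    """Count TP, FN, FP, TN when `label` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    tn = len(pairs) - tp - fn - fp
    return tp, fn, fp, tn


def macro_metrics(y_true, y_pred):
    """Macro-averaged recall, specificity, and G-mean (assumed definitions)."""
    labels = sorted(set(y_true))  # every label here has at least one true sample
    recalls, specificities = [], []
    for label in labels:
        tp, fn, fp, tn = one_vs_rest_counts(y_true, y_pred, label)
        recalls.append(tp / (tp + fn))
        # Guard against the degenerate single-class case (no negatives).
        specificities.append(tn / (tn + fp) if (tn + fp) else 0.0)
    recall = sum(recalls) / len(labels)
    specificity = sum(specificities) / len(labels)
    return {
        "recall": recall,
        "specificity": specificity,
        "g_mean": math.sqrt(recall * specificity),
    }


# Toy labels for illustration only (not data from the study).
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "A", "B", "C", "C", "C"]
toy = macro_metrics(y_true, y_pred)

# G-mean recomputed from the recall and specificity reported in the abstract.
g_mean_from_abstract = math.sqrt(0.964 * 0.996)
```

Other definitions exist, for example taking the geometric mean of per-class recalls, so the agreement with the reported G-mean suggests but does not confirm which one the study used.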

Authors

Kwang Ho Park
Database and Bioinformatics Laboratory, School of Electrical and Computer Engineering, Chungbuk National University, Cheongju, Korea.

Younghee Lee
Department of Biomedical Informatics, University of Utah School of Medicine, Salt Lake City, UT, United States.

Wei Ding
Division of Stem Cell and Tissue Engineering, Regenerative Medicine Research Center, West China Hospital, Sichuan University, Chengdu, Sichuan, 610041, P.R. China.

Kwang Sun Ryu
Cancer Data Center, National Cancer Control Institute, National Cancer Center, Goyang-si, Gyeonggi-do, Republic of Korea.

Keun Ho Ryu
Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam.

External Resources

PubMed ID: 41558385
