Prediction of Lung Metastasis from Hepatocellular Carcinoma using the SEER Database
Journal: arXiv
Published Date: Jan 20, 2025
Abstract
Hepatocellular carcinoma (HCC) is a leading cause of cancer-related
mortality. The lung is the most common site of distant spread, and lung
metastasis significantly worsens prognosis. Despite the growing availability of
clinical and demographic data, predictive models for lung metastasis in HCC
remain limited in scope and clinical applicability. In this study, we develop
and validate an end-to-end machine learning pipeline using data from the
Surveillance, Epidemiology, and End Results (SEER) database. We evaluated three
machine learning models (Random Forest, XGBoost, and Logistic Regression)
alongside a multilayer perceptron (MLP) neural network. Our models achieved
high AUROC values and recall, with the Random Forest and MLP models
demonstrating the best overall performance (AUROC = 0.82). However, the low
precision across models highlights the challenges of accurately predicting
positive cases. To address these limitations, we developed a custom loss
function incorporating recall optimization, enabling the MLP model to achieve
the highest sensitivity. An ensemble approach further improved overall recall
by leveraging the strengths of individual models. Feature importance analysis
revealed key predictors such as surgery status, tumor staging, and follow-up
duration, emphasizing the relevance of clinical interventions and disease
progression in metastasis prediction. While this study demonstrates the
potential of machine learning for identifying high-risk patients, limitations
include reliance on imbalanced datasets, incomplete feature annotations, and
the low precision of predictions. Future work should leverage the expanding
SEER dataset, improve data imputation techniques, and explore advanced
pre-trained models to enhance predictive accuracy and clinical utility.
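
The abstract does not specify the form of the recall-oriented loss or the ensemble rule, so the following is only a minimal illustrative sketch under stated assumptions. It assumes the loss is binary cross-entropy plus a penalty on one minus a "soft" recall, and that the ensemble flags a patient whenever any individual model does. The names recall_aware_loss, union_ensemble, recall, and the weight lam are hypothetical and do not come from the paper.

```python
import math

def recall_aware_loss(y_true, y_prob, lam=1.0, eps=1e-7):
    """Assumed form: mean binary cross-entropy + lam * (1 - soft recall).

    Soft recall is sum(y * p) / sum(y): the average predicted probability
    over the true positive cases.
    """
    bce, soft_tp, positives = 0.0, 0.0, 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        bce -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
        soft_tp += y * p
        positives += y
    bce /= len(y_true)
    soft_recall = soft_tp / positives if positives else 1.0
    return bce + lam * (1.0 - soft_recall)

def union_ensemble(predictions):
    """Assumed rule: predict positive if any model predicts positive."""
    return [int(any(votes)) for votes in zip(*predictions)]

def recall(y_true, y_pred):
    """Fraction of true positives that were predicted positive."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0

# Toy labels and hard predictions from two hypothetical models.
y_true = [1, 0, 1, 0]
model_a = [1, 0, 0, 0]
model_b = [0, 0, 1, 1]
combined = union_ensemble([model_a, model_b])
```

Under this assumed rule, the combined prediction can never have lower recall than any single model, but it can add false positives, which is consistent with the precision trade-off the abstract reports.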