Deep Learning for Differentiating Benign From Malignant Bile Duct Dilation on MRCP: Development and Prospective Evaluation of an Xception-Logistic Regression Ensemble Model.

Journal: Journal of magnetic resonance imaging : JMRI

Abstract

BACKGROUND: Accurate differentiation of benign from malignant bile duct dilatation (BDD) is needed to determine its management plan. Conventional imaging evaluation is subjective, whereas deep learning (DL) offers potential for automated, objective assessment.

PURPOSE: To construct and evaluate DL models and ensemble strategies based on magnetic resonance cholangiopancreatography (MRCP) images for identifying benign and malignant BDD.

STUDY TYPE: Retrospective and prospective.

POPULATION: A retrospective cohort (n = 378; median age, 60 years [range: 14, 90]; 194 male) from two institutions and a prospective cohort (n = 60; median age, 62.5 years [range: 15, 86]; 30 male) were included. Retrospective data were divided by stratified random splitting into training, validation, and internal test sets (2:1:1), plus an independent external test set. Benign cases were downsampled to balance the class distribution.

FIELD STRENGTH/SEQUENCE: 3 T MRCP (3D turbo spin echo: VISTA and SPACE).

ASSESSMENT: The primary retrospective endpoint was the area under the curve (AUC) across DL algorithms and ensembles. Prospectively, the accuracy, sensitivity, and specificity of the model were compared with those of three radiologists.

STATISTICAL TESTS: Group comparisons used Mann-Whitney U and Chi-square tests (significance threshold, p < 0.05). Model performance was evaluated using the Hosmer-Lemeshow test, DeLong's test with Bonferroni correction (α = 0.005), and McNemar's test.

RESULTS: The Xception model achieved AUCs of 0.816 (95% CI, 0.788-0.844) on the internal test set and 0.807 (95% CI, 0.779-0.835) on the external test set. The ensemble model incorporating logistic regression yielded higher patient-level AUCs of 0.890 and 0.885, with good calibration (p = 0.109). No significant differences were observed among the five ensemble strategies (minimum adjusted p = 0.62). In the prospective cohort, the model showed 90.0% accuracy, sensitivity, and specificity, comparable to those of the radiologists (76.7%-86.7%), with no significant difference (p = 0.143, 0.302, and 0.774, respectively).

DATA CONCLUSION: The Xce-LR model shows potential for automating BDD differentiation using MRCP.

TECHNICAL EFFICACY: Stage 2.
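
The statistical tests named in the abstract can be illustrated with a minimal sketch. The abstract does not state how the Bonferroni threshold of α = 0.005 was derived; one reading that is consistent with the five ensemble strategies is a family-wise α of 0.05 divided across all 10 pairwise comparisons, and the code below assumes exactly that. The function names (`bonferroni_alpha`, `bonferroni_adjust`, `mcnemar_exact`) are hypothetical, and the exact two-sided form of McNemar's test is an assumption, since the abstract does not say which variant was used.

```python
from math import comb


def bonferroni_alpha(n_models: int, family_alpha: float = 0.05) -> float:
    """Per-comparison threshold when every pair of models is compared.

    Assumption: all C(n_models, 2) pairwise comparisons form one family.
    """
    n_pairs = comb(n_models, 2)
    return family_alpha / n_pairs


def bonferroni_adjust(p_values: list[float]) -> list[float]:
    """Bonferroni-adjusted p-values: multiply by the family size, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant-pair counts.

    b: cases classified correctly by the model but not by the reader.
    c: cases classified correctly by the reader but not by the model.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


if __name__ == "__main__":
    # Five ensemble strategies give 10 pairs, so the threshold is 0.05 / 10.
    print(bonferroni_alpha(5))
    # Hypothetical discordant counts, for illustration only (not study data).
    print(mcnemar_exact(8, 2))
```

Under this reading, `bonferroni_alpha(5)` gives a threshold of about 0.005, matching the value reported in the abstract. The discordant counts passed to `mcnemar_exact` are invented for illustration and are not taken from the study.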
