A-Eval: A benchmark for cross-dataset and cross-modality evaluation of abdominal multi-organ segmentation.

Journal: Medical Image Analysis
PMID:

Abstract

Although deep learning has revolutionized abdominal multi-organ segmentation, deep models often struggle to generalize because they are trained on small-scale, modality-specific datasets. The recent emergence of large-scale datasets may mitigate this issue, but two important questions remain unanswered: Can models trained on these large datasets generalize well across different datasets and imaging modalities? And, whether they can or not, how can their generalizability be further improved? To address these questions, we introduce A-Eval, a benchmark for the cross-dataset and cross-modality Evaluation ('Eval') of Abdominal ('A') multi-organ segmentation, integrating seven datasets spanning the CT and MRI modalities. Our evaluations indicate that significant domain gaps persist despite larger data scales. While more training data improves generalization, model performance on unseen data remains inconsistent. Joint training across multiple datasets and modalities enhances generalization further, though annotation inconsistencies across datasets pose challenges. These findings highlight the need for diverse, well-curated training data covering various clinical scenarios and modalities to develop robust medical imaging models. The code and pre-trained models are available at https://github.com/uni-medical/A-Eval.
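To make the evaluation protocol concrete, below is a minimal sketch of the per-organ Dice computation that a cross-dataset benchmark of this kind rests on. It is illustrative only: the function names, the organ-label mapping, and the toy volumes are hypothetical and are not taken from the A-Eval repository.

    # Minimal sketch (hypothetical names): per-organ Dice on integer label maps,
    # the core metric behind cross-dataset segmentation evaluation.
    import numpy as np

    def dice_score(pred: np.ndarray, gt: np.ndarray, label: int) -> float:
        """Dice coefficient for a single organ label."""
        p = pred == label
        g = gt == label
        denom = p.sum() + g.sum()
        if denom == 0:
            return float("nan")  # organ absent from both prediction and ground truth
        return 2.0 * np.logical_and(p, g).sum() / denom

    def evaluate_case(pred: np.ndarray, gt: np.ndarray, organ_labels: dict) -> dict:
        """Per-organ Dice for one case; averaged over all cases of a held-out
        dataset, this yields a cross-dataset generalization score."""
        return {name: dice_score(pred, gt, lab) for name, lab in organ_labels.items()}

    if __name__ == "__main__":
        organ_labels = {"liver": 1, "kidney": 2, "spleen": 3}  # hypothetical mapping
        rng = np.random.default_rng(0)
        gt = rng.integers(0, 4, size=(8, 64, 64))  # toy ground-truth volume
        pred = gt.copy()
        pred[gt == 3] = 0                          # simulate a model missing the spleen
        print(evaluate_case(pred, gt, organ_labels))

Averaging such per-organ scores over every case of each held-out dataset, for both CT and MRI, produces cross-dataset generalization numbers of the kind a benchmark like this reports.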

Authors

  • Ziyan Huang
    Institute of Medical Robotics, Shanghai Jiao Tong University, Shanghai, China; Shanghai AI Laboratory, Shanghai, China.
  • Zhongying Deng
  • Jin Ye
  • Haoyu Wang
  • Yanzhou Su
    Shanghai Artificial Intelligence Laboratory, Shanghai, 200000, China.
  • Tianbin Li
  • Hui Sun
  • Junlong Cheng
    College of Information Science and Engineering, Xinjiang University, Urumqi 830000, China; Key Laboratory of Software Engineering Technology, Xinjiang University, China.
  • Jianpin Chen
    School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China; Shanghai Artificial Intelligence Laboratory, Shanghai, 200000, China.
  • Junjun He
    Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, People's Republic of China; Shanghai AI Laboratory, Shanghai, People's Republic of China; Shanghai Jiao Tong University, Shanghai, People's Republic of China.
  • Yun Gu
    Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, SEIEE Building 2-427, No. 800 Dongchuan Road, Minhang District, Shanghai 200240, China.
  • Shaoting Zhang
  • Lixu Gu
  • Yu Qiao