Clinical Comparable Corpus Describing the Same Subjects with Different Expressions.

Journal: Studies in health technology and informatics
Published Date:

Abstract

Medical artificial intelligence (AI) systems need to recognize synonyms and paraphrases that describe the same subject (anatomy, disease, treatment, etc.) in order to better understand real-world clinical documents. Existing linguistic resources focus on variants at the word or sentence level. To handle linguistic variation on a broader scale, we propose the Medical Text Radiology Report section Japanese version (MedTxt-RR-JA), the first clinical comparable corpus. MedTxt-RR-JA was built by recruiting nine radiologists to diagnose the same 15 lung cancer cases from Radiopaedia, an open-access radiological repository. The resulting 135 radiology reports (nine reports for each of the 15 cases) were shown to contain word-, sentence-, and document-level variations while maintaining similar content. MedTxt-RR-JA is also the first publicly available Japanese radiology report corpus, which should help to overcome the poor availability of data for Japanese medical AI systems. Moreover, our methodology can be applied widely to build clinical corpora without privacy concerns.
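
As an illustration of what "comparable" means here, the sketch below groups reports by case and enumerates the pairs of reports that describe the same case. Only the counts (nine radiologists, 15 cases) come from the abstract. The key layout, the placeholder report strings, and the function name comparable_pairs are assumptions made for this example and do not reflect the actual file format or tooling of MedTxt-RR-JA.

```python
from itertools import combinations

N_CASES = 15         # lung cancer cases (from the abstract)
N_RADIOLOGISTS = 9   # radiologists who read every case (from the abstract)

# Hypothetical layout: one report per (case, reader) pair.
# The strings are placeholders, not real corpus text.
reports = {
    (case, reader): f"<report for case {case} by reader {reader}>"
    for case in range(1, N_CASES + 1)
    for reader in range(1, N_RADIOLOGISTS + 1)
}


def comparable_pairs(reports):
    """Return all pairs of report keys that describe the same case."""
    by_case = {}
    for case, reader in reports:
        by_case.setdefault(case, []).append((case, reader))
    pairs = []
    for keys in by_case.values():
        # Every two reports on the same case form one comparable pair.
        pairs.extend(combinations(sorted(keys), 2))
    return pairs


pairs = comparable_pairs(reports)
print(len(reports))  # 135 reports = 15 cases x 9 radiologists
print(len(pairs))    # 540 pairs = 15 cases x C(9, 2)
```

Under these assumptions, each case yields 36 same-case report pairs, so the corpus offers 540 pairs in which the clinical content is shared but the wording may differ.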

Authors

  • Yuta Nakamura
    Division of Radiology and Biomedical Engineering, Graduate School of Medicine, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8655, Japan. yutanakamura-tky@umin.ac.jp.
  • Shouhei Hanaoka
    Department of Radiology, The University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan.
  • Yukihiro Nomura
    The University of Tokyo Hospital.
  • Naoto Hayashi
    The University of Tokyo Hospital.
  • Osamu Abe
    Department of Radiology, The University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8655, Japan.
  • Shuntaro Yada
    Nara Institute of Science and Technology, Ikoma, Nara, Japan.
  • Shoko Wakamiya
    Nara Institute of Science and Technology (NAIST), Japan.
  • Eiji Aramaki
    Nara Institute of Science and Technology (NAIST), Japan.