Using Machine Learning for the Fusion of Tumor Records on a Real-World Dataset.

Journal: Studies in health technology and informatics
Published Date:

Abstract

Cancer registries collect multiple reports describing the same tumor, which can lead to duplicate or conflicting values across records and complicates further use of cancer data. Data fusion addresses this issue by consolidating the multiple records for each tumor into a single record. We use an artificial neural network (ANN) to merge multiple records per tumor and compare its performance with that of a deterministic rule-based approach. We accomplish this using a tabular real-world dataset provided by the Cancer Registry of Rhineland-Palatinate in Germany, covering colorectal, breast and prostate cancer. The performance of both approaches is evaluated using the macro F1 score. We find that the ANN outperforms the deterministic rule-based approach. In addition, we observe that performance depends on the number of features and the distribution of the data. For both data fusion approaches, the macro F1 score increases when a variable has fewer categories and when the dataset is more balanced.
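
To make the evaluation metric concrete, the following is a minimal sketch of how a macro F1 score could be computed for one fused variable. It is an illustration only: the majority-vote rule, the function names, the variable values ("G1" to "G3") and the toy records are all assumptions introduced here, not the registry's actual rules or data, and the ANN is not shown.

```python
from collections import Counter


def majority_vote(values):
    """Hypothetical deterministic rule: pick the most frequent value.

    Ties are broken by first occurrence. The rules actually used in the
    study are not described in the abstract.
    """
    counts = Counter(values)
    best = max(counts.values())
    return next(v for v in values if counts[v] == best)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (invented): each inner list holds the values that several
# reports give for the same variable of one tumor.
tumor_reports = [
    ["G2", "G2", "G3"],
    ["G1", "G3", "G3"],
    ["G2", "G1"],
    ["G3", "G3"],
]
gold = ["G2", "G3", "G1", "G3"]  # assumed reference values per tumor

fused = [majority_vote(reports) for reports in tumor_reports]
accuracy = sum(f == g for f, g in zip(fused, gold)) / len(gold)
score = macro_f1(gold, fused)
print(fused, accuracy, round(score, 3))
```

In this toy case three of four tumors are fused correctly, yet the macro F1 score is noticeably lower than the accuracy because the rare category "G1" is never predicted and contributes an F1 of zero. Since every category carries equal weight in the average, this is consistent with the observation that scores improve with fewer categories per variable and a more balanced dataset.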

Authors

  • Clarissa Krämer
    Johannes Gutenberg University, Mainz, Germany.
  • Susanne Schmitt
    Johannes Gutenberg University, Mainz, Germany.
  • Franz Rothlauf
    Johannes Gutenberg University, Mainz, Germany.