Active label cleaning for improved dataset quality under resource constraints.

Journal: Nature Communications
PMID:

Abstract

Imperfections in data annotation, known as label noise, are detrimental to the training of machine learning models and have a confounding effect on the assessment of model performance. Nevertheless, employing experts to remove label noise by fully re-annotating large datasets is infeasible in resource-constrained settings, such as healthcare. This work advocates for a data-driven approach to prioritising samples for re-annotation, which we term "active label cleaning". We propose to rank instances according to estimated label correctness and labelling difficulty of each sample, and introduce a simulation framework to evaluate relabelling efficacy. Our experiments on natural images and on a specifically devised medical imaging benchmark show that cleaning noisy labels mitigates their negative impact on model training, evaluation, and selection. Crucially, the proposed approach enables correcting labels up to 4× more effectively than typical random selection in realistic conditions, making better use of experts' valuable time for improving dataset quality.
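
The ranking idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the scoring rule, so everything here is an assumption made for illustration: the function names `relabel_priority` and `rank_for_relabelling` are hypothetical, the class posteriors are assumed to come from some already-trained classifier, estimated label correctness is approximated by the negative log-probability of the given label, and labelling difficulty is approximated by the entropy of the posterior.

```python
import math


def relabel_priority(posterior, noisy_label, eps=1e-12):
    """Hypothetical priority score: higher means 'review this label sooner'."""
    # Assumed proxy for label incorrectness: negative log-probability
    # that the model assigns to the label currently in the dataset.
    mislabel_score = -math.log(posterior[noisy_label] + eps)
    # Assumed proxy for labelling difficulty: entropy of the posterior.
    difficulty = -sum(p * math.log(p + eps) for p in posterior if p > 0)
    # Likely-wrong labels rank high; ambiguous samples are pushed down.
    return mislabel_score - difficulty


def rank_for_relabelling(posteriors, noisy_labels):
    """Return sample indices ordered from highest to lowest priority."""
    scores = [relabel_priority(p, y) for p, y in zip(posteriors, noisy_labels)]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy data (made up for illustration only).
posteriors = [
    [0.95, 0.05],  # model is confident in class 0
    [0.95, 0.05],  # model is confident in class 0
    [0.55, 0.45],  # model is unsure
]
noisy_labels = [0, 1, 1]

order = rank_for_relabelling(posteriors, noisy_labels)
print(order)
```

In this toy example the second sample comes first, because its label contradicts a confident prediction. Subtracting the entropy term is a design choice of this sketch: it lowers the priority of ambiguous samples, which may need several annotations to resolve and so use more of a limited relabelling budget.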

Authors

  • Mélanie Bernhardt
    Health Intelligence, Microsoft Research Cambridge, Cambridge, CB1 2FB, UK.
  • Daniel C Castro
  • Ryutaro Tanno
    Centre for Medical Image Computing and Department of Computer Science, UCL, Gower Street, London WC1E 6BT, UK; Healthcare Intelligence, Microsoft Research Cambridge, UK. Electronic address: r.tanno@cs.ucl.ac.uk.
  • Anton Schwaighofer
    Health Intelligence, Microsoft Research, Cambridge, United Kingdom.
  • Kerem C Tezcan
  • Miguel Monteiro
  • Shruthi Bannur
    Health Intelligence, Microsoft Research Cambridge, Cambridge, CB1 2FB, UK.
  • Matthew P Lungren
  • Aditya Nori
    Microsoft Research Cambridge, Cambridge, United Kingdom.
  • Ben Glocker
    Kheiron Medical Technologies, London, UK.
  • Javier Alvarez-Valle
    Health Intelligence, Microsoft Research, Cambridge, United Kingdom.
  • Ozan Oktay