How to gain valuable insight from scarce data with Machine Learning: a post-hoc explanation tool to identify biases in biological images classification

Journal: bioRxiv
Published Date:

Abstract

Machine learning (ML) models are effective at classifying images across various fields, including biology. However, their performance on biomedical images is often limited by the small size of available datasets, a consequence of the time-consuming and costly nature of experimental data collection. A review of the literature shows that many studies using biomedical images fail to follow ML best practices. This study focuses on regenerative medicine, which aims to promote tissue regeneration rather than scarring. To explore this process, we applied ML to a limited dataset of images of mouse tissues, aiming to distinguish between regenerating and scarring samples. As expected, binary classification failed to generalize to independent data. A novel SHAP-based analysis revealed that the overfitted models relied on spurious correlations, including characteristics of individual mice that aligned with the regeneration/scarring labels. The models appeared to be solving the binary classification task but were in fact recognizing individuals. To investigate this behavior further, we examined the test set confusion matrix of a model trained to identify individual mice. We observed that, beyond individual recognition, individuals were grouped according to the time elapsed after injury (3 or 10 days) and the healing outcome (regeneration or scarring). We hypothesized that these groupings were based on relevant biological information captured by the model. To test this hypothesis, we successfully trained a model to classify images according to the time elapsed after injury (3 or 10 days), demonstrating that ML can extract relevant biological information when the task is aligned with what the data can actually support. Altogether, this study demonstrates that carefully examining the explanations of a model is an effective way not only to unveil putative biases but also to extract relevant information from a limited dataset.
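
The abstract describes models that recognized individual mice instead of the healing outcome. One common guard against this kind of leakage is to split the data by individual, so that no mouse contributes images to both the training and the test set. The following is a minimal sketch of that idea only; it is not the authors' pipeline, and the function name `group_split`, the `test_fraction` parameter, and the toy mouse identifiers are all hypothetical.

```python
import random
from collections import defaultdict


def group_split(mouse_ids, test_fraction=0.25, seed=0):
    """Split sample indices so that no mouse appears in both train and test.

    mouse_ids: one identifier per image (hypothetical labels).
    Returns (train_indices, test_indices), each sorted.
    """
    # Collect the image indices that belong to each mouse.
    by_mouse = defaultdict(list)
    for idx, mouse in enumerate(mouse_ids):
        by_mouse[mouse].append(idx)

    # Shuffle the mice (not the images) with a fixed seed.
    mice = sorted(by_mouse)
    random.Random(seed).shuffle(mice)

    # Hold out at least one whole mouse for testing.
    n_test = max(1, round(len(mice) * test_fraction))
    test_idx = [i for m in mice[:n_test] for i in by_mouse[m]]
    train_idx = [i for m in mice[n_test:] for i in by_mouse[m]]
    return sorted(train_idx), sorted(test_idx)


# Hypothetical toy data: 4 mice with 3 images each.
mouse_ids = ["m1"] * 3 + ["m2"] * 3 + ["m3"] * 3 + ["m4"] * 3
train_idx, test_idx = group_split(mouse_ids)

# The sets of mice seen in training and testing should not overlap.
train_mice = {mouse_ids[i] for i in train_idx}
test_mice = {mouse_ids[i] for i in test_idx}
```

With very few individuals, as in a scarce dataset, a split like this leaves few mice per partition, so results from a single split should be read with caution. Libraries such as scikit-learn offer grouped cross-validation (for example `GroupKFold`) for the same purpose.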

Authors

  • Bolut, C.
  • Pacary, A.
  • Pieruccioni, L.
  • Ousset, M.
  • Paupert, J.
  • Casteilla, L.
  • Simoncini, D.

Categories