Whole-Genome Phenotype Prediction with Machine Learning: Open Problems in Bacterial Genomics
Journal: arXiv
Published Date: Feb 11, 2025
Abstract
How can we identify causal genetic mechanisms that govern bacterial traits?
Initial efforts that entrust machine learning models with predicting
phenotype from genotype report high accuracy scores. However,
attempts to interpret the predictive models are
corrupted by falsely identified "causal" features. Relying solely on pattern
recognition and correlations is unreliable, particularly in bacterial
genomics settings, where high dimensionality and spurious associations are the
norm. Though it is not yet clear whether we can overcome this hurdle,
significant efforts are being made towards discovering potential high-risk
bacterial genetic variants. In view of this, we pose open problems
surrounding phenotype prediction from bacterial whole-genome datasets and
their extension to learning causal effects, and discuss challenges that affect
the reliability of a machine's decision-making when faced with datasets of this
nature.