Machine learning-based framework for predicting human infection potential of coronavirus associated with tri-amino acid motifs, KIQ and LEP in spike protein

Journal: bioRxiv
Published Date:

Abstract

Assessing the human infection potential of emerging coronaviruses remains a critical challenge for global health preparedness. In this study, we developed a machine learning-based framework to predict the human infection potential of coronaviruses and to identify associated sequence motifs using spike (S) protein sequences. A total of 3,904 complete S protein sequences were collected, annotated as human-infecting or non-human-infecting, and encoded using trimer-based k-mer features. Model benchmarking was conducted across 27 machine learning algorithms, followed by hyperparameter optimization of the selected model. Robustness and generalizability were evaluated using k-fold cross-validation and independent external validation. Feature interpretability was further assessed using SHAP analysis to identify sequence determinants associated with infection potential. The Random Forest classifier achieved the best performance, with accuracy, sensitivity, and specificity of 97.8%, 99%, and 97.4%, respectively, and demonstrated stable predictive performance across validation datasets. Notably, the KIQ and LEP motifs were strongly associated with human-infecting coronaviruses and mapped to the HR1 and N-terminal domain regions of the S protein. Overall, this framework provides a practical approach for risk assessment and surveillance of emerging coronaviruses.
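
The trimer-based k-mer encoding mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the 20-letter alphabet, the frequency normalization, the function names, and the toy sequence are all assumptions made for illustration. The idea is to slide a window of three residues along a protein sequence, count each tri-amino acid motif (such as KIQ or LEP), and turn the counts into a fixed-length feature vector that a classifier could consume.

```python
from collections import Counter
from itertools import product

# Assumption: the 20 standard amino acids; the abstract does not state the alphabet used.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# All 20^3 = 8000 possible trimers, in a fixed order so every vector has the same layout.
VOCAB = ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]


def trimer_counts(seq: str) -> Counter:
    """Count overlapping 3-residue windows in a protein sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + 3] for i in range(len(seq) - 2))


def trimer_vector(seq: str, vocabulary=VOCAB) -> list:
    """Encode a sequence as relative trimer frequencies over the vocabulary.

    Normalizing by the total count is an assumption; raw counts would also work.
    Trimers containing non-standard residues are simply ignored.
    """
    counts = trimer_counts(seq)
    total = sum(counts[t] for t in vocabulary) or 1  # avoid division by zero
    return [counts[t] / total for t in vocabulary]


# Hypothetical toy sequence, not a real spike protein fragment.
toy = "MKIQDSLEPKIQ"
counts = trimer_counts(toy)
vec = trimer_vector(toy)
print(counts["KIQ"], counts["LEP"], len(vec))
```

Vectors built this way could be passed to any tabular classifier, for example a Random Forest, with the per-trimer columns serving as the features that an attribution method such as SHAP would then rank.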

Authors

  • Chanraeng, N.
  • Guo, J.
  • Srisongkram, T.
  • Hinwan, Y.
  • Fransson, P.
  • Sjödin, H.
  • Matsuura, Y.
  • Overgaard, H. J.
  • Panthong, W.
  • Ekalaksananan, T.
  • Pientong, C.
  • Phanthanawiboon, S.

Categories