Predicting the bacterial host range of plasmid genomes using the language model-based one-class support vector machine algorithm.

Journal: Microbial genomics
PMID:

Abstract

Predicting the host range of plasmids is crucial for investigating how plasmids disseminate and how they mediate the transfer of resistance and virulence genes. Several machine learning-based tools have been developed for this task, and they have been trained and tested on the bacterial host records of plasmids in related databases. Typically, a plasmid genome in a database such as that of the National Center for Biotechnology Information is annotated with only one or a few bacterial hosts, which do not encompass all possible hosts. Consequently, existing methods may significantly underestimate the host ranges of mobile plasmids. In this work, we propose a novel method named HRPredict, which employs a word vector model to numerically represent the proteins encoded on plasmid genomes. Since it is difficult to confirm which hosts a particular plasmid definitely cannot enter, we framed the prediction of whether a plasmid can enter a specific bacterium as a learning task without negative samples. Using multiple one-class support vector machine (SVM) models, which do not require negative samples for training, HRPredict predicts the host range of plasmids across 45 families, 56 genera and 56 species. For the benchmark test set, we constructed reliable negative samples for each host taxonomic unit via two indirect methods, and we found that the area under the curve (AUC), F1-score, recall, precision and accuracy of most taxonomic unit prediction models exceeded 0.9. Among the 13 broad-host-range plasmid types, HRPredict demonstrated greater coverage than HOTSPOT and PlasmidHostFinder, successfully predicting the majority of previously reported hosts. Through feature importance calculation for each SVM model, we found that genes closely related to the plasmid host range are involved in functions such as bacterial adaptability, pathogenicity and survival. These findings provide significant insight into the mechanisms by which bacteria adapt to diverse environments through plasmids. The HRPredict algorithm is expected to facilitate in-depth research on the spread of broad-host-range plasmids and enable host-range predictions for novel plasmids reconstructed from microbiome sequencing data.
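
The positive-only setup described in the abstract can be illustrated with a minimal sketch in Python using scikit-learn's OneClassSVM. This is not the HRPredict implementation: the random toy embeddings standing in for a trained word vector model, the mean pooling of protein vectors, the taxon names (TaxonA, TaxonB), the protein tokens (famA_*, famB_*) and the hyperparameters (nu, gamma) are all assumptions made for illustration. It only shows the general idea of training one model per host taxon on positive samples and reporting every taxon whose model accepts a plasmid.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
DIM = 16          # embedding size (assumed)
N_FAMILIES = 30   # protein families per toy group (assumed)


def make_embeddings(prefix, centre):
    """Toy stand-in for a trained word vector model: token -> vector."""
    return {
        f"{prefix}{i}": centre + 0.3 * rng.standard_normal(DIM)
        for i in range(N_FAMILIES)
    }


# Two hypothetical groups of protein families with distinct embeddings.
embeddings = {}
embeddings.update(make_embeddings("famA_", np.full(DIM, 1.0)))
embeddings.update(make_embeddings("famB_", np.full(DIM, -1.0)))


def plasmid_vector(proteins):
    """Represent a plasmid as the mean of its protein vectors (assumed pooling)."""
    return np.mean([embeddings[p] for p in proteins], axis=0)


def sample_plasmid(prefix, n_proteins=10):
    """Draw a toy plasmid carrying a random subset of one protein group."""
    ids = rng.choice(N_FAMILIES, size=n_proteins, replace=False)
    return [f"{prefix}{i}" for i in ids]


# Positive-only training data: plasmids recorded in each (hypothetical) taxon.
positives = {
    "TaxonA": [sample_plasmid("famA_") for _ in range(60)],
    "TaxonB": [sample_plasmid("famB_") for _ in range(60)],
}

# One one-class SVM per host taxon; no negative samples are needed for fitting.
models = {}
for taxon, plasmids in positives.items():
    X = np.stack([plasmid_vector(p) for p in plasmids])
    models[taxon] = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(X)


def predict_host_range(proteins):
    """Return every taxon whose model accepts the plasmid (predict == +1)."""
    x = plasmid_vector(proteins).reshape(1, -1)
    return [taxon for taxon, model in models.items() if model.predict(x)[0] == 1]


# Query plasmids carrying the full protein set of one toy group.
query_a = [f"famA_{i}" for i in range(N_FAMILIES)]
query_b = [f"famB_{i}" for i in range(N_FAMILIES)]
print(predict_host_range(query_a))
print(predict_host_range(query_b))
```

Because each taxon has its own independent model, a plasmid may be accepted by several models, which corresponds to a broad predicted host range, or by none. A multi-class classifier trained on the single recorded host of each plasmid would instead have to treat the absence of a record as a negative label.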

Authors

  • Tao Feng
    School of Pharmacy, Anhui University of Chinese Medicine, Anhui Key Laboratory of Modern Chinese Materia Medica, Hefei 230012, People's Republic of China. tfeng@mail.scuec.edu.cn; wanggk@ahtcm.edu.cn.
  • Xirao Chen
    Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, 510280, PR China.
  • Shufang Wu
    State Key Laboratory for Turbulence and Complex Systems and Department of Biomedical Engineering, College of Engineering, Peking University, No.5 Yiheyuan Road Haidian District, Beijing 100871, China.
  • Waijiao Tang
    Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, 510280, PR China.
  • Hongwei Zhou
    Institute of Basic Research in Clinical Medicine, China Academy of Chinese Medical Sciences, Beijing, 100700, China.
  • Zhencheng Fang
    State Key Laboratory for Turbulence and Complex Systems and Department of Biomedical Engineering, College of Engineering, Peking University, No.5 Yiheyuan Road Haidian District, Beijing 100871, China.