Negative dataset selection impacts machine learning-based predictors for multiple bacterial species promoters.
Journal: Bioinformatics (Oxford, England)
PMID: 40152247
Abstract
MOTIVATION: Advances in machine learning-based bacterial promoter predictors have greatly improved identification metrics. However, existing models have overlooked the impact of negative dataset selection, which was previously identified in single-species models as GC-content discrepancies between positive and negative datasets. This study aims to investigate whether multiple-species models for promoter classification are inherently biased due to the selection criteria of negative datasets. We further explore whether generating synthetic random sequences (SRS) that mimic the GC-content distribution of promoters can partly reduce this bias.
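
The abstract does not say how the SRS are generated, so the following is only a minimal sketch of one plausible approach: for each promoter, draw a random sequence of the same length whose expected GC content equals that promoter's GC content. The function names (gc_content, synthetic_random_sequence, make_srs_negatives), the per-sequence matching strategy, and the fixed seed are all assumptions made for illustration, not the authors' method.

```python
import random


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a DNA sequence (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def synthetic_random_sequence(promoter: str, rng: random.Random) -> str:
    """Sample a random sequence with the same length as `promoter`.

    Each base is drawn independently so that the expected GC content
    matches the promoter's GC content (assumed matching strategy).
    """
    gc = gc_content(promoter)
    # Weights in the order A, C, G, T: GC mass split evenly over C and G,
    # the remaining mass split evenly over A and T.
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]
    return "".join(rng.choices("ACGT", weights=weights, k=len(promoter)))


def make_srs_negatives(promoters, seed=0):
    """Build one synthetic negative per promoter (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    return [synthetic_random_sequence(p, rng) for p in promoters]
```

Because bases are sampled independently, each individual negative matches its promoter's GC content only in expectation; over a large dataset the GC-content distribution of the negatives should approximate that of the positives. A design that needs exact per-sequence matching could instead shuffle the promoter's own bases, at the cost of also preserving its exact base composition.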