Harnessing Contextual Embeddings: A Deep Learning Framework for Predicting PCR Amplification Using BERT Tokenization
Journal: bioRxiv
Published Date: Jan 1, 2025
Abstract
Polymerase Chain Reaction (PCR) is a widely used molecular biology technique for amplifying DNA sequences. PCR amplification is affected by factors such as binding dynamics and primer-template interactions. This study aims to reduce the time and cost of PCR experiments by predicting their outcomes from these factors. To achieve this, we first identify the most stable binding site for each primer-template pair by calculating the Gibbs free energy. We then propose a labelling strategy that captures primer-template interactions at the binding sites by analyzing match and mismatch positions. We categorize a set of English words into two semantically distinct groups: one for match positions and another for mismatch positions. Words within each group have a higher cosine similarity to one another than to words in the opposing group. Each base pair is assigned a word from the corresponding group according to whether it is a match or a mismatch. The labelled sequence is then tokenized with the BERT tokenizer and used as input to a CNN-BiLSTM model. Achieving 84.8% accuracy, this approach outperforms prior methods and introduces BERT-based analysis of primer-template binding. Crucially, the model also demonstrates significantly better sensitivity, specificity, and Area Under the ROC Curve (AUC) than prior work, indicating a more robust ability to distinguish successful from failed PCR outcomes, which is vital for reliable experimental prediction.

Highlights
Selecting the most important features for PCR amplification using a Random Forest classifier
Proposing a new labelling approach to represent the matches and mismatches between PCR primers and templates
Using the BERT tokenizer to tokenize the corresponding representation of matches and mismatches
Augmenting the data based on the semantic similarities of the words in the BERT tokenizer
Using a CNN-BiLSTM to predict PCR amplification results
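
As a minimal sketch of the labelling strategy described in the abstract, the Python snippet below maps each aligned primer-template position to a word from a "match" group or a "mismatch" group. The word lists, the function name `label_binding_site`, and the convention that the template site is already aligned to the primer and written 3'->5' are all assumptions made for illustration; the abstract does not give the actual vocabulary or alignment details. Drawing different words from the same group is shown as one plausible way to realise the augmentation step, not as the authors' confirmed procedure.

```python
import random

# Hypothetical word groups: the abstract only says the two groups are
# semantically distinct, not which words were used.
MATCH_WORDS = ["good", "great", "excellent"]
MISMATCH_WORDS = ["bad", "poor", "terrible"]

# Watson-Crick complements used to decide whether a position is a match.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def label_binding_site(primer, template_site, rng=None):
    """Return a space-separated word sequence for one primer-template pair.

    Assumes `primer` is written 5'->3' and `template_site` is the aligned
    binding site written 3'->5', so position i of one pairs with position i
    of the other. If `rng` is None the first word of each group is used;
    passing a random.Random instance samples a word from the group instead
    (an assumed form of synonym-based augmentation).
    """
    if len(primer) != len(template_site):
        raise ValueError("primer and template site must have equal length")

    words = []
    for p, t in zip(primer.upper(), template_site.upper()):
        is_match = COMPLEMENT.get(p) == t
        group = MATCH_WORDS if is_match else MISMATCH_WORDS
        words.append(group[0] if rng is None else rng.choice(group))
    return " ".join(words)


# Third position (G opposite A) is a mismatch; the others are complementary.
labelled = label_binding_site("ACGT", "TGAA")
print(labelled)  # good good bad good

# Augmented variant: same match/mismatch pattern, different word choices.
augmented = label_binding_site("ACGT", "TGAA", rng=random.Random(0))
print(augmented)
```

The resulting string is ordinary English text, so it could then be passed to a BERT tokenizer to obtain the token IDs that feed the downstream model.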