High-Fidelity Data Retrieval from Synthetic DNA Pools via Machine Learning Model.

Journal: Small (Weinheim an der Bergstrasse, Germany)
Published Date:

Abstract

Synthetic DNA offers significant benefits for data storage, notably remarkable information density and long-term stability. The practical success of DNA data storage, however, depends on solving the challenge of selective retrieval: accessing specific data with high fidelity from complex DNA mixtures. Here, we present a machine learning method to achieve high-fidelity, isothermal selective data retrieval from synthetic DNA pools. We designed a toehold-triggered isothermal DNA storage system, wherein each data sequence is indexed by a unique stem-loop "lock" sequence. Complementary "key" oligos are designed to unlock these lock sequences. A machine learning model was trained on a comprehensive dataset generated from a synthetic DNA pool comprising 12,000 diverse 8-nt lock sequences. This library was designed to encompass nearly complete theoretical sequence diversity of 8-nt sequences, enabling the model to learn nucleotide sequence recognition specificity beyond conventional thermodynamic hybridization principles. As a result, the machine learning model generated highly specific keys, achieving a maximum improvement of 292-fold in the signal-to-noise ratio during the amplification of selected data sequences from a complex synthetic DNA pool. Thus, this isothermal selective retrieval method enables practical, low-energy DNA data storage. Concurrently, the machine learning model provides insight into DNA sequence specificity, extending its utility beyond data storage applications.
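
To make the lock/key terminology concrete, the following minimal Python sketch shows a naive, purely Watson-Crick key (the reverse complement of a lock) and how a fold improvement in signal-to-noise ratio could be computed. This is an illustration only, not the authors' method: the abstract states that the machine learning model designs keys that go beyond such thermodynamic-style rules. The function names, the example lock sequence, and the definition of signal-to-noise as on-target signal divided by off-target signal are all assumptions made for this sketch.

```python
# Illustrative sketch only; names and definitions are assumptions,
# not taken from the article.

# Translation table mapping each DNA base to its Watson-Crick partner.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    seq = seq.upper()
    if set(seq) - set("ACGT"):
        raise ValueError("sequence must contain only A, C, G, T")
    return seq.translate(_COMPLEMENT)[::-1]


def signal_to_noise(on_target: float, off_target: float) -> float:
    """Assumed definition: on-target signal divided by off-target signal."""
    if off_target <= 0:
        raise ValueError("off-target signal must be positive")
    return on_target / off_target


def fold_improvement(snr_new: float, snr_baseline: float) -> float:
    """Ratio of an improved signal-to-noise ratio to a baseline one."""
    if snr_baseline <= 0:
        raise ValueError("baseline signal-to-noise ratio must be positive")
    return snr_new / snr_baseline


# Hypothetical 8-nt lock sequence, chosen only for illustration.
lock = "AACCGGTA"
naive_key = reverse_complement(lock)  # baseline key, not an ML-designed key
```

Under these assumptions, a reported 292-fold improvement would correspond to `fold_improvement(snr_new, snr_baseline)` returning 292 for the best-performing key relative to its baseline.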

Authors
