Entropy Fusion DNA: Alignment-Free Gene Fusion Detection through Entropy and Mutual Information Descriptors
Journal: bioRxiv
Published Date: May 30, 2026
Abstract
Gene fusions are clinically relevant genomic alterations and key cancer biomarkers. Their computational detection remains dominated by alignment-based pipelines, whose reliance on read mapping, reference annotations, and heuristic filtering makes them sensitive to mapping ambiguities, incomplete annotations, and repetitive regions, and prone to false positives. Recent machine learning (ML) strategies aim to learn fusion-related patterns directly from sequencing data, but their adoption is still limited by dataset-specific biases, synthetic data artifacts, class imbalance, and representations that may overlook the structural organization of biological sequences. Information-theoretic and statistical sequence descriptors remain underexplored as efficient tools for capturing informative structural signals in biological reads. In this work, we investigate whether fusion-related information can be inferred directly from the statistical organization of DNA sequences. Each sequence is encoded into a compact, interpretable, and alignment-free feature space combining Shannon and Rényi entropy, lagged and base-resolved mutual information, GC content, and rarefied k-mer richness descriptors. Our goal is to assess whether these information-theoretic features encode discriminative sequence signatures associated with fusion events. For discriminating fusion-derived from non-fusion sequences, nested cross-validation selected k-nearest neighbors as the most effective classifier, achieving strong held-out performance on the balanced benchmark (AUROC = 0.892, AUPRC = 0.865). The same representation was then evaluated on fusion-positive samples for fusion partner prediction and breakpoint localization, achieving strong top-k partner identification accuracy and stable breakpoint regression performance.
Moreover, a two-stage strategy in which the binary classifier first filters candidate reads further improved partner prediction, suggesting its use as an enrichment step for downstream fusion characterization. Although performance decreased under repeated fusion-pair-disjoint evaluation, it remained clearly above random expectation, supporting the transferability of the proposed descriptors to unseen fusion pairs. Breakpoint-centered validation further revealed increased local sequence complexity, altered short-range dependency structure, and modest but significant microhomology enrichment around fusion regions. These findings support an interpretable, alignment-free framework in which information-theoretic features provide predictive and biologically informative signals for gene fusion analysis.
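To make the feature space described in the abstract more concrete, the following is a minimal Python sketch of four of the named descriptors: Shannon entropy, Rényi entropy, lagged mutual information, and GC content. It is an illustration under stated assumptions, not the authors' implementation. The function names, the use of overlapping k-mers, base-2 logarithms, and the default parameters (`alpha=2.0`, lags 1 to 3) are all hypothetical choices; the base-resolved mutual information and rarefied k-mer richness descriptors are not covered.

```python
import math
from collections import Counter


def kmer_probs(seq, k=1):
    """Empirical frequencies of overlapping k-mers (assumed estimator)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return [c / total for c in counts.values()]


def shannon_entropy(seq, k=1):
    """Shannon entropy of the k-mer distribution, in bits."""
    return -sum(p * math.log2(p) for p in kmer_probs(seq, k))


def renyi_entropy(seq, alpha=2.0, k=1):
    """Renyi entropy of order alpha, in bits; alpha = 1 is the Shannon limit."""
    if alpha == 1.0:
        return shannon_entropy(seq, k)
    power_sum = sum(p ** alpha for p in kmer_probs(seq, k))
    return math.log2(power_sum) / (1.0 - alpha)


def lagged_mutual_information(seq, lag=1):
    """Mutual information (bits) between the bases at positions i and i + lag."""
    pairs = Counter(zip(seq, seq[lag:]))
    n = sum(pairs.values())
    if n == 0:
        return 0.0
    left, right = Counter(), Counter()
    for (a, b), c in pairs.items():
        left[a] += c
        right[b] += c
    # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
    return sum(
        (c / n) * math.log2(c * n / (left[a] * right[b]))
        for (a, b), c in pairs.items()
    )


def gc_content(seq):
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def describe(seq, lags=(1, 2, 3)):
    """Assemble a small feature dictionary for one non-empty DNA sequence."""
    features = {
        "shannon_k1": shannon_entropy(seq, 1),
        "renyi2_k1": renyi_entropy(seq, 2.0, 1),
        "gc": gc_content(seq),
    }
    for lag in lags:
        features[f"mi_lag{lag}"] = lagged_mutual_information(seq, lag)
    return features
```

Calling `describe("ACGTACGTTTGACC")` returns a dictionary with one value per descriptor, which could then be stacked into a feature matrix for a classifier such as k-nearest neighbors. Plug-in estimators like these are biased on short reads, so a real pipeline would likely need bias correction or length normalization.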