The Tsetlin Machine: A "Third Way" in QSAR Modeling.
Journal:
Journal of Chemical Information and Modeling
Published Date:
May 28, 2026
Abstract
Advances in Quantitative Structure-Activity Relationship (QSAR) modeling are led by two core paradigms: (1) descriptor engineering, where complex fixed-length vector representations of compounds are generated and conventional machine learning (ML) methods are applied to them, and (2) graphical chemical inputs (e.g., Simplified Molecular Input Line Entry System (SMILES) strings or 2D graphs) provided to deep learning neural network (NN) models, which construct their own internal representations of molecules and learn iteratively over them. Here we present the Tsetlin Machine (TM), which combines the accuracy and ease of use of existing rule-based QSAR ML methods (e.g., RF and XGBoost) with the iterative learning of NN algorithms and intrinsic interpretability. The TM uses teams of finite-state automata that capture frequent patterns as propositional logic expressions (clauses) via reinforcement learning. The benchmarking pipeline presented here demonstrates that TM-QSAR coupled with ECFP4 descriptors frequently outperforms existing rule-based QSAR methods on ROC-AUC, PRC-AUC and PPV, with a high capacity for interscaffold generalization. However, due to the binary nature of TM-QSAR, its performance is currently limited when discretised continuous descriptors are used. TM-QSAR achieved particularly strong classification scores for MOR (ROC-AUC = 0.87, PRC-AUC = 0.77) and CYPA4 (ROC-AUC = 0.92, PRC-AUC = 0.63) when compared to RF and XGBoost. Using the TM in combination with substructural fingerprint descriptors allows for an interpretability suite that can be extracted directly from clauses. Here we detail molecule property maps (TM-MPM) to view atom-wise TM-QSAR bioactivity contributions for single molecules and closed-form WAC scores (Weights × Activations × Clauses) for descriptor-wise contributions to regions of predicted chemical space.
These methods show strong alignment of TM-QSAR interpretations with known ligand-protein interactions of the MOR target and give nonlinear, conditional interpretations for greater predicted bioactivity. Given this combination of accuracy, computational efficiency and interpretability, we provide a basis for TM-QSAR to be explored as a standard methodology in virtual screening toolkits.
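The clause-based prediction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the fingerprint, the clauses, the weights and the decision threshold are all hypothetical values chosen for illustration, and the automata-based learning step that would produce such clauses is not shown. Each clause is a conjunction of literals over binary fingerprint bits, and the class score is the sum of the signed weights of the clauses that fire.

```python
from typing import List, Tuple

# A clause is a conjunction of literals; each literal is
# (bit_index, required_value), so (4, 0) means "bit 4 must be absent".
Clause = List[Tuple[int, int]]


def clause_fires(clause: Clause, bits: List[int]) -> bool:
    """Return True when every literal in the clause matches the input bits."""
    return all(bits[index] == value for index, value in clause)


def class_score(clauses: List[Clause], weights: List[int], bits: List[int]) -> int:
    """Sum the signed weights of all clauses that fire on the input."""
    return sum(w for c, w in zip(clauses, weights) if clause_fires(c, bits))


# Hypothetical 6-bit substructure fingerprint (1 = substructure present).
fingerprint = [1, 0, 1, 1, 0, 0]

# Hypothetical clauses, as if already learned.
clauses = [
    [(0, 1), (2, 1)],  # bit 0 AND bit 2
    [(3, 1), (4, 0)],  # bit 3 AND NOT bit 4
    [(1, 1)],          # bit 1
    [(5, 0), (0, 1)],  # NOT bit 5 AND bit 0
]

# Hypothetical signed weights: positive votes for "active", negative against.
weights = [3, 2, 4, -1]

score = class_score(clauses, weights, fingerprint)
predicted_active = score >= 0  # assumed decision threshold of zero
```

Because every clause is an explicit conjunction over named fingerprint bits, the clauses that fired for a given molecule can be read off directly, which is the property the interpretability methods in the abstract rely on.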