Toward Faithful Neural Network Intrinsic Interpretation With Shapley Additive Self-Attribution
Journal:
IEEE Transactions on Neural Networks and Learning Systems
Published Date:
May 29, 2025
Abstract
Self-interpreting neural networks have attracted significant attention from the research community. Along this line, numerous works inherently share the intuitive principle of linear contribution aggregation from diverse perspectives, yet they often: 1) lack a solid theoretical foundation ensuring genuine interpretability and 2) compromise model expressiveness. In response, we propose a generic additive self-attribution (ASA) framework that encapsulates the characteristics of various works in this field and highlights the absence of Shapley value attribution among them. To fill this gap, we propose a novel Shapley additive self-attributing neural network (SASANet). SASANet models meaningful outputs for an arbitrary number of observable features, naturally yielding an unapproximated value function for the Shapley value. By designing an intermediate sequential schema based on marginal contributions (MCs) and an internal distillation procedure, we theoretically prove that the intermediate self-attribution values converge to the output's Shapley values. Finally, we conduct extensive experiments on multiple public datasets. The results demonstrate that SASANet, while highly interpretable, outperforms existing self-attributing models and is comparable to commonly adopted closed-box models. In addition, compared with post hoc interpretation methods, SASANet's self-attribution provides a more accurate and efficient interpretation of its own predictions. To the best of the authors' knowledge, this is the first self-interpreting neural network structure that achieves modelwise Shapley attribution. Our code is available at: https://anonymous.4open.science/r/SASANet-B343.
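As background for the ASA framing above: the Shapley value of feature i under a value function v over feature set N is phi_i = sum over S in N\{i} of |S|!(|N|-|S|-1)!/|N|! * (v(S ∪ {i}) - v(S)), and the additive form means the prediction decomposes as the sum of per-feature attributions. The sketch below is a minimal PyTorch illustration of that generic additive self-attribution form only; the module, its names, and its architecture are illustrative assumptions, not the authors' SASANet implementation.

```python
import torch
import torch.nn as nn


class AdditiveSelfAttribution(nn.Module):
    """Minimal additive self-attribution (ASA) sketch: the prediction is the
    sum of per-feature attribution scores plus a global bias, so each phi_i
    is, by construction, feature i's exact contribution to the output.
    Illustrative only; this is not the paper's SASANet architecture."""

    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        # A scoring network shared across features: each feature's value,
        # concatenated with a learned index embedding, maps to one scalar.
        self.feature_embed = nn.Embedding(n_features, hidden)
        self.scorer = nn.Sequential(
            nn.Linear(hidden + 1, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )
        self.bias = nn.Parameter(torch.zeros(1))
        self.n_features = n_features

    def forward(self, x: torch.Tensor):
        # x: (batch, n_features). Build per-feature inputs [embedding | value].
        idx = torch.arange(self.n_features, device=x.device)
        emb = self.feature_embed(idx).expand(x.size(0), -1, -1)  # (B, F, H)
        inp = torch.cat([emb, x.unsqueeze(-1)], dim=-1)          # (B, F, H+1)
        phi = self.scorer(inp).squeeze(-1)                       # (B, F) attributions
        y = phi.sum(dim=-1) + self.bias                          # additive aggregation
        return y, phi


model = AdditiveSelfAttribution(n_features=10)
x = torch.randn(4, 10)
y, phi = model(x)
# Efficiency-style check: attributions sum exactly to the output minus bias,
# which is the property the ASA framework builds in by construction.
assert torch.allclose(y, phi.sum(-1) + model.bias)
```

Note that this additive form alone does not make the attributions Shapley values; the paper's contribution is the intermediate marginal-contribution schema and distillation procedure that provably drive the self-attribution values to the Shapley values.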