Learn to explain transformer via interpretation path by reinforcement learning.
Journal: Neural Networks: The Official Journal of the International Neural Network Society
Published Date: Apr 29, 2025
Abstract
In recent years, the Transformer model has become a key component of many AI systems, making it important to understand how it works. The large parameter count and complex structure of the Transformer make interpretation more difficult and less efficient. Fortunately, Transformers contain many internal variables that can aid the explanation process, including attention matrices, gradients, hidden states, and activations between layers. Effectively utilizing these internal variables can help us better understand the decision-making process of Transformers. However, most existing works focus on only one type of these features and fail to investigate the interpretability of different variables within a unified model. To address these issues, this paper introduces a Reinforcement Learning environment in which an agent makes step-by-step modifications to input sequences, constructing perturbed samples that gradually reduce the model's confidence in its classification labels. The environment guides the agent to choose a token modification strategy along a more targeted interpretation path instead of relying on random sampling, which significantly improves interpretation effectiveness. The flexibly designed agent can use multiple internal variables, or even combinations of variables, as observations, allowing their contributions to the model's interpretability to be compared. Extensive experiments on three real-world datasets demonstrate the superior performance of our proposed model in both model interpretation and adversarial attack tasks. We obtain a set of interesting findings that can inspire further research on the interpretation of Transformers and Transformer-based models. The code for this paper is available at https://github.com/niuzaisheng/Learn-to-Explain.
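To make the described setup more concrete, the following is a minimal sketch of the kind of environment the abstract outlines: at each step the agent perturbs one token, observes an internal variable of the classifier (here, last-layer attention), and is rewarded by the drop in the model's confidence in its original label. All names (TokenPerturbEnv, reset, step) and the masking-based perturbation are illustrative assumptions, not the authors' implementation; their actual code is at the GitHub link above.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

class TokenPerturbEnv:
    """Illustrative step-by-step token perturbation environment (not the paper's API)."""

    def __init__(self, model_name: str = "bert-base-uncased"):
        # Placeholder model name; any sequence classifier with attention outputs works.
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.model.eval()

    def reset(self, text: str):
        # Encode the input and record the model's original label and confidence.
        self.inputs = self.tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = self.model(**self.inputs, output_attentions=True)
        probs = torch.softmax(out.logits, dim=-1)
        self.label = int(probs.argmax(dim=-1))
        self.confidence = float(probs[0, self.label])
        # Observation could be any internal variable (attention, gradients,
        # hidden states, activations); here we return last-layer attention.
        return out.attentions[-1]

    def step(self, token_index: int):
        # Action: mask the chosen token, one possible modification strategy.
        self.inputs["input_ids"][0, token_index] = self.tokenizer.mask_token_id
        with torch.no_grad():
            out = self.model(**self.inputs, output_attentions=True)
        probs = torch.softmax(out.logits, dim=-1)
        new_confidence = float(probs[0, self.label])
        # Reward: how much confidence in the original label dropped this step.
        reward = self.confidence - new_confidence
        self.confidence = new_confidence
        done = int(probs.argmax(dim=-1)) != self.label  # original label flipped
        return out.attentions[-1], reward, done

Swapping the returned observation for gradients, hidden states, or a combination of variables would correspond to the abstract's comparison of different internal variables as agent observations.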