Learn to explain transformer via interpretation path by reinforcement learning.
Journal: Neural Networks: The Official Journal of the International Neural Network Society
Published Date: Apr 29, 2025
Abstract
In recent years, the Transformer model has become a key component of many AI systems, making it important to understand how it works. The large parameter count and complex structure of the Transformer make interpretation more difficult and less efficient. Fortunately, Transformers contain many internal variables that can aid the explanation process, including attention matrices, gradients, hidden states, and activations between layers. Effectively utilizing these internal variables can help us better understand the decision-making process of Transformers. However, most existing works focus on only one type of these features and fail to investigate the interpretability of different variables within a unified model. To address these issues, this paper introduces a Reinforcement Learning environment in which an agent makes step-by-step modifications to input sequences, constructing perturbed samples that gradually reduce the model's confidence in its classification labels. The environment guides the agent to choose a token modification strategy along a more targeted interpretation path instead of relying on random sampling, which significantly improves interpretation effectiveness. The flexibly designed agent can use multiple internal variables, or even combinations of variables, as observations, allowing their contributions to the model's interpretability to be compared. Extensive experiments on three real-world datasets demonstrate the superior performance of our proposed model in both model interpretation and adversarial attack tasks. We obtain a set of interesting findings that can inspire further research on the interpretation of Transformers and Transformer-based models. The code for this paper is available at https://github.com/niuzaisheng/Learn-to-Explain.
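To make the described setup more concrete, the following is a minimal sketch of the kind of environment the abstract outlines: at each step the agent perturbs one token, observes an internal variable of the classifier (here, last-layer attention), and is rewarded by the drop in the model's confidence in its original label. All names (TokenPerturbEnv, reset, step) and the masking-based perturbation are illustrative assumptions, not the authors' implementation; their actual code is at the GitHub link above.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

class TokenPerturbEnv:
    """Illustrative step-by-step token perturbation environment (not the paper's API)."""

    def __init__(self, model_name: str = "bert-base-uncased"):
        # Placeholder model name; any sequence classifier with attention outputs works.
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.model.eval()

    def reset(self, text: str):
        # Encode the input and record the model's original label and confidence.
        self.inputs = self.tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = self.model(**self.inputs, output_attentions=True)
        probs = torch.softmax(out.logits, dim=-1)
        self.label = int(probs.argmax(dim=-1))
        self.confidence = float(probs[0, self.label])
        # Observation could be any internal variable (attention, gradients,
        # hidden states, activations); here we return last-layer attention.
        return out.attentions[-1]

    def step(self, token_index: int):
        # Action: mask the chosen token, one possible modification strategy.
        self.inputs["input_ids"][0, token_index] = self.tokenizer.mask_token_id
        with torch.no_grad():
            out = self.model(**self.inputs, output_attentions=True)
        probs = torch.softmax(out.logits, dim=-1)
        new_confidence = float(probs[0, self.label])
        # Reward: how much confidence in the original label dropped this step.
        reward = self.confidence - new_confidence
        self.confidence = new_confidence
        done = int(probs.argmax(dim=-1)) != self.label  # original label flipped
        return out.attentions[-1], reward, done

Swapping the returned observation for gradients, hidden states, or a combination of variables would correspond to the abstract's comparison of different internal variables as agent observations.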