Interpretable Steering of Large Language Models with Feature Guided Activation Additions
Journal: arXiv
Published Date: Jan 17, 2025
Abstract
Effective and reliable control over large language model (LLM) behavior is a
significant challenge. While activation steering methods, which add steering
vectors to a model's hidden states, are a promising approach, existing
techniques often lack precision and interpretability in how they influence
model outputs. We introduce Feature Guided Activation Additions (FGAA), a novel
activation steering method that leverages insights from Contrastive Activation
Addition (CAA) and Sparse Autoencoder-Targeted Steering (SAE-TS). By operating
in the latent space of a Sparse Autoencoder (SAE) and employing optimization
techniques to select desired SAE features, FGAA constructs precise steering
vectors that provide better steering effects while maintaining coherence of
steered model outputs. Evaluations on Gemma-2-2B and Gemma-2-9B models across
various steering tasks demonstrate that FGAA outperforms the existing steering
methods CAA, SAE decoder steering, and SAE-TS. Our results also
highlight important trade-offs between steering scale and general model
capabilities that are consistent across all tested steering methods.
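
As a rough illustration of the general activation steering idea referred to above (adding a scaled steering vector to hidden states), the following minimal sketch builds a vector as the mean difference between activations on contrasting examples, in the spirit of contrastive approaches such as CAA. It is not the FGAA procedure: it does no SAE feature selection or optimization. The function names, the toy activations, and the scale value are all assumptions chosen for illustration.

```python
# Toy sketch of mean-difference activation steering (illustrative only).
# All names and numbers here are hypothetical; this is not FGAA.

def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def contrastive_steering_vector(pos_acts, neg_acts):
    """Mean activation on positive examples minus mean on negative examples."""
    pos_mean = mean_vector(pos_acts)
    neg_mean = mean_vector(neg_acts)
    return [p - q for p, q in zip(pos_mean, neg_mean)]

def add_steering(hidden_states, vector, scale):
    """Add the scaled steering vector to the hidden state at every position."""
    return [[h + scale * v for h, v in zip(row, vector)] for row in hidden_states]

# Toy activations with hidden size 3 (assumed values, not real model data).
pos_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
neg_acts = [[0.0, 2.0, 0.0], [0.0, 2.0, 2.0]]
vector = contrastive_steering_vector(pos_acts, neg_acts)

# Two token positions of hidden states, steered with an assumed scale of 2.0.
hidden_states = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
steered = add_steering(hidden_states, vector, scale=2.0)
```

In this sketch the scale parameter plays the role of the steering scale mentioned above: larger values push the hidden states further along the steering direction.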