SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation
Journal: arXiv
Published Date: Jan 3, 2025
Abstract
As advancements in large language models (LLMs) continue and the demand for
personalized models increases, parameter-efficient fine-tuning (PEFT) methods
(e.g., LoRA) will become essential due to their efficiency in reducing
computation costs. However, recent studies have raised alarming concerns that
LoRA fine-tuning could potentially compromise the safety alignment in LLMs,
posing significant risks for the model owner. In this paper, we first
investigate the underlying mechanism by analyzing how safety-alignment-related
features change before and after fine-tuning. We then propose a
fixed safety module computed from safety data, together with a task-specific
initialization for the trainable parameters in low-rank adaptations, termed
Safety-alignment preserved Low-Rank Adaptation (SaLoRA). Unlike previous LoRA
methods and their variants, SaLoRA enables targeted modifications to LLMs
without disrupting their original alignment. Our experiments show that SaLoRA
outperforms a range of adapter-based approaches across multiple evaluation
metrics and fine-tuning tasks.
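
To make the idea of a fixed safety module next to a trainable low-rank update more concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not specify the form of the module, so the sketch assumes it is a fixed linear projection applied to the adapter's output, and it approximates the safety-related feature with a single direction (the normalized mean of layer outputs on safety prompts). All function names here (`safety_direction`, `project_out`, `adapted_forward`) are hypothetical, and the task-specific initialization is not modeled.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, v) for row in M]


def safety_direction(safety_outputs):
    """Unit vector along the mean layer output on safety prompts.

    Assumption: one direction stands in for the safety-related features;
    the abstract does not say how those features are extracted.
    """
    n = len(safety_outputs)
    d = len(safety_outputs[0])
    mean = [sum(h[i] for h in safety_outputs) / n for i in range(d)]
    norm = math.sqrt(dot(mean, mean))
    return [m / norm for m in mean]


def project_out(u, v):
    """Apply the fixed projector (I - u u^T) to v, for a unit vector u."""
    c = dot(u, v)
    return [vi - c * ui for vi, ui in zip(v, u)]


def adapted_forward(W, A, B, u, x):
    """Compute h = W x + (I - u u^T) B A x.

    W and u are frozen; A and B are the trainable low-rank factors.
    """
    base = matvec(W, x)
    delta = project_out(u, matvec(B, matvec(A, x)))
    return [b + d for b, d in zip(base, delta)]


# Toy example: 3-dimensional layer, rank-1 adapter.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen weights
A = [[1.0, 1.0, 1.0]]                                    # trainable, 1 x 3
B = [[2.0], [3.0], [4.0]]                                # trainable, 3 x 1
u = safety_direction([[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]])
x = [1.0, 2.0, 3.0]

h = adapted_forward(W, A, B, u, x)
print(h)  # [1.0, 20.0, 27.0]
```

In this toy setup the adapter's raw update along the assumed safety direction (12.0 in the first coordinate) is removed by the projector, so the output along that direction equals that of the frozen model, while the other coordinates still receive the low-rank update.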