LSAQ: Layer-Specific Adaptive Quantization for Large Language Model Deployment
Journal:
arXiv
Published Date:
Dec 24, 2024
Abstract
As Large Language Models (LLMs) demonstrate exceptional performance across
various domains, deploying LLMs on edge devices has emerged as a new trend.
Quantization techniques, which reduce the size and memory requirements of LLMs,
are effective for deploying LLMs on resource-limited edge devices. However,
existing one-size-fits-all quantization methods often fail to dynamically
adjust the memory requirements of LLMs, limiting their applicability to
practical edge devices with varying computational resources. To tackle this
issue, we propose Layer-Specific Adaptive Quantization (LSAQ), a system for
adaptive quantization and dynamic deployment of LLMs based on layer importance.
Specifically, LSAQ evaluates the importance of LLMs' neural layers by
constructing top-k token sets from the inputs and outputs of each layer and
calculating their Jaccard similarity. Based on layer importance, our system
adaptively adjusts quantization strategies in real time according to the
computational resources of edge devices, applying higher quantization
precision to layers with higher importance, and vice versa. Experimental
results show that LSAQ consistently outperforms the selected quantization
baselines in terms of perplexity and zero-shot tasks. Additionally, it can
devise appropriate quantization schemes for different usage scenarios to
facilitate the deployment of LLMs.
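
The two ideas summarized in the abstract, scoring layers by the Jaccard similarity of top-k token sets and giving higher precision to more important layers under a memory budget, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each layer's input and output hidden states have already been projected to vocabulary scores, that importance is taken as one minus the Jaccard similarity (the abstract does not state the direction), and that only two precisions (4-bit and 8-bit) are available. All function names and parameters here are hypothetical.

```python
from typing import List, Sequence, Set


def top_k_tokens(scores: Sequence[float], k: int) -> Set[int]:
    """Return the ids of the k highest-scoring tokens as a set."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined as 1.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def layer_importance(in_scores: Sequence[float],
                     out_scores: Sequence[float], k: int) -> float:
    # Assumption (not stated in the abstract): a layer that changes the
    # top-k token set more is treated as more important.
    return 1.0 - jaccard(top_k_tokens(in_scores, k), top_k_tokens(out_scores, k))


def assign_bits(importance: List[float], params_per_layer: List[int],
                budget_bytes: float, low_bits: int = 4,
                high_bits: int = 8) -> List[int]:
    """Greedy sketch: start every layer at low_bits, then upgrade layers in
    descending order of importance while the total stays within budget."""
    bits = [low_bits] * len(importance)
    used = sum(p * low_bits / 8 for p in params_per_layer)
    if used > budget_bytes:
        raise ValueError("budget too small even at the lowest precision")
    ranked = sorted(range(len(importance)),
                    key=lambda i: importance[i], reverse=True)
    for i in ranked:
        extra = params_per_layer[i] * (high_bits - low_bits) / 8
        if used + extra <= budget_bytes:
            bits[i] = high_bits
            used += extra
    return bits
```

For example, with three layers of 1000 parameters each and a 2000-byte budget, `assign_bits([0.1, 0.9, 0.5], [1000, 1000, 1000], 2000)` keeps all layers at 4 bits except the most important one, which is upgraded to 8 bits. A real system would also need to account for activation memory and quantization metadata, which this sketch ignores.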