Low-Bit Integerization of Vision Transformers using Operand Reordering for Efficient Hardware
Journal: arXiv
Published Date: Apr 11, 2025
Abstract
Pre-trained vision transformers have achieved remarkable performance across
various visual tasks but suffer from expensive computational and memory costs.
While model quantization reduces memory usage by lowering precision, these
models still incur significant computational overhead due to dequantization
before matrix operations. In this work, we analyze the computation graph and
propose an integerization process based on operation reordering. Specifically,
the process delays dequantization until after matrix operations. This enables
integerized matrix multiplication and linear modules by directly processing the
quantized input. To validate our approach, we synthesize the self-attention
module of ViT on systolic-array-based hardware. Experimental results show
that our low-bit inference reduces per-PE power consumption for linear layers
and matrix multiplication, bridging the gap between quantized models and
efficient inference.
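
As an illustration of the reordering idea, the following minimal sketch compares a dequantize-then-multiply baseline with a version that multiplies the integer operands first and dequantizes once afterwards. It assumes symmetric per-tensor quantization with a single scale per operand; the function names (`quantize`, `baseline_linear`, `reordered_linear`), the 4-bit width, and the scale values are hypothetical choices for this example and are not taken from the paper. Under that assumption, (s_x X_q)(s_w W_q) = s_x s_w (X_q W_q), so the matrix product itself can run entirely in integers.

```python
def quantize(matrix, scale, bits=4):
    """Symmetric per-tensor quantization to signed integers (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -qmax - 1
    return [[max(qmin, min(qmax, round(v / scale))) for v in row]
            for row in matrix]


def dequantize(matrix, scale):
    """Map integer values back to real values with a single scale."""
    return [[v * scale for v in row] for row in matrix]


def matmul(a, b):
    """Plain matrix product; stays in integers when both inputs are integers."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def baseline_linear(x_q, w_q, s_x, s_w):
    # Baseline order: dequantize both operands, then multiply in floating point.
    return matmul(dequantize(x_q, s_x), dequantize(w_q, s_w))


def reordered_linear(x_q, w_q, s_x, s_w):
    # Reordered: integer-only matrix product, then one dequantization
    # of the accumulator with the combined scale s_x * s_w.
    acc = matmul(x_q, w_q)
    return dequantize(acc, s_x * s_w)


# Hypothetical inputs; power-of-two scales keep the float arithmetic exact.
s_x, s_w = 0.25, 0.5
x_q = quantize([[0.5, -1.0], [1.25, 0.75]], s_x)
w_q = quantize([[1.0, -0.5], [0.5, 1.5]], s_w)

out_baseline = baseline_linear(x_q, w_q, s_x, s_w)
out_reordered = reordered_linear(x_q, w_q, s_x, s_w)
```

For these inputs both paths give the same result, but only the reordered path keeps the multiply-accumulate work in low-bit integer arithmetic, which is the part a systolic-array processing element would execute. Per-channel scales, zero points, and the non-linear parts of self-attention are outside the scope of this sketch.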