MVQ: Towards Efficient DNN Compression and Acceleration with Masked Vector Quantization
Journal:
arXiv
Published Date:
Dec 13, 2024
Abstract
Vector quantization (VQ) is a hardware-friendly DNN compression method that
can reduce the storage cost and weight-loading datawidth of hardware
accelerators. However, conventional VQ techniques lead to significant accuracy
loss because the important weights are not well preserved. To tackle this
problem, a novel approach called MVQ is proposed, which aims at better
approximating important weights with a limited number of codewords. At the
algorithm level, our approach removes the less important weights through N:M
pruning and then minimizes the vector clustering error between the remaining
weights and codewords by the masked k-means algorithm. Only distances between
the unpruned weights and the codewords are computed, which are then used to
update the codewords. At the architecture level, our accelerator implements
vector quantization on an EWS (enhanced weight stationary) CNN accelerator and
proposes a sparse systolic array design to maximize the benefits brought by
masked vector quantization. Our algorithm is validated on various models for
image classification, object detection, and segmentation tasks. Experimental
results demonstrate that MVQ not only outperforms conventional vector
quantization methods at comparable compression ratios but also reduces FLOPs.
Under ASIC evaluation, our MVQ accelerator boosts energy efficiency by
2.3$\times$ and reduces the size of the systolic array by 55\% when compared
with the base EWS accelerator. Compared with previous sparse accelerators,
MVQ achieves 1.73$\times$ higher energy efficiency.
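
The algorithm-level idea described above (N:M pruning followed by masked k-means) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `nm_prune_mask`, `masked_dist`, and `masked_kmeans` are hypothetical, pruning by weight magnitude is an assumed importance criterion, and the per-dimension masked mean used in the update step is an assumption chosen because it minimizes the masked squared error for a fixed assignment.

```python
import random


def nm_prune_mask(vec, n, m):
    """Return a 0/1 mask that keeps the n largest-magnitude weights
    in every consecutive group of m weights (N:M pruning).
    Magnitude as the importance measure is an assumption."""
    mask = []
    for start in range(0, len(vec), m):
        group = vec[start:start + m]
        keep = sorted(range(len(group)), key=lambda i: -abs(group[i]))[:n]
        mask.extend(1 if i in keep else 0 for i in range(len(group)))
    return mask


def masked_dist(vec, mask, code):
    """Squared distance that counts only the unpruned positions."""
    return sum(mk * (v - c) ** 2 for v, mk, c in zip(vec, mask, code))


def masked_kmeans(vectors, masks, k, iters=20, seed=0):
    """Cluster weight vectors while ignoring pruned entries.

    Returns (codebook, assign), where assign[i] is the codeword index
    chosen for vectors[i]."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # Assumed initialisation: k randomly chosen masked weight vectors.
    codebook = [[v * mk for v, mk in zip(vectors[i], masks[i])]
                for i in rng.sample(range(len(vectors)), k)]
    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest codeword under the masked distance.
        assign = [min(range(k), key=lambda j: masked_dist(v, mk, codebook[j]))
                  for v, mk in zip(vectors, masks)]
        # Update step: each codeword dimension becomes the mean of the
        # unpruned weights assigned to it (assumed update rule).
        for j in range(k):
            for t in range(dim):
                kept = [v[t] for v, mk, a in zip(vectors, masks, assign)
                        if a == j and mk[t]]
                if kept:  # leave the entry unchanged if nothing contributes
                    codebook[j][t] = sum(kept) / len(kept)
    return codebook, assign
```

In this sketch a compressed weight vector would be reconstructed as the elementwise product of its mask and its assigned codeword, so pruned positions stay zero and the codeword only has to approximate the weights that were kept.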