EdgeLLM: A Highly Efficient CPU-FPGA Heterogeneous Edge Accelerator for Large Language Models
Journal: arXiv
Published Date: Jul 31, 2024
Abstract
Rapid advances in artificial intelligence (AI), particularly large language
models (LLMs), have profoundly changed how we work and communicate. However,
deploying LLMs on resource-constrained edge devices (such as robots) remains
challenging because of their intensive computation requirements, heavy memory
access, diverse operator types, and difficulties in compilation. In this work,
we proposed EdgeLLM to address the
above issues. First, focusing on computation, we designed a mixed-precision
processing element array together with a group systolic architecture that can
efficiently support both FP16*FP16 for the MHA (Multi-Head Attention) block and
FP16*INT4 for the FFN (Feed-Forward Network) layer. In addition, a specific
optimization for log-scale structured weight sparsity was used to further
increase efficiency. Second, to address the compilation and deployment
issue, we analyzed all of the operators in LLMs and developed a universal data
parallelism scheme in which all input and output features maintain the same
data shape, making it possible to process different operators without any data
rearrangement. We then proposed an end-to-end compiler to map the whole LLM
onto a CPU-FPGA heterogeneous system (AMD Xilinx VCU128 FPGA).
The accelerator achieves 1.91x higher throughput and 7.55x higher energy
efficiency than a commercial GPU (NVIDIA A100-SXM4-80G). Compared with the
state-of-the-art FPGA accelerator FlightLLM, it shows 10-24% better
performance in terms of HBM bandwidth utilization, energy efficiency, and LLM
throughput.
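
As a software illustration of the FP16*INT4 product mentioned for the FFN layer, the following minimal NumPy sketch multiplies FP16 activations by INT4-quantized weights. It is not the EdgeLLM hardware design: the abstract does not specify the quantization scheme, so the symmetric group-wise quantization, the group size, and the function names `quantize_int4` and `matmul_fp16_int4` are all assumptions made for this example.

```python
import numpy as np


def quantize_int4(w, group_size=2):
    """Quantize a (K, N) weight matrix to INT4 values in [-8, 7].

    Assumed scheme (not taken from the paper): symmetric quantization with
    one FP16 scale per group of `group_size` rows and per output column.
    INT4 values are stored in an int8 array for simplicity.
    """
    k, n = w.shape
    assert k % group_size == 0, "K must be a multiple of group_size"
    groups = w.astype(np.float32).reshape(k // group_size, group_size, n)
    max_abs = np.abs(groups).max(axis=1, keepdims=True)
    # Store scales in FP16, then quantize against the stored value.
    scale = (max_abs / 7.0).astype(np.float16)
    scale = np.where(scale == 0, np.float16(1.0), scale)  # avoid divide-by-zero
    q = np.rint(groups / scale.astype(np.float32))
    q = np.clip(q, -8, 7).astype(np.int8)
    return q.reshape(k, n), scale.reshape(k // group_size, n)


def matmul_fp16_int4(x, q, scale, group_size=2):
    """Compute x @ dequant(q) with FP16 inputs/outputs and FP32 accumulation."""
    k, n = q.shape
    # Dequantize: broadcast each group's scale over its rows.
    w = q.reshape(k // group_size, group_size, n).astype(np.float32)
    w = (w * scale.astype(np.float32)[:, None, :]).reshape(k, n)
    return (x.astype(np.float32) @ w).astype(np.float16)


# Usage: weights chosen so that every group has max |w| = 7 (scale = 1).
w = np.array([[7, -7], [3, 2], [-7, 7], [1, -4]], dtype=np.float16)
x = np.array([[1, 2, 3, 4]], dtype=np.float16)
q, scale = quantize_int4(w)
y = matmul_fp16_int4(x, q, scale)
```

In this sketch the weights are dequantized before the multiply for clarity; a hardware processing element would more plausibly multiply the FP16 activation by the INT4 value directly and apply the scale afterwards, but that detail is not given in the abstract.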