Ozaki Scheme II: A GEMM-oriented emulation of floating-point matrix multiplication using an integer modular technique
Journal: arXiv
Published Date: Apr 10, 2025
Abstract
This paper addresses emulation algorithms for matrix multiplication. General
Matrix-Matrix Multiplication (GEMM), a fundamental operation in the Basic
Linear Algebra Subprograms (BLAS), is typically optimized for specific hardware
architectures. The Ozaki scheme is a well-established GEMM-based emulation
method for matrix multiplication, wherein input matrices are decomposed into
several low-precision components to ensure that the resulting matrix product is
computed exactly through numerical operations. This study proposes a novel
GEMM-based emulation method for matrix multiplication that leverages the
Chinese Remainder Theorem. The proposed method inherits the computational
efficiency of highly optimized GEMM routines and further enables control over
the number of matrix multiplications, which can enhance computational accuracy.
We present numerical experiments featuring INT8 Tensor Core operations on GPUs
and FP64 arithmetic on CPUs as case studies. The results demonstrate that FP64
emulation using the proposed method attains peak performance of 7.4 to 9.8
TFLOPS on the NVIDIA RTX 4090 and 56.6 to 80.2 TFLOPS on the NVIDIA GH200,
exceeding the measured performance of native FP64 arithmetic. Furthermore, for
FP64 computations on CPUs, the proposed method achieves a speedup of up to 2.3x
in emulating quadruple-precision arithmetic compared to the conventional Ozaki
scheme.
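
The role of the Chinese Remainder Theorem can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes integer input matrices, whereas the paper targets floating-point inputs, and it uses an illustrative set of pairwise coprime moduli no larger than 256. The names `MODULI`, `symmetric_residue`, and `crt_matmul` are hypothetical. Each residue matrix fits in INT8, one small integer matrix product is computed per modulus, and the exact product is then reconstructed from the residues, provided every entry of the true product is smaller in magnitude than half the product of the moduli.

```python
from math import prod

import numpy as np

# Illustrative moduli (an assumption): pairwise coprime, each at most 256.
MODULI = (251, 253, 255, 256)


def symmetric_residue(X, m):
    """Residues of X modulo m, shifted into a range that fits in INT8."""
    r = np.mod(X, m)  # values in [0, m)
    return np.where(r >= (m + 1) // 2, r - m, r).astype(np.int8)


def crt_matmul(A, B, moduli=MODULI):
    """Exact integer product A @ B from one small product per modulus.

    Valid when every entry of A @ B has magnitude below prod(moduli) / 2.
    """
    M = prod(moduli)
    acc = np.zeros((A.shape[0], B.shape[1]), dtype=np.int64)
    for m in moduli:
        Ar = symmetric_residue(A, m)
        Br = symmetric_residue(B, m)
        # INT8 inputs with INT32 accumulation, standing in for an INT8 GEMM.
        Cr = np.mod(Ar.astype(np.int32) @ Br.astype(np.int32), m).astype(np.int64)
        Mi = M // m
        weight = Mi * pow(Mi, -1, m)  # CRT weight: 1 mod m, 0 mod the other moduli
        acc = (acc + Cr * weight) % M
    # Map from [0, M) to the symmetric range so negative entries are recovered.
    return np.where(acc > M // 2, acc - M, acc)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.integers(-1000, 1001, size=(4, 5))
    B = rng.integers(-1000, 1001, size=(5, 3))
    print(np.array_equal(crt_matmul(A, B), A @ B))
```

In this sketch the number of matrix multiplications equals the number of moduli, and adding moduli enlarges the range of products that can be reconstructed exactly. This mirrors, in simplified form, the control over the number of matrix multiplications described above.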