Parallel GPU-Enabled Algorithms for SpGEMM on Arbitrary Semirings with Hybrid Communication
Journal: arXiv
Published Date: Apr 8, 2025
Abstract
Sparse General Matrix Multiply (SpGEMM) is key for various High-Performance
Computing (HPC) applications such as genomics and graph analytics. Using the
semiring abstraction, many algorithms can be formulated as SpGEMM, allowing
redefinition of addition, multiplication, and numeric types. Today, large input
matrices require distributed-memory parallelism to avoid disk I/O, and modern
HPC machines with GPUs can greatly accelerate linear algebra computation. In
this paper, we implement a GPU-based distributed-memory SpGEMM routine on top
of the CombBLAS library. Our implementation achieves a speedup of over 2x
compared to the CPU-only CombBLAS implementation and up to 3x compared to PETSc
for large input matrices. Furthermore, we observe that the faster path for
communication between processes, direct host-to-host or device-to-device,
depends on the message size. To exploit this, we introduce a hybrid
communication scheme that dynamically switches data paths accordingly, thus
improving runtimes in communication-bound scenarios.
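
To illustrate the semiring abstraction mentioned in the abstract, the following is a minimal single-process sketch of a row-wise (Gustavson-style) SpGEMM in which the "addition" and "multiplication" operators are passed in as parameters. It is not the paper's GPU or CombBLAS implementation. The dict-of-dicts matrix layout, the function name `spgemm`, and the example matrices are assumptions made purely for illustration.

```python
from typing import Callable, Dict

# Hypothetical sparse layout for illustration: row index -> {column index: value}.
SparseMatrix = Dict[int, Dict[int, float]]


def spgemm(
    a: SparseMatrix,
    b: SparseMatrix,
    add: Callable[[float, float], float],
    mul: Callable[[float, float], float],
) -> SparseMatrix:
    """Compute C = A (x) B over a user-supplied semiring (add, mul).

    Only stored entries are combined, so the semiring's additive identity
    never has to be represented explicitly.
    """
    c: SparseMatrix = {}
    for i, a_row in a.items():
        acc: Dict[int, float] = {}  # sparse accumulator for row i of C
        for k, a_ik in a_row.items():
            # Row k of B contributes to row i of C, scaled by A[i, k].
            for j, b_kj in b.get(k, {}).items():
                prod = mul(a_ik, b_kj)
                acc[j] = add(acc[j], prod) if j in acc else prod
        if acc:
            c[i] = acc
    return c


# Small example matrices (made up for this sketch).
A = {0: {0: 1.0, 1: 2.0}, 1: {1: 3.0}}
B = {0: {1: 4.0}, 1: {0: 5.0, 1: 6.0}}

# Conventional (+, *) semiring: ordinary sparse matrix product.
plus_times = spgemm(A, B, add=lambda x, y: x + y, mul=lambda x, y: x * y)

# (min, +) semiring: one relaxation step of shortest-path lengths.
min_plus = spgemm(A, B, add=min, mul=lambda x, y: x + y)
```

Swapping only the two operators changes the computation from a numeric product to a shortest-path relaxation while the traversal of nonzeros stays the same, which is the property that lets many graph and genomics kernels be expressed as SpGEMM.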