Bullet: Boosting GPU Utilization for LLM Serving via Dynamic Spatial-Temporal Orchestration
Journal: arXiv
Published Date: Apr 28, 2025
Abstract
Modern LLM serving systems confront inefficient GPU utilization due to the
fundamental mismatch between compute-intensive prefill and memory-bound decode
phases. While current practices attempt to address this by organizing these
phases into hybrid batches, such solutions create an inefficient tradeoff that
sacrifices either throughput or latency, leaving substantial GPU resources
underutilized. We identify two key root causes: 1) the prefill phase suffers
from suboptimal compute utilization due to wave quantization and attention
bottlenecks; and 2) hybrid batches disproportionately prioritize latency over
throughput, resulting in wasted compute and memory bandwidth. To mitigate these
issues, we present Bullet, a novel spatial-temporal orchestration system that
eliminates these inefficiencies through precise phase coordination. Bullet
enables concurrent execution of prefill and decode phases, while dynamically
provisioning GPU resources using real-time performance modeling. By integrating
SLO-aware scheduling and adaptive resource allocation, Bullet maximizes
utilization without compromising latency targets. Experimental evaluations on
real-world workloads demonstrate that Bullet delivers an average throughput
gain of 1.26x (up to 1.55x) over state-of-the-art systems, while consistently
meeting latency constraints.
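
The wave quantization effect named in the abstract can be illustrated with a small calculation. The sketch below is not taken from the paper. It assumes a simplified model in which a kernel's thread blocks are scheduled onto streaming multiprocessors (SMs) in full "waves", so a partially filled final wave leaves SMs idle. The function name `wave_efficiency`, its parameters, and the 108-SM figure (the SM count of an NVIDIA A100, chosen only for illustration) are assumptions.

```python
import math


def wave_efficiency(num_blocks: int, num_sms: int, blocks_per_sm: int = 1) -> float:
    """Return the fraction of SM slots doing useful work across all waves.

    Simplified model (an assumption, not the paper's): every wave takes the
    same time, and a wave can hold num_sms * blocks_per_sm thread blocks.
    """
    if num_blocks <= 0 or num_sms <= 0 or blocks_per_sm <= 0:
        raise ValueError("all arguments must be positive")
    slots_per_wave = num_sms * blocks_per_sm
    # The last wave is charged in full even if only partly occupied.
    waves = math.ceil(num_blocks / slots_per_wave)
    return num_blocks / (waves * slots_per_wave)


if __name__ == "__main__":
    # Hypothetical block counts around one and two full waves on 108 SMs.
    for blocks in (108, 109, 162, 216):
        print(blocks, round(wave_efficiency(blocks, 108), 3))
```

Under this model, a kernel with exactly 108 blocks fills one wave completely, while a kernel with 109 blocks needs a second wave that runs a single block, so roughly half of the scheduled SM time is idle. Gaps of this kind in the prefill phase are the sort of slack that running decode work concurrently could, in principle, fill.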