PrefillOnly: An Inference Engine for Prefill-only Workloads in Large Language Model Applications
Journal:
arXiv
Published Date:
May 12, 2025
Abstract
Besides typical generative applications, like ChatGPT, GitHub Copilot, and
Cursor, we observe an emerging trend that LLMs are increasingly used in
traditional discriminative tasks, such as recommendation, credit verification,
and data labeling. The key characteristic of these emerging use cases is that
the LLM generates only a single output token, rather than an arbitrarily long
sequence of tokens. We call this a prefill-only workload. However, since existing
LLM engines assume arbitrary output lengths, they fail to leverage the unique
properties of prefill-only workloads. In this paper, we present PrefillOnly,
the first LLM inference engine that improves the inference throughput and
latency by fully embracing the properties of prefill-only workloads. First,
since it generates only one token, PrefillOnly needs to store the KV cache
of only the last computed layer, rather than of all layers. This drastically
reduces the GPU memory footprint of LLM inference and allows handling long
inputs without using solutions that reduce throughput, such as cross-GPU KV
cache parallelization. Second, because the output length is fixed, rather than
arbitrary, PrefillOnly can precisely determine the job completion time (JCT) of
each prefill-only request before it starts. This enables efficient JCT-aware
scheduling policies such as shortest remaining job first. PrefillOnly can
process up to 4x more queries per second without inflating average and P99
latency.
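
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the PrefillOnly implementation. The model dimensions, the `kv_cache_bytes` helper, the `estimate_jct` cost model and its coefficients, and the non-preemptive shortest-job-first loop are all assumptions chosen for illustration. The sketch compares KV cache size for all layers against a single layer, and shows how a JCT known in advance lets a scheduler order requests by predicted run time.

```python
import heapq
from dataclasses import dataclass


def kv_cache_bytes(seq_len, num_layers, num_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """KV cache size in bytes: one K and one V tensor per layer.

    The default dimensions are hypothetical, not taken from the paper.
    Weights and activations are ignored.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


@dataclass(frozen=True)
class Request:
    request_id: str
    num_input_tokens: int


def estimate_jct(req, linear_coeff=1e-4, quadratic_coeff=1e-9):
    """Assumed cost model in seconds: a linear term plus a quadratic attention term.

    Because the output length is fixed at one token, the JCT depends only on the
    input length, so it can be computed before the request starts.
    """
    n = req.num_input_tokens
    return linear_coeff * n + quadratic_coeff * n * n


def shortest_job_first(requests):
    """Return requests ordered by predicted JCT (non-preemptive, all queued at once)."""
    # The index breaks ties so Request objects are never compared directly.
    heap = [(estimate_jct(r), i, r) for i, r in enumerate(requests)]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, req = heapq.heappop(heap)
        order.append(req)
    return order


def average_completion_time(order):
    """Average time at which each request finishes when run back to back."""
    clock = 0.0
    total = 0.0
    for req in order:
        clock += estimate_jct(req)
        total += clock
    return total / len(order)


if __name__ == "__main__":
    all_layers = kv_cache_bytes(seq_len=100_000, num_layers=32)
    one_layer = kv_cache_bytes(seq_len=100_000, num_layers=1)
    print(f"KV cache, all layers: {all_layers / 1e9:.2f} GB")
    print(f"KV cache, one layer:  {one_layer / 1e9:.2f} GB")

    queue = [Request("a", 8000), Request("b", 1000), Request("c", 4000)]
    print("FIFO avg JCT:", average_completion_time(queue))
    print("SJF avg JCT: ", average_completion_time(shortest_job_first(queue)))
```

Under these assumptions the single-layer cache is smaller by a factor equal to the layer count, and ordering by predicted JCT lowers the average completion time relative to arrival order. A real engine would also need to handle requests arriving over time and guard against starvation of long requests.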