Scaling On-Device GPU Inference for Large Generative Models
Journal: arXiv
Published Date: May 1, 2025
Abstract
Driven by advances in generative AI, large machine learning models
have revolutionized domains such as image processing, audio synthesis, and
speech recognition. While server-based deployments remain the locus of peak
performance, the imperative for on-device inference, necessitated by privacy
and efficiency considerations, persists. Recognizing GPUs as the on-device ML
accelerator with the widest reach, we present ML Drift, an optimized framework
that extends the capabilities of state-of-the-art GPU-accelerated inference
engines. ML Drift enables on-device execution of generative AI workloads that
contain 10x to 100x more parameters than existing on-device generative AI
models. ML Drift addresses intricate engineering challenges associated with
cross-GPU API development and ensures broad compatibility across mobile and
desktop/laptop platforms, thereby facilitating the deployment of significantly
more complex models on resource-constrained devices. Our GPU-accelerated ML/AI
inference engine achieves an order-of-magnitude performance improvement
relative to existing open-source GPU inference engines.