Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy
Journal: arXiv
Published Date: Feb 7, 2025
Abstract
We introduce Long-VITA, a simple yet effective large multi-modal model for
long-context visual-language understanding tasks. It can concurrently process
and analyze image, video, and text inputs of over 4K frames or 1M tokens
while delivering strong performance on short-context multi-modal
tasks. We propose an effective multi-modal training schema that starts with
large language models and proceeds through vision-language alignment, general
knowledge learning, and two sequential stages of long-sequence fine-tuning. We
further implement context-parallelism distributed inference and a logits-masked
language modeling head to scale Long-VITA to infinitely long inputs of images
and texts during model inference. Long-VITA is trained on a mix of 17M samples
drawn only from public datasets and demonstrates state-of-the-art
performance on various multi-modal benchmarks compared with recent
cutting-edge models trained on internal data. Long-VITA is fully
reproducible and supports both NPU and GPU platforms for training and testing.
With these inference designs, Long-VITA models achieve a 2x prefill speedup
and a 4x context length extension on a single node with 8 GPUs. We
hope Long-VITA can serve as a competitive baseline and offer valuable insights
for the open-source community in advancing long-context multi-modal
understanding.
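
The abstract does not spell out how the logits-masked language modeling head works. One plausible reading, offered here only as an assumption, is that during prefill the head projects just the final position's hidden state instead of every position, since only that position is needed to pick the next token. The pure-Python sketch below illustrates that idea and the memory it would save; the function names, the toy weights, and the 150,000-entry vocabulary are hypothetical and are not taken from the Long-VITA code.

```python
def lm_head(hidden, weight):
    """Project hidden states [positions][d] through weight [vocab][d].

    Returns logits with shape [positions][vocab].
    """
    return [
        [sum(h * w for h, w in zip(state, w_row)) for w_row in weight]
        for state in hidden
    ]


def masked_lm_head(hidden, weight):
    """Assumed 'logits-masked' variant: keep only the last position before projecting."""
    return lm_head(hidden[-1:], weight)


def logits_bytes(num_positions, vocab_size, bytes_per_value=2):
    """Memory needed to hold a logits tensor (2 bytes per value for 16-bit floats)."""
    return num_positions * vocab_size * bytes_per_value


# Toy example: 3 positions, hidden size 2, vocabulary of 4 tokens.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

full = lm_head(hidden, weight)           # logits for every position
last = masked_lm_head(hidden, weight)    # logits for the final position only

# Hypothetical scale: 1M-token prefill with a 150,000-entry vocabulary.
full_cost = logits_bytes(1_000_000, 150_000)
masked_cost = logits_bytes(1, 150_000)
print(full_cost // 10**9, "GB vs", masked_cost // 10**3, "KB")
```

Under these assumed sizes, the full logits tensor would need about 300 GB, while the single-position version needs about 300 KB, which suggests why trimming the head's input matters at 1M-token context lengths.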