Augmenting Image Annotation: A Human-LMM Collaborative Framework for Efficient Object Selection and Label Generation
Journal: arXiv
Published Date: Mar 14, 2025
Abstract
Traditional image annotation relies heavily on human effort for object
selection and label assignment, which makes the process time-consuming and
increasingly inefficient as annotators tire over long sessions.
This paper introduces a novel framework that leverages the visual understanding
capabilities of large multimodal models (LMMs), particularly GPT, to assist
annotation workflows. In our proposed approach, human annotators focus on
selecting objects via bounding boxes, while the LMM autonomously generates
relevant labels. This human-AI collaborative framework enhances annotation
efficiency by reducing the cognitive and time burden on human annotators. By
analyzing the system's performance across various types of annotation tasks, we
demonstrate its ability to generalize to tasks such as object recognition,
scene description, and fine-grained categorization. The framework shows the
potential to redefine annotation workflows, offering a scalable and efficient
solution for large-scale data labeling in computer vision. Finally, we discuss
how integrating LMMs into the annotation
pipeline can advance bidirectional human-AI alignment, as well as the
challenges of alleviating the "endless annotation" burden in the face of
information overload by shifting some of the work to AI.
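
The division of labor described in the abstract (a human draws bounding boxes, a model supplies the labels) can be illustrated with a minimal sketch. This is not the authors' implementation. The names `BoundingBox`, `Annotation`, `crop`, `annotate`, and `stub_labeler` are hypothetical, the image is a toy grid of grayscale values, and the labeler is a stand-in for an LMM call that a real pipeline would replace with a request to a multimodal model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Toy grayscale image: a list of rows, each a list of pixel values (0-255).
Image = List[List[int]]


@dataclass(frozen=True)
class BoundingBox:
    """Human-drawn box (hypothetical structure, not from the paper)."""
    x: int       # left column, inclusive
    y: int       # top row, inclusive
    width: int
    height: int


@dataclass(frozen=True)
class Annotation:
    box: BoundingBox
    label: str


def crop(image: Image, box: BoundingBox) -> Image:
    """Return the pixels inside the box, clipped to the image bounds."""
    y0, y1 = max(box.y, 0), max(box.y + box.height, 0)
    x0, x1 = max(box.x, 0), max(box.x + box.width, 0)
    return [row[x0:x1] for row in image[y0:y1]]


def annotate(
    image: Image,
    boxes: Sequence[BoundingBox],
    label_fn: Callable[[Image], str],
) -> List[Annotation]:
    """Pair each human-selected box with a label produced by label_fn.

    label_fn stands in for the LMM: it receives the cropped region and
    returns a text label.
    """
    return [Annotation(box, label_fn(crop(image, box))) for box in boxes]


def stub_labeler(region: Image) -> str:
    """Placeholder for an LMM call (assumption): labels by mean brightness."""
    pixels = [value for row in region for value in row]
    if not pixels:
        return "empty region"
    mean = sum(pixels) / len(pixels)
    return "bright object" if mean > 127 else "dark object"


if __name__ == "__main__":
    # 4x4 image: left half dark, right half bright.
    image = [[0, 0, 255, 255] for _ in range(4)]
    boxes = [BoundingBox(0, 0, 2, 2), BoundingBox(2, 0, 2, 2)]
    for annotation in annotate(image, boxes, stub_labeler):
        print(annotation.box, "->", annotation.label)
```

Passing the labeler as a function keeps the human step (box selection) and the model step (label generation) separate, so the stub can be swapped for a real model client without touching the rest of the loop.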