Ponder & Press: Advancing Visual GUI Agent towards General Computer Control
Journal: arXiv
Published Date: Dec 2, 2024
Abstract
Most existing GUI agents depend on non-vision inputs such as HTML
source code or accessibility trees, limiting their flexibility across diverse
software environments and platforms. Current multimodal large language models
(MLLMs), which excel at using vision to ground real-world objects, offer a
potential alternative. However, they often struggle with accurately localizing
GUI elements -- a critical requirement for effective GUI automation -- due to
the semantic gap between real-world objects and GUI elements. In this work, we
introduce Ponder & Press, a divide-and-conquer framework for general computer
control using only visual input. Our approach combines a general-purpose MLLM
as an 'interpreter', responsible for translating high-level user instructions
into detailed action descriptions, with a GUI-specific MLLM as a 'locator' that
precisely locates GUI elements for action placement. By relying on purely
visual input, our agent offers a versatile, human-like interaction paradigm
applicable to a wide range of applications. The Ponder & Press locator outperforms
existing models by +22.5% on the ScreenSpot GUI grounding benchmark. Both
offline and interactive agent benchmarks across various GUI environments --
including web pages, desktop software, and mobile UIs -- demonstrate that
the Ponder & Press framework achieves state-of-the-art performance, highlighting
the potential of visual GUI agents. Project homepage:
https://invinciblewyq.github.io/ponder-press-page/
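
The interpreter/locator split described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `ActionDescription`, `GroundedAction`, `ponder_and_press`, `stub_interpreter`, and `stub_locator` are hypothetical, and the two stubs merely stand in for the general-purpose MLLM and the GUI-specific MLLM so that the control flow can be run without any model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ActionDescription:
    """Output of the interpreter stage (hypothetical structure)."""
    action: str       # e.g. "click" or "type"
    target: str       # natural-language description of the GUI element
    text: str = ""    # text to enter, if the action needs any


@dataclass
class GroundedAction:
    """An action paired with screen coordinates from the locator stage."""
    action: str
    x: int
    y: int
    text: str = ""


# Both stages see only the instruction or description plus the raw screenshot.
Interpreter = Callable[[str, bytes], List[ActionDescription]]
Locator = Callable[[str, bytes], Tuple[int, int]]


def ponder_and_press(instruction: str, screenshot: bytes,
                     interpreter: Interpreter,
                     locator: Locator) -> List[GroundedAction]:
    """Translate an instruction into steps, then ground each step on screen."""
    grounded = []
    for step in interpreter(instruction, screenshot):
        x, y = locator(step.target, screenshot)
        grounded.append(GroundedAction(step.action, x, y, step.text))
    return grounded


def stub_interpreter(instruction: str, screenshot: bytes) -> List[ActionDescription]:
    # Assumption: stands in for a general-purpose MLLM call.
    return [
        ActionDescription("click", "search box"),
        ActionDescription("type", "search box", instruction),
    ]


def stub_locator(target: str, screenshot: bytes) -> Tuple[int, int]:
    # Assumption: stands in for a GUI-specific grounding model.
    known_elements = {"search box": (320, 48)}
    return known_elements[target]


actions = ponder_and_press("weather today", b"", stub_interpreter, stub_locator)
```

Keeping the two stages behind plain callables reflects the divide-and-conquer idea: either model could be swapped without touching the other, and neither stage receives HTML or an accessibility tree.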