Long-horizon Visual Instruction Generation with Logic and Attribute Self-reflection
Journal:
arXiv
Published Date:
Mar 12, 2025
Abstract
Visual instructions for long-horizon tasks are crucial as they intuitively
clarify complex concepts and enhance retention across extended steps. Directly
generating a series of images using text-to-image models without considering
the context of previous steps results in inconsistent images, increasing
cognitive load. Additionally, the generated images often omit objects or render
object attributes such as color, shape, and state inaccurately. To
address these challenges, we propose LIGER, the first training-free framework
for Long-horizon Instruction GEneration with logic and attribute
self-Reflection. LIGER first generates a draft image for each step with the
historical prompt and visual memory of previous steps. This step-by-step
generation approach maintains consistency between images in long-horizon tasks.
Moreover, LIGER utilizes various image editing tools to rectify errors
including wrong attributes, logic errors, object redundancy, and identity
inconsistency in the draft images. Through this self-reflection mechanism,
LIGER improves the logic and object attribute correctness of the images. To
verify whether the generated images assist human understanding, we manually
curated a new benchmark consisting of various long-horizon tasks.
Human-annotated ground truth expressions reflect the human-defined criteria for
how an image should appear to be illustrative. Experiments demonstrate that the
visual instructions generated by LIGER are more comprehensive than those of
baseline methods.
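
The two-stage loop described above (draft each step from the history, then repair the draft through self-reflection) can be sketched roughly as follows. This is only an illustration of the control flow, not LIGER's implementation: the names `Memory`, `generate_visual_instructions`, `draft_fn`, `detect_fn`, `editors`, and the `max_rounds` limit are all hypothetical, and the "images" are plain dictionaries standing in for the outputs of a real text-to-image model and image editing tools.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Memory:
    """History carried across steps (hypothetical structure)."""
    prompts: List[str] = field(default_factory=list)  # historical prompts
    images: List[Any] = field(default_factory=list)   # visual memory


def generate_visual_instructions(steps, draft_fn, detect_fn, editors, max_rounds=2):
    """Draft one image per step, then repair it before storing it in memory.

    draft_fn(step, prompts, images) -> image
    detect_fn(step, image, images)  -> list of error-type names
    editors[error_type](step, image, images) -> edited image
    max_rounds is an assumed cap on reflection passes, not from the paper.
    """
    memory = Memory()
    for step in steps:
        # Stage 1: the draft is conditioned on all previous prompts and images.
        image = draft_fn(step, memory.prompts, memory.images)
        # Stage 2: self-reflection, dispatching each detected error to an editor.
        for _ in range(max_rounds):
            errors = detect_fn(step, image, memory.images)
            if not errors:
                break
            for error in errors:
                image = editors[error](step, image, memory.images)
        memory.prompts.append(step)
        memory.images.append(image)
    return memory.images


# Toy stand-ins so the sketch runs without any image model.
def toy_draft(step, prompts, images):
    return {"step": step, "context": len(prompts), "edits": []}


def toy_detect(step, image, images):
    # Pretend every fresh draft has one attribute error until it is edited.
    return [] if "wrong_attribute" in image["edits"] else ["wrong_attribute"]


def toy_fix_attribute(step, image, images):
    return {**image, "edits": image["edits"] + ["wrong_attribute"]}


steps = ["crack the egg", "whisk the egg", "pour into the pan"]
result = generate_visual_instructions(
    steps, toy_draft, toy_detect, {"wrong_attribute": toy_fix_attribute}
)
```

In a real system the `editors` mapping would presumably hold one tool per error type named in the abstract (wrong attributes, logic errors, object redundancy, identity inconsistency). Only after repair is an image appended to memory, so later drafts are conditioned on corrected images rather than on raw drafts.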