Exploring Mutual Cross-Modal Attention for Context-Aware Human Affordance Generation
Journal: arXiv
Published Date: Feb 19, 2025
Abstract
Human affordance learning investigates contextually relevant novel pose
prediction such that the estimated pose represents a valid human action within
the scene. While the task is fundamental to machine perception and automated
interactive navigation agents, the exponentially large number of probable pose
and action variations makes the problem challenging and non-trivial. Moreover,
existing datasets and methods for human affordance prediction in 2D scenes
are significantly limited. In this paper, we propose a novel
cross-attention mechanism to encode the scene context for affordance prediction
by mutually attending to spatial feature maps from two different modalities. The
proposed method is disentangled into individual subtasks to reduce
the problem complexity. First, we sample a probable location for a person
within the scene using a variational autoencoder (VAE) conditioned on the
global scene context encoding. Next, we predict a potential pose template from
a set of existing human pose candidates using a classifier on the local context
encoding around the predicted location. In the subsequent steps, we use two
VAEs to sample the scale and deformation parameters for the predicted pose
template by conditioning on the local context and template class. Our
experiments show significant improvements over the previous baseline for human
affordance injection into complex 2D scenes.
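
To make the idea of mutually attending two spatial feature maps concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes standard scaled dot-product attention, two feature maps of identical spatial size, and fusion by channel-wise concatenation. The abstract does not name the two modalities, so `feat_a` and `feat_b` are placeholders, and all function names, shapes, and the random projection weights are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(query_map, context_map, w_q, w_k, w_v):
    """Scaled dot-product attention where one map queries the other.

    query_map:   (N, C) flattened spatial features of the querying modality
    context_map: (M, C) flattened spatial features of the attended modality
    Returns the attended features (N, D) and the attention weights (N, M).
    """
    q = query_map @ w_q
    k = context_map @ w_k
    v = context_map @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return weights @ v, weights


def mutual_cross_attention(feat_a, feat_b, params_ab, params_ba):
    """Attend A -> B and B -> A, then fuse per spatial position.

    Assumption: both maps have shape (H, W, C); concatenation is one
    possible fusion choice and is not taken from the paper.
    """
    h, w, c = feat_a.shape
    a = feat_a.reshape(h * w, c)
    b = feat_b.reshape(h * w, c)
    a_ctx, _ = cross_attend(a, b, *params_ab)  # A queries B
    b_ctx, _ = cross_attend(b, a, *params_ba)  # B queries A
    fused = np.concatenate([a_ctx, b_ctx], axis=-1)
    return fused.reshape(h, w, -1)


# Toy usage with random features and random projection weights.
rng = np.random.default_rng(0)
H, W, C, D = 4, 4, 8, 16
feat_a = rng.standard_normal((H, W, C))
feat_b = rng.standard_normal((H, W, C))
params_ab = tuple(rng.standard_normal((C, D)) for _ in range(3))
params_ba = tuple(rng.standard_normal((C, D)) for _ in range(3))

fused = mutual_cross_attention(feat_a, feat_b, params_ab, params_ba)
print(fused.shape)  # (4, 4, 32)
```

In a pipeline like the one described above, an encoding of this kind could serve as the scene context on which the location VAE, the pose-template classifier, and the scale and deformation VAEs are conditioned; how the paper actually wires these components together is not specified in the abstract.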