Image-grounded encoding models reveal distinct temporal profiles of naturalistic object and scene processing in the human brain
Journal: bioRxiv
Published Date: Apr 24, 2026
Abstract
The human brain processes large amounts of complex visual information in order to effectively interact with its environment. It is well established that our visual system has specialized regions to process incoming information efficiently, such as scene-, face-, and object-selective areas, which can be uncovered using functional magnetic resonance imaging (fMRI). However, mapping of specialized visual processing is typically done with experimental stimuli in which visual information is artificially separated (e.g., scene backgrounds vs. isolated object cutouts), and fMRI signals are insensitive to potential fine-grained temporal differences in visual information processing. Here, we identify temporal signatures of neural object and scene processing in real-world visual environments by building image-grounded brain-predictive encoding models of human electroencephalography (EEG) responses. In a large set of high-resolution natural images, we separate object from scene information on a per-image basis, and then feed this information to separate deep neural network-based encoding models to predict EEG responses to intact natural images. We find that encoding models that receive only object information consistently exhibit a delayed temporal encoding profile compared to models that only receive scene information. Control analyses confirm the robustness of this delayed object encoding, showing that consistent selection of object or scene information is needed to achieve high encoding performance. Using these distinct encoding profiles as templates, we identify the typicality of individual object classes and scene elements and determine how they are represented in human EEG recordings. Overall, our results show that temporally resolved recording during intact natural image viewing allows us to delineate distinct temporal profiles of the processing of specific visual elements in complex real-world environments.
These findings demonstrate that image-grounded encoding models are a powerful tool for isolating components of naturalistic perception.
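
To make the idea of a temporal encoding profile concrete, the following is a minimal sketch of a time-resolved encoding analysis on synthetic data. It is not the authors' pipeline: the abstract describes deep neural network-based encoding models, whereas this sketch swaps in closed-form ridge regression on random stand-in features. All names (`ridge_fit`, `encoding_profile`, `X_scene`, `X_object`), the simulated latencies, and the noise level are assumptions chosen for illustration only.

```python
import numpy as np

def ridge_fit(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_feat = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_feat), X.T @ Y)

def timecourse_correlation(Y_true, Y_pred):
    """Pearson r between measured and predicted responses at each time point."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / denom

def encoding_profile(X_train, Y_train, X_test, Y_test, alpha=1.0):
    """Fit on training images, score held-out images per time point."""
    W = ridge_fit(X_train, Y_train, alpha)
    return timecourse_correlation(Y_test, X_test @ W)

# --- Synthetic stand-in data (hypothetical, for illustration only) ---
rng = np.random.default_rng(0)
n_train, n_test, n_feat, n_time = 400, 100, 8, 60
n_images = n_train + n_test
t = np.arange(n_time)

# Assumed ground truth: scene information drives the signal early,
# object information drives it later.
scene_gain = np.exp(-0.5 * ((t - 15) / 4.0) ** 2)
object_gain = np.exp(-0.5 * ((t - 30) / 4.0) ** 2)

X_scene = rng.standard_normal((n_images, n_feat))   # scene-only features
X_object = rng.standard_normal((n_images, n_feat))  # object-only features
w_scene = rng.standard_normal(n_feat)
w_object = rng.standard_normal(n_feat)

# Simulated single-channel EEG response to the intact image: images x time.
eeg = (
    np.outer(X_scene @ w_scene, scene_gain)
    + np.outer(X_object @ w_object, object_gain)
    + 0.5 * rng.standard_normal((n_images, n_time))
)

train, test = slice(0, n_train), slice(n_train, n_images)

# One encoding model per information source, both predicting the same EEG.
scene_profile = encoding_profile(X_scene[train], eeg[train], X_scene[test], eeg[test])
object_profile = encoding_profile(X_object[train], eeg[train], X_object[test], eeg[test])

scene_peak = int(np.argmax(scene_profile))
object_peak = int(np.argmax(object_profile))
print("scene-model peak (sample index):", scene_peak)
print("object-model peak (sample index):", object_peak)
```

Both models predict the same responses to intact images; only their input features differ. Comparing the time points at which each model's held-out correlation peaks is one simple way to read out a relative delay between the two profiles.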