Self-Supervised Event Representations: Towards Accurate, Real-Time Perception on SoC FPGAs
Journal: arXiv
Published Date: May 12, 2025
Abstract
Event cameras offer significant advantages over traditional frame-based
sensors. These include microsecond temporal resolution, robustness under
varying lighting conditions and low power consumption. Nevertheless, the
effective processing of their sparse, asynchronous event streams remains
challenging. Existing approaches to this problem can be categorised into two
distinct groups. The first group involves the direct processing of event data
with neural models, such as Spiking Neural Networks or Graph Convolutional
Neural Networks. However, this approach often comes at the cost of task
accuracy. The second group involves the conversion of
events into dense representations with handcrafted aggregation functions, which
can boost accuracy at the cost of temporal fidelity. This paper introduces a
novel Self-Supervised Event Representation (SSER) method leveraging Gated
Recurrent Unit (GRU) networks to achieve precise per-pixel encoding of event
timestamps and polarities without temporal discretisation. The recurrent layers
are trained in a self-supervised manner to maximise the fidelity of event-time
encoding. Inference is performed on event representations generated
asynchronously, ensuring compatibility with high-throughput sensors. The
experimental validation demonstrates that SSER outperforms aggregation-based
baselines, achieving improvements of 2.4% mAP and 0.6% mAP on the Gen1 and
1 Mpx object detection datasets, respectively. Furthermore, the paper presents the first hardware
implementation of a recurrent representation for event data on a System-on-Chip
FPGA, achieving sub-microsecond latency and power consumption between 1 and 2 W,
suitable for real-time, power-efficient applications. Code is available at
https://github.com/vision-agh/RecRepEvent.
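
To make the idea of a per-pixel recurrent event encoding more concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes a single GRU cell shared by all pixels, an input vector made of the time elapsed since the previous event at that pixel plus the polarity, and randomly initialised (untrained) weights. The names `PixelGRU`, `encode_events`, and `time_scale` are hypothetical, and the self-supervised training step described in the abstract is omitted.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class PixelGRU:
    """Hypothetical minimal GRU cell shared by all pixels (untrained weights)."""

    def __init__(self, input_dim=2, hidden_dim=4, seed=0):
        rng = np.random.default_rng(seed)
        # Index 0: update gate z, 1: reset gate r, 2: candidate state n.
        self.W = rng.normal(0.0, 0.5, (3, hidden_dim, input_dim))
        self.U = rng.normal(0.0, 0.5, (3, hidden_dim, hidden_dim))
        self.b = np.zeros((3, hidden_dim))

    def step(self, x, h):
        """One GRU update for a single pixel: returns the new hidden state."""
        z = sigmoid(self.W[0] @ x + self.U[0] @ h + self.b[0])
        r = sigmoid(self.W[1] @ x + self.U[1] @ h + self.b[1])
        n = np.tanh(self.W[2] @ x + self.U[2] @ (r * h) + self.b[2])
        return (1.0 - z) * n + z * h


def encode_events(events, height, width, cell, time_scale=1e-3):
    """Update per-pixel hidden states one event at a time (no time binning).

    events: iterable of (t, x, y, p), with t assumed to be in microseconds
    and p the polarity (> 0 for ON, otherwise OFF).
    """
    hidden_dim = cell.b.shape[1]
    state = np.zeros((height, width, hidden_dim))
    last_t = np.zeros((height, width))  # assumption: every pixel starts at t = 0
    for t, x, y, p in events:
        dt = (t - last_t[y, x]) * time_scale  # elapsed time, rescaled to ms
        inp = np.array([dt, 1.0 if p > 0 else -1.0])
        state[y, x] = cell.step(inp, state[y, x])  # only this pixel changes
        last_t[y, x] = t
    return state


if __name__ == "__main__":
    demo_events = [(100, 1, 2, 1), (250, 1, 2, -1), (300, 0, 0, 1)]
    rep = encode_events(demo_events, height=4, width=3, cell=PixelGRU())
    print(rep.shape)
```

Because each event touches only the hidden state of its own pixel, the representation can be read out at any moment without waiting for a fixed time window to close, which is the property the abstract refers to as avoiding temporal discretisation.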