FANVID: A Benchmark for Face and License Plate Recognition in Low-Resolution Videos
Journal: arXiv
Published Date: Jun 8, 2025
Abstract
Real-world surveillance often renders faces and license plates unrecognizable
in individual low-resolution (LR) frames, hindering reliable identification. To
advance temporal recognition models, we present FANVID, a novel video-based
benchmark comprising 1,463 LR clips (180 x 320, 20--60 FPS) featuring 63
identities and 49 license plates from three English-speaking countries. Each
video includes distractor faces and plates, increasing task difficulty and
realism. The dataset contains 31,096 manually verified bounding boxes and
labels.
FANVID defines two tasks: (1) face matching -- detecting LR faces and
matching them to high-resolution mugshots, and (2) license plate recognition --
extracting text from LR plates without a predefined database. Videos are
downsampled from high-resolution sources to ensure that faces and text are
indecipherable in single frames, requiring models to exploit temporal
information. We introduce evaluation metrics adapted from mean Average
Precision at IoU > 0.5, prioritizing identity correctness for faces and
character-level accuracy for text.
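
As an illustration of how such a metric might gate on localization before scoring recognition, the following is a minimal sketch, not FANVID's released evaluation code. The function names (`iou`, `levenshtein`, `char_accuracy`, `face_score`, `plate_score`), the `(x1, y1, x2, y2)` box format, and the use of normalized edit distance for character-level accuracy are all assumptions made for this example; the abstract does not specify the exact formulas.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def char_accuracy(pred, truth):
    """Assumed character-level accuracy: 1 - normalized edit distance."""
    longest = max(len(pred), len(truth))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(pred, truth) / longest


def face_score(pred_box, pred_id, true_box, true_id, thresh=0.5):
    """Hypothetical face score: 1 only if localized (IoU > thresh) and identity matches."""
    return 1.0 if iou(pred_box, true_box) > thresh and pred_id == true_id else 0.0


def plate_score(pred_box, pred_text, true_box, true_text, thresh=0.5):
    """Hypothetical plate score: character accuracy, counted only if IoU > thresh."""
    if iou(pred_box, true_box) <= thresh:
        return 0.0
    return char_accuracy(pred_text, true_text)
```

Under these assumptions, a well-localized plate read as "ABC128" against a ground truth of "ABC123" would receive partial credit for the five correct characters, whereas a face detection is scored all-or-nothing on identity.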
A baseline method combining pre-trained video super-resolution, detection, and
recognition models achieved scores of 0.58 (face matching) and 0.42 (plate
recognition), highlighting both the feasibility and challenge of the tasks.
FANVID's selection of faces and plates balances diversity with recognition
challenge. We release the software for data access, evaluation, baseline, and
annotation to support reproducibility and extension. FANVID aims to catalyze
innovation in temporal modeling for LR recognition, with applications in
surveillance, forensics, and autonomous vehicles.