Neglected Risks: The Disturbing Reality of Children's Images in Datasets and the Urgent Call for Accountability
Journal:
arXiv
Published Date:
Apr 20, 2025
Abstract
Including children's images in datasets has raised ethical concerns,
particularly regarding privacy, consent, data protection, and accountability.
These datasets, often built by scraping publicly available images from the
Internet, can expose children to risks such as exploitation, profiling, and
tracking. Despite the growing recognition of these issues, approaches for
addressing them remain limited. We explore the ethical implications of using
children's images in AI datasets and propose a pipeline to detect and remove
such images. As a use case, we built the pipeline on a Vision-Language Model
under the Visual Question Answering task and tested it on the #PraCegoVer
dataset. We also evaluate the pipeline on a subset of 100,000 images from the
Open Images V7 dataset to assess its effectiveness in detecting and removing
images of children. The pipeline serves as a baseline for future research,
providing a starting point for more comprehensive tools and methodologies.
While we leverage existing models trained on potentially problematic data, our
goal is to expose and address this issue. We do not advocate for training or
deploying such models, but instead call for urgent community reflection and
action to protect children's rights. Ultimately, we aim to encourage the
research community to exercise - more than an additional - care in creating new
datasets and to inspire the development of tools to protect the fundamental
rights of vulnerable groups, particularly children.