A Systematic Review of Open Datasets Used in Text-to-Image (T2I) Gen AI Model Safety
Journal: arXiv
Published Date: Feb 23, 2025
Abstract
Novel research aimed at text-to-image (T2I) generative AI safety often relies
on publicly available datasets for training and evaluation, making the quality
and composition of these datasets crucial. This paper presents a comprehensive
review of the key datasets used in T2I research, detailing their collection
methods, composition, the semantic and syntactic diversity of their prompts,
and the quality, coverage, and distribution of the harm types they contain. By
highlighting the strengths and limitations of each dataset, this study helps
researchers find the most relevant datasets for a given use case, critically
assess the downstream impacts of their work given the dataset distribution
(particularly regarding model safety and ethical considerations), and identify
gaps in dataset coverage and quality that future research may address.