FaciaVox: A diverse multimodal biometric dataset of facial images and voice recordings.
Journal:
Data in Brief
Published Date:
Mar 21, 2025
Abstract
FaciaVox is a multimodal biometric dataset of face images and voice recordings captured under both masked and unmasked conditions. The name "FaciaVox" was chosen to be distinct and memorable while reflecting the dataset's multimodal character and its relevance to biometric recognition tasks. The dataset comprises contributions from 100 participants from 20 different countries, each providing 18 facial images and 60 audio recordings. The facial images are stored in JPG format and the audio recordings as WAV files, ensuring compatibility with standard processing tools. Participants are categorized by age into four groups: Group 1 includes individuals below 16 years of age; Group 2 those aged 16 to under 31; Group 3 those aged 31 to under 46; and Group 4 those aged 46 and above. Data collection took place in two environments: a professional soundproof studio and a conventional classroom. The studio provided a controlled setting, whereas the classroom introduced variables such as echo and sound reflections. Some participants were recorded in the studio and others in the classroom, as documented in the file 'FaciaVox list', which specifies the recording environment for each participant. Participants were positioned 70-100 cm from an iPhone's rear camera, and facial photos were captured at three zoom levels (1x, 3x, and 5x). Each participant contributed a total of 18 facial photos, six at each magnification level. The six images follow a sequence of conditions: no face mask, a disposable mask, a reusable mask, a dual-layer cloth mask, a silicone face shield worn together with the cloth mask, and finally the silicone face shield worn alone. Under the same six conditions, each participant was asked to speak ten sentences, alternating between English and Arabic; the speech was recorded with a Zoom H6 Handy Recorder. The FaciaVox dataset therefore offers an extensive range of study options involving face images and audio signals with and without face masks. It serves as a foundational resource for investigating a wide range of cutting-edge applications, including but not limited to multimodal biometrics, cross-domain biometric fusion, age and gender estimation, human-machine interaction, deep learning, speech intelligence, voice cloning, image inpainting, and security and surveillance.
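As an illustration of how the described structure yields the per-participant file counts (3 zoom levels x 6 mask conditions = 18 images; 10 sentences x 6 conditions = 60 recordings), the following is a minimal Python sketch. The directory layout, condition labels, and file-naming scheme shown here are assumptions for illustration only and are not specified in the abstract; the 'FaciaVox list' file and the dataset documentation define the actual organization.

```python
from itertools import product
from pathlib import Path

# Six capture conditions described in the abstract.
CONDITIONS = [
    "no_mask",
    "disposable_mask",
    "reusable_mask",
    "cloth_mask",
    "face_shield_with_cloth_mask",
    "face_shield_only",
]
ZOOM_LEVELS = ["1x", "3x", "5x"]   # iPhone rear-camera zoom settings
NUM_SENTENCES = 10                 # sentences spoken per condition (English/Arabic)

def expected_files(participant_id: str, root: Path = Path("FaciaVox")):
    """Enumerate hypothetical image and audio paths for one participant.

    The naming scheme below is an assumption for illustration; it is not
    taken from the dataset itself.
    """
    images = [
        root / participant_id / "images" / f"{condition}_{zoom}.jpg"
        for condition, zoom in product(CONDITIONS, ZOOM_LEVELS)
    ]
    audio = [
        root / participant_id / "audio" / f"{condition}_sentence{idx:02d}.wav"
        for condition in CONDITIONS
        for idx in range(1, NUM_SENTENCES + 1)
    ]
    return images, audio

if __name__ == "__main__":
    imgs, wavs = expected_files("P001")
    print(len(imgs))  # 18 facial images per participant
    print(len(wavs))  # 60 voice recordings per participant
```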