Audio-to-Image Encoding for Improved Voice Characteristic Detection Using Deep Convolutional Neural Networks

Journal: arXiv

Published Date: Mar 7, 2025

Abstract

This paper introduces a novel audio-to-image encoding framework that integrates multiple dimensions of voice characteristics into a single RGB image for speaker recognition. In this method, the green channel encodes raw audio data, the red channel embeds statistical descriptors of the voice signal (including key metrics such as median and mean values for fundamental frequency, spectral centroid, bandwidth, rolloff, zero-crossing rate, MFCCs, RMS energy, spectral flatness, spectral contrast, chroma, and harmonic-to-noise ratio), and the blue channel comprises subframes representing these features in a spatially organized format. A deep convolutional neural network trained on these composite images achieves 98% accuracy in speaker classification across two speakers, suggesting that this integrated multi-channel representation can provide a more discriminative input for voice recognition tasks.
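The abstract gives each channel a role but not the exact pixel layout, so the following is only a minimal sketch of how such a composite image might be assembled. Everything specific here is an assumption for illustration and not the author's implementation: the function names, the 64x64 image size, the 256-sample frame length, the input range of [-1, 1], and the use of only two of the listed descriptors (RMS energy and zero-crossing rate).

```python
import numpy as np


def unit_to_uint8(values):
    """Map values assumed to lie in [0, 1] onto 0..255 (out-of-range values are clipped)."""
    return np.round(255 * np.clip(values, 0.0, 1.0)).astype(np.uint8)


def fit_to_grid(values, rows, cols):
    """Truncate or zero-pad a flat array to rows*cols entries and reshape it."""
    flat = np.zeros(rows * cols, dtype=np.uint8)
    n = min(len(values), rows * cols)
    flat[:n] = values[:n]
    return flat.reshape(rows, cols)


def encode_audio_as_rgb(signal, size=64, frame_len=256):
    """Hypothetical encoder: red = summary statistics, green = raw audio, blue = per-frame features."""
    signal = np.asarray(signal, dtype=np.float64)  # assumed to be scaled to [-1, 1]
    n_frames = len(signal) // frame_len
    if n_frames == 0:
        raise ValueError("signal is shorter than one frame")

    # Non-overlapping frames; the incomplete tail is dropped.
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)

    # Two illustrative per-frame features, both bounded by [0, 1] for input in [-1, 1].
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    zcr = np.mean(np.diff(np.sign(frames), axis=1) != 0, axis=1)

    # Red channel: mean and median of each feature, repeated to fill the image.
    stats = unit_to_uint8(np.array([rms.mean(), np.median(rms), zcr.mean(), np.median(zcr)]))
    red = np.resize(stats, (size, size))

    # Green channel: raw samples shifted from [-1, 1] to [0, 1], one sample per pixel.
    green = fit_to_grid(unit_to_uint8((signal + 1.0) / 2.0), size, size)

    # Blue channel: one subframe per feature (RMS in the top half, ZCR in the bottom half).
    half = size // 2
    blue = np.vstack([
        fit_to_grid(unit_to_uint8(rms), half, size),
        fit_to_grid(unit_to_uint8(zcr), size - half, size),
    ])

    return np.stack([red, green, blue], axis=-1)
```

Calling `encode_audio_as_rgb` on a one-second mono clip returns a `(64, 64, 3)` array of type `uint8` that could be saved as an image or passed to a convolutional network. A fuller version would add the remaining descriptors from the abstract (fundamental frequency, spectral centroid, MFCCs, and so on) as extra statistics in the red channel and extra subframes in the blue channel.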

Authors

Youness Atif

External Resources

View on arXiv (http://arxiv.org/abs/2503.05929v1)
