Deep Convolutional Neural Networks for large-scale speech tasks.

Journal: Neural Networks: The Official Journal of the International Neural Network Society
Published Date:

Abstract

Convolutional Neural Networks (CNNs) are an alternative type of neural network that can be used to reduce spectral variations and model spectral correlations that exist in signals. Since speech signals exhibit both of these properties, we hypothesize that CNNs are a more effective model for speech than Deep Neural Networks (DNNs). In this paper, we explore applying CNNs to large vocabulary continuous speech recognition (LVCSR) tasks. First, we determine the appropriate architecture to make CNNs effective compared to DNNs for LVCSR tasks. Specifically, we focus on how many convolutional layers are needed, what an appropriate number of hidden units is, and what the best pooling strategy is. Second, we investigate how to incorporate speaker-adapted features, which cannot directly be modeled by CNNs as they do not obey locality in frequency, into the CNN framework. Third, given the importance of sequence training for speech tasks, we introduce a strategy to use ReLU+dropout during Hessian-free sequence training of CNNs. Experiments on 3 LVCSR tasks indicate that a CNN with the proposed speaker-adapted and ReLU+dropout ideas allows for a 12%-14% relative improvement in word error rate (WER) over a strong DNN system, achieving state-of-the-art results in these 3 tasks.
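
As an illustration only, the building blocks named in the abstract (convolution over frequency, pooling, ReLU, dropout, and relative WER improvement) can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the architecture evaluated in the paper: the function names, the 40-band input, the 8 filters of width 9, the pooling size of 3, and the dropout rate of 0.5 are all hypothetical choices made for the example.

```python
import numpy as np


def conv1d_freq(x, w, b):
    """Valid 1-D convolution (cross-correlation) along the frequency axis.

    x: (n_freq,) spectral features for one frame (hypothetical input)
    w: (n_filters, width) filter weights, shared across all frequency positions
    b: (n_filters,) biases
    returns: (n_filters, n_freq - width + 1) feature maps
    """
    n_filters, width = w.shape
    n_out = x.shape[0] - width + 1
    # Collect every window of `width` adjacent frequency bands.
    windows = np.stack([x[i:i + width] for i in range(n_out)])  # (n_out, width)
    return w @ windows.T + b[:, None]


def relu(a):
    """Rectified linear unit: max(a, 0) elementwise."""
    return np.maximum(a, 0.0)


def max_pool_freq(a, size):
    """Non-overlapping max pooling along the frequency axis."""
    n_filters, n = a.shape
    n_keep = (n // size) * size  # drop any ragged tail
    return a[:, :n_keep].reshape(n_filters, n // size, size).max(axis=2)


def dropout(a, rate, rng):
    """Inverted dropout for 0 <= rate < 1: zero units with probability
    `rate` and rescale the survivors by 1 / (1 - rate)."""
    mask = rng.random(a.shape) >= rate
    return a * mask / (1.0 - rate)


def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction, e.g. 0.13 means 13% relative."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(40)            # assumed: 40 frequency bands
    w = rng.standard_normal((8, 9)) * 0.1  # assumed: 8 filters of width 9
    b = np.zeros(8)
    h = max_pool_freq(relu(conv1d_freq(x, w, b)), size=3)
    h = dropout(h, rate=0.5, rng=rng)      # applied at training time only
    print(h.shape)
```

Because the same filter weights are applied at every frequency position and pooling keeps only the maximum within each group of neighboring outputs, small shifts of a spectral pattern along the frequency axis change the pooled output very little. Note also that a relative improvement is measured against the baseline WER, so it is not the same as an absolute difference in percentage points.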

Authors

  • Tara N Sainath
    IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, United States. Electronic address: tsainath@google.com.
  • Brian Kingsbury
    IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, United States. Electronic address: bedk@us.ibm.com.
  • George Saon
    IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, United States. Electronic address: gsaon@us.ibm.com.
  • Hagen Soltau
    IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, United States. Electronic address: soltau@google.com.
  • Abdel-rahman Mohamed
    Department of Computer Science, University of Toronto, Canada. Electronic address: asamir@cs.toronto.edu.
  • George Dahl
    Department of Computer Science, University of Toronto, Canada. Electronic address: gdahl@cs.toronto.edu.
  • Bhuvana Ramabhadran
    IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, United States. Electronic address: bhuvana@us.ibm.com.