Word level confidence measurement using semantic features
Ruhi Sarikaya, Yuqing Gao, et al.
ICASSP 2003
Deep learning methodologies have had a major impact on performance across a wide variety of machine learning tasks, and speech recognition is no exception. We describe a set of deep learning techniques that proved particularly successful in reducing word error rate on a popular large-vocabulary conversational speech recognition benchmark ("Switchboard"). We found that the best performance is achieved by combining features from both recurrent and convolutional neural networks. We compare two recurrent architectures: partially unfolded nets with maxout activations and bidirectional long short-term memory nets. In addition, inspired by the success of convolutional networks for image classification, we designed a convolutional net with many convolutional layers and small kernels, which builds up a large receptive field with more nonlinearity and fewer parameters than standard configurations. When combined, these neural networks achieve a word error rate of 6.2% on this difficult task; this was the best reported rate at the time of writing and is even more remarkable given that human performance itself is estimated to be 4% on this data.
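The abstract's claim about small kernels follows the well-known stacking argument: two 3x3 convolutions cover the same 5x5 receptive field as a single 5x5 convolution, with fewer weights and an extra nonlinearity between them. A minimal arithmetic sketch (the channel width `C` is an illustrative assumption, not a figure from the paper):

```python
def conv_params(kernel, in_ch, out_ch):
    """Weight count of a square 2-D convolution (bias terms ignored)."""
    return kernel * kernel * in_ch * out_ch

C = 64  # hypothetical channel width, chosen only for illustration

# Two stacked 3x3 layers: same 5x5 receptive field, one extra nonlinearity.
stacked = 2 * conv_params(3, C, C)  # 2 * 9 * C^2 = 18 * C^2
# One 5x5 layer covering the same receptive field in a single step.
single = conv_params(5, C, C)       # 25 * C^2

print(stacked, single)  # 73728 102400 -> stacked uses ~28% fewer parameters
```

The same arithmetic extends to deeper stacks: n layers of 3x3 kernels cover a (2n+1)x(2n+1) receptive field at cost 9n·C², versus (2n+1)²·C² for one large kernel.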