Andrew W. Senior
Andrew Senior received his PhD from the University of Cambridge for his thesis "Offline cursive handwriting recognition with recurrent neural networks", having previously worked on speech recognition at LIMSI at the University of Paris XI. He joined IBM Research in 1994, where he worked in the areas of handwriting, audio-visual speech, face and fingerprint recognition, as well as video privacy and visual tracking.
In 2008 he taught at Columbia University before joining Google Research, where he worked on speech recognition with deep neural networks and recurrent neural networks. He has coauthored "Guide to Biometrics" and over seventy scientific papers, and holds forty-six patents. His research interests range across deep learning, speech, computer vision and visual art.
Authored Publications
Raw Multichannel Processing Using Deep Neural Networks
Kean Chin
Chanwoo Kim
New Era for Robust Speech Recognition: Exploiting Deep Learning, Springer (2017)
Multichannel ASR systems commonly separate speech enhancement, including localization, beamforming and postfiltering, from acoustic modeling. In this chapter, we perform multi-channel enhancement jointly with acoustic modeling in a deep neural network framework. Inspired by beamforming, which leverages differences in the fine time structure of the signal at different microphones to filter energy arriving from different directions, we explore modeling the raw time-domain waveform directly. We introduce a neural network architecture which performs multichannel filtering in the first layer of the network and show that this network learns to be robust to varying target speaker direction of arrival, performing as well as a model that is given oracle knowledge of the true target speaker direction. Next, we show how performance can be improved by factoring the first layer to separate the multichannel spatial filtering operation from a single channel filterbank which computes a frequency decomposition. We also introduce an adaptive variant, which updates the spatial filter coefficients at each time frame based on the previous inputs. Finally we demonstrate that these approaches can be implemented more efficiently in the frequency domain. Overall, we find that such multichannel neural networks give a relative word error rate improvement of more than 5% compared to a traditional beamforming-based multichannel ASR system and more than 10% compared to a single channel waveform model.
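As a rough illustration of the multichannel filtering that the abstract says the first layer performs, the sketch below applies one FIR filter per microphone channel and sums the results (classic filter-and-sum). It is not the architecture from the chapter: there the filter coefficients are learned jointly with the acoustic model, whereas here they are fixed by hand. The function name `filter_and_sum`, the toy signals, and the filter values are all assumptions made for this example.

```python
import numpy as np


def filter_and_sum(x, h):
    """Filter each channel with its own FIR filter and sum the outputs.

    x: array of shape (channels, samples), the raw multichannel waveform.
    h: array of shape (channels, taps), one FIR filter per channel
       (hand-picked here; a network would learn these).
    """
    num_channels, num_samples = x.shape
    y = np.zeros(num_samples + h.shape[1] - 1)
    for c in range(num_channels):
        y += np.convolve(x[c], h[c])  # full linear convolution per channel
    return y


# Toy two-microphone input: the same pulse reaches mic 1 one sample after mic 0.
pulse = np.array([0.0, 1.0, 0.0, 0.0])
x = np.stack([pulse, np.roll(pulse, 1)])

# Hypothetical filters: delay channel 0 by one sample, pass channel 1 through,
# so the two copies of the pulse line up before summation (delay-and-sum).
h = np.array([[0.0, 1.0],
              [1.0, 0.0]])

y = filter_and_sum(x, h)
print(y)
```

With these hand-chosen filters the two pulses add coherently at a single sample, which is the time-alignment effect that beamforming exploits.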
Multichannel Signal Processing with Deep Neural Networks for Automatic Speech Recognition
Kean Chin
Chanwoo Kim
IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25 (2017), pp. 965-979
Multichannel ASR systems commonly separate speech enhancement, including localization, beamforming and postfiltering, from acoustic modeling. In this paper, we perform multichannel enhancement jointly with acoustic modeling in a deep neural network framework. Inspired by beamforming, which leverages differences in the fine time structure of the signal at different microphones to filter energy arriving from different directions, we explore modeling the raw time-domain waveform directly. We introduce a neural network architecture which performs multichannel filtering in the first layer of the network and show that this network learns to be robust to varying target speaker direction of arrival, performing as well as a model that is given oracle knowledge of the true target speaker direction. Next, we show how performance can be improved by factoring the first layer to separate the multichannel spatial filtering operation from a single channel filterbank which computes a frequency decomposition. We also introduce an adaptive variant, which updates the spatial filter coefficients at each time frame based on the previous inputs. Finally, we demonstrate that these approaches can be implemented more efficiently in the frequency domain. Overall, we find that such multichannel neural networks give a relative word error rate improvement of more than 5% compared to a traditional beamforming-based multichannel ASR system and more than 10% compared to a single channel waveform model.
WaveNet: A Generative Model for Raw Audio
Aäron van den Oord
Sander Dieleman
Karen Simonyan
Alexander Graves
Nal Kalchbrenner
Koray Kavukcuoglu
arXiv (2016)
This paper introduces WaveNet, a deep generative neural network trained end-to-end to model raw audio waveforms, which can be applied to text-to-speech and music generation. Current approaches to text-to-speech are focused on non-parametric, example-based generation (which stitches together short audio signal segments from a large training set), and parametric, model-based generation (in which a model generates acoustic features synthesized into a waveform with a vocoder). In contrast, we show that directly generating wideband audio signals at tens of thousands of samples per second is not only feasible, but also achieves results that significantly outperform the prior art. A single trained WaveNet can be used to generate different voices by conditioning on the speaker identity. We also show that the same approach can be used for music audio generation and speech recognition.
Preview abstract
We present a new procedure to train acoustic models from scratch for large vocabulary speech recognition, requiring no previous model for alignments or bootstrapping. We augment the Connectionist Temporal Classification (CTC) objective function to allow training of acoustic models directly from a parallel corpus of audio data and transcribed data. With this augmented CTC function we train a phoneme recognition acoustic model directly from the written-domain transcript. Further, we outline a mechanism to generate context-dependent phonemes from a CTC model trained to predict phonemes, and ultimately train a second CTC model to predict these context-dependent phonemes. Since this approach does not require training of any previous non-CTC model, it drastically reduces the overall data-to-model training time from 30 days to 10 days. Additionally, models obtained from this flatstart-CTC procedure outperform the state-of-the-art by XX-XX%.
Learning acoustic frame labeling for speech recognition with recurrent neural networks
Ozan Irsoy
Alex Graves
Françoise Beaufays
Johan Schalkwyk
ICASSP (2015), pp. 4280-4284
Both Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) have shown improvements over Deep Neural Networks (DNNs) across a wide variety of speech recognition tasks. CNNs, LSTMs and DNNs are complementary in their modeling capabilities, as CNNs are good at reducing frequency variations, LSTMs are good at temporal modeling, and DNNs are appropriate for mapping features to a more separable space. In this paper, we take advantage of the complementarity of CNNs, LSTMs and DNNs by combining them into one unified architecture. We explore the proposed architecture, which we call CLDNN, on a variety of large vocabulary tasks, varying from 200 to 2,000 hours. We find that the CLDNN provides a 4-6% relative improvement in WER over an LSTM, the strongest of the three individual models.
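For readers unfamiliar with the metric, a relative improvement in WER is the reduction in word error rate expressed as a fraction of the baseline WER, not a difference in absolute percentage points. The short sketch below works one example; the WER values and the helper name `relative_wer_improvement` are hypothetical and are not taken from any of the papers listed here.

```python
def relative_wer_improvement(baseline_wer, new_wer):
    """Return the relative WER reduction as a fraction of the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers for illustration only: a baseline at 10.0% WER
# and a new model at 9.5% WER.
baseline_wer = 10.0
new_wer = 9.5

rel = relative_wer_improvement(baseline_wer, new_wer)
print(f"absolute: {baseline_wer - new_wer:.1f} points, relative: {rel:.1%}")
```

In this made-up case a drop of 0.5 absolute points corresponds to a 5% relative improvement.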
Preview abstract
This paper describes a series of experiments to extend the application of Context-Dependent (CD) long short-term memory (LSTM) recurrent neural networks (RNNs) trained with Connectionist Temporal Classification (CTC) and sMBR loss. Our experiments, on a noisy, reverberant voice search task, include training with alternative pronunciations, application to child speech recognition, combination of multiple models, and convolutional input layers. We also investigate the latency of CTC models and show that constraining forward-backward alignment in training can reduce the delay for a real-time streaming speech recognition system. Finally, we investigate transferring knowledge from one network to another through alignments.
Large Vocabulary Automatic Speech Recognition for Children
Melissa Carroll
Noah Coccaro
Qi-Ming Jiang
Interspeech (2015)
Recently, Google launched YouTube Kids, a mobile application for children that uses a speech recognizer built specifically for recognizing children's speech. In this paper we present techniques we explored to build such a system. We describe the use of a neural network classifier to identify matched acoustic training data, and the filtering of data for language modeling to reduce the chance of producing offensive results. We also compare long short-term memory (LSTM) recurrent networks to convolutional, LSTM, deep neural networks (CLDNNs). We found that a CLDNN acoustic model outperforms an LSTM across a variety of different conditions, but does not specifically model child speech relatively better than adult speech. Overall, these findings allow us to build a successful, state-of-the-art large vocabulary speech recognizer for both children and adults.