Michiel Bacchiani
Authored Publications
LibriTTS-R: Restoration of a Large-Scale Multi-Speaker TTS Corpus
Yifan Ding
Kohei Yatabe
Nobuyuki Morioka
Yu Zhang
Wei Han
Interspeech 2023 (2023)
This paper introduces a new speech dataset called "LibriTTS-R" designed for text-to-speech (TTS) use. It is derived by applying speech restoration to the LibriTTS corpus, which consists of 585 hours of speech data at a 24 kHz sampling rate from 2,456 speakers and the corresponding texts. The constituent samples of LibriTTS-R are identical to those of LibriTTS, with only the sound quality improved. Experimental results show that the ground-truth samples of LibriTTS-R have significantly improved sound quality compared to those of LibriTTS. In addition, a neural end-to-end TTS model trained on LibriTTS-R achieved speech naturalness on par with that of the ground-truth samples. The corpus is freely available for download from [URL-HERE]
Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech Representation and Linguistic Features
Yifan Ding
Kohei Yatabe
Nobuyuki Morioka
Yu Zhang
Wei Han
WASPAA 2023 (2023) (to appear)
Speech restoration (SR) is the task of converting degraded speech signals into high-quality ones. In this study, we propose a robust SR model called Miipher and apply it to a new SR application: increasing the amount of high-quality training data for speech generation by converting speech samples collected from the web to studio quality. To make our SR model robust against various types of degradation, we use (i) a speech representation extracted from w2v-BERT as the input feature, and (ii) linguistic features extracted from transcripts using PnG-BERT as conditioning features. Experiments show that the proposed model (i) is robust against various types of audio degradation, (ii) can restore samples in the LJSpeech dataset and improve the quality of text-to-speech (TTS) outputs without changing the model or hyper-parameters, and (iii) enables us to train a high-quality TTS model from restored speech samples collected from the web.
End-to-end speech recognition is a promising technology for enabling compact automatic speech recognition (ASR) systems since it can unify the acoustic and language model into a single neural network.
However, as a drawback, training of end-to-end speech recognizers always requires transcribed utterances.
Since end-to-end models are also known to be severely data hungry, this constraint is especially crucial because obtaining transcribed utterances is costly and can be impractical or even impossible.
This paper proposes a method for alleviating this issue by transferring knowledge from a language model neural network that can be pretrained with text-only data.
Specifically, this paper attempts to transfer semantic knowledge acquired in embedding vectors of large-scale language models.
Since embedding vectors can be regarded as implicit representations of linguistic information such as part-of-speech or intent, they are also expected to be useful modeling cues for ASR decoders.
This paper extends two types of ASR decoders, attention-based decoders and neural transducers, by modifying training loss functions to include embedding prediction terms.
The proposed systems were shown to be effective for error rate reduction without incurring extra computational costs in the decoding phase.
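The loss modification described above can be illustrated with a small numerical sketch. The names, the squared-error form of the embedding term, and the weighting below are assumptions of this sketch, not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, emb_dim, hidden = 100, 16, 24

# Pretrained language-model embeddings; frozen during ASR training.
lm_embeddings = rng.standard_normal((vocab, emb_dim))
# Trainable projection from the decoder state to the embedding space.
proj = rng.standard_normal((emb_dim, hidden)) * 0.1

def combined_loss(decoder_state, token_logits, target_token, weight=0.1):
    """Cross entropy plus an embedding-prediction regression term."""
    log_z = np.log(np.sum(np.exp(token_logits)))
    ce = log_z - token_logits[target_token]          # standard CE term
    predicted_emb = proj @ decoder_state             # embedding prediction
    emb_loss = np.sum((predicted_emb - lm_embeddings[target_token]) ** 2)
    return ce + weight * emb_loss

state = rng.standard_normal(hidden)
logits = rng.standard_normal(vocab)
print(combined_loss(state, logits, target_token=7))
```

Because the embedding term only changes the training objective, decoding proceeds exactly as before, which matches the abstract's claim of no extra decoding cost.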
WaveFit: An Iterative and Non-autoregressive Neural Vocoder based on Fixed-Point Iteration
Kohei Yatabe
Proc. IEEE Spoken Language Technology Workshop (SLT) (2022) (to appear)
Denoising diffusion probabilistic models (DDPMs) and generative adversarial networks (GANs) are popular generative models for neural vocoders. DDPMs and GANs can be characterized by the iterative denoising framework and adversarial training, respectively. This study proposes a fast and high-quality neural vocoder called WaveFit, which integrates the essence of GANs into a DDPM-like iterative framework based on fixed-point iteration. WaveFit iteratively denoises an input signal and trains a deep neural network (DNN) to minimize an adversarial loss calculated from intermediate outputs at all iterations. Subjective (side-by-side) listening tests showed no statistically significant differences in naturalness between human natural speech and speech synthesized by WaveFit with five iterations. Furthermore, the inference speed of WaveFit was more than 240 times faster than WaveRNN. Audio demos are available at google.github.io/df-conformer/wavefit/.
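The fixed-point view can be illustrated with a toy numerical sketch. The contraction map below is hand-made purely for illustration; in WaveFit the denoising map is a learned DNN trained with an adversarial loss at each iteration:

```python
import numpy as np

clean = np.sin(np.linspace(0.0, 8.0 * np.pi, 512))   # target waveform
rng = np.random.default_rng(0)
y = clean + rng.standard_normal(512)                 # noisy initial estimate

def f(y, step=0.5):
    """A hand-made contraction whose unique fixed point is `clean`."""
    return y + step * (clean - y)

# Fixed-point iteration y_{t+1} = f(y_t): each pass removes part of the
# residual noise, so the error shrinks geometrically toward y* = f(y*).
for t in range(5):                                   # WaveFit uses ~5 steps
    y = f(y)
    print(t, float(np.sqrt(np.mean((y - clean) ** 2))))
```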
SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping
Kohei Yatabe
Proc. Interspeech (2022) (to appear)
Neural vocoders based on the denoising diffusion probabilistic model (DDPM) have been improved by adapting the diffusion noise distribution to given acoustic features. In this study, we propose SpecGrad, which adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram. This adaptation by time-varying filtering improves the sound quality, especially in the high-frequency bands. It is processed in the time-frequency domain to keep the computational cost almost the same as conventional DDPM-based neural vocoders. Experimental results showed that SpecGrad generates higher-fidelity speech waveforms than conventional DDPM-based neural vocoders in both analysis-synthesis and speech enhancement scenarios. Audio demos are available at wavegrad.github.io/specgrad/.
SNRi Target Training for Joint Speech Enhancement and Recognition
Sankaran Panchapagesan
Proc. Interspeech (2022) (to appear)
Speech enhancement (SE) is used as a frontend in speech applications including automatic speech recognition (ASR) and telecommunication. A difficulty in using the SE frontend is that the appropriate noise reduction level differs depending on the application and/or noise characteristics. In this study, we propose "signal-to-noise ratio improvement (SNRi) target training": the SE frontend is trained to output a signal whose SNRi is controlled by an auxiliary scalar input. In joint training with a backend, the target SNRi value is estimated by an auxiliary network. By training all networks to minimize the backend task loss, we can estimate the appropriate noise reduction level for each noisy input in a data-driven scheme. Our experiments showed that SNRi target training enables control of the output SNRi. In addition, the proposed joint training relatively reduces word error rate by 4.0% and 5.7% compared to a Conformer-based standard ASR model and a conventional SE-ASR joint training model, respectively. Furthermore, by analyzing the predicted target SNRi, we observed that the jointly trained network automatically controls the target SNRi according to noise characteristics. Audio demos are available at google.github.io/df-conformer/snri_target/.
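The SNRi quantity being controlled is simply the output SNR minus the input SNR, both measured against a clean reference. A minimal sketch (helper names are illustrative):

```python
import numpy as np

def snr_db(signal, noise):
    """Signal-to-noise ratio in dB."""
    return 10.0 * np.log10(np.sum(signal ** 2) / np.sum(noise ** 2))

def snri_db(clean, noisy, enhanced):
    """SNR improvement: output SNR minus input SNR, against a clean reference."""
    input_snr = snr_db(clean, noisy - clean)
    output_snr = snr_db(clean, enhanced - clean)
    return output_snr - input_snr

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
noisy = clean + noise
enhanced = clean + 0.5 * noise     # a frontend that halves the noise amplitude
print(round(snri_db(clean, noisy, enhanced), 2))   # 6.02, i.e. 20*log10(2) dB
```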
A Comparative Study on Neural Architectures and Training Methods for Japanese Speech Recognition
Lion Jones
Interspeech 2021 (2021) (to appear)
End-to-end (E2E) modeling is advantageous for automatic speech recognition (ASR), especially for Japanese, since word-based tokenization of Japanese is not trivial and E2E modeling is able to model character sequences directly. This paper focuses on the latest E2E modeling techniques and investigates their performance on character-based Japanese ASR by conducting comparative experiments. The results are analyzed and discussed in order to understand the relative advantages of long short-term memory (LSTM) and Conformer models in combination with connectionist temporal classification, transducer, and attention-based loss functions. Furthermore, the paper investigates the effectiveness of recent training techniques such as data augmentation (SpecAugment), variational noise injection, and exponential moving average. The best configuration found in the paper achieved state-of-the-art character error rates of 4.1%, 3.2%, and 3.5% for the Corpus of Spontaneous Japanese (CSJ) eval1, eval2, and eval3 tasks, respectively. The system is also shown to be computationally efficient thanks to the efficiency of Conformer transducers.
DF-Conformer: Integrated architecture of Conv-TasNet and Conformer using linear complexity self-attention for speech enhancement
Lion Jones
Proc. IEEE Workshop Appl. Signal Process. Audio Acoust. (WASPAA) (2021)
The combination of a trainable filterbank and a mask prediction network is a strong framework for single-channel speech enhancement (SE). Since the denoising performance and computational efficiency are mainly affected by the structure of the mask prediction network, we aim to improve this network. In this study, by focusing on a similarity between the structures of Conv-TasNet and the Conformer, we integrate the Conformer into SE as a mask prediction network to benefit from its powerful sequential modeling ability. To reduce the computational complexity and improve local sequential modeling, we extend the Conformer using linear-complexity attention and stacked 1-D dilated depthwise convolution layers. Experimental results show that (i) the use of linear-complexity attention avoids high computational complexity, and (ii) our model achieves a higher scale-invariant signal-to-noise ratio than the improved time-dilated convolution network (TDCN++), an extended version of Conv-TasNet.
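The evaluation metric mentioned above, scale-invariant signal-to-noise ratio (SI-SNR), projects the estimate onto the reference before measuring residual energy, so overall gain is not penalized. A minimal reference implementation (the eps handling is an implementation choice):

```python
import numpy as np

def si_snr_db(estimate, reference, eps=1e-8):
    """Scale-invariant SNR: project the estimate onto the reference first."""
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference                 # scaled reference component
    residual = estimate - target               # everything else is "noise"
    return 10.0 * np.log10((np.sum(target ** 2) + eps) /
                           (np.sum(residual ** 2) + eps))

rng = np.random.default_rng(1)
ref = rng.standard_normal(8000)
print(si_snr_db(2.0 * ref, ref) > 100.0)       # rescaling is not penalized
print(si_snr_db(ref + 0.1 * rng.standard_normal(8000), ref))   # roughly 20 dB
```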
This paper proposes methods to improve a commonly used end-to-end speech recognition model, Listen-Attend-Spell (LAS). The methods we propose use multi-task learning to improve generalization of the model by leveraging information from multiple labels. The focus in this paper is on multi-task models for simultaneous signal-to-grapheme and signal-to-phoneme conversion while sharing the encoder parameters. Since phonemes are designed to be a precise description of the linguistic aspects of the speech signal, using phoneme recognition as an auxiliary task can help guide the early stages of training to be more stable. In addition to conventional multi-task learning, we obtain further improvements by introducing a method that can exploit dependencies between labels in different tasks. Specifically, the dependencies between phoneme and grapheme sequences are considered. In conventional multi-task learning these sequences are assumed to be independent. Instead, in this paper, a joint model is proposed based on "iterative refinement", where dependency modeling is achieved by a multi-pass strategy. The proposed method is evaluated on a 28,000-hour corpus of Japanese speech data. Performance of a conventional multi-task approach is contrasted with that of the joint model with iterative refinement.
Spectral distortion model for training phase-sensitive deep-neural networks for far-field speech recognition
Chanwoo Kim
Rajeev Nongpiur
ICASSP 2018 (2018)
In this paper, we present an algorithm which introduces phase perturbation to the training database when training phase-sensitive deep neural-network models. Traditional features such as log-mel or cepstral features do not have any phase-relevant information. However, more recent features such as raw-waveform or complex spectra features contain phase-relevant information. Phase-sensitive features have the advantage of being able to detect differences in time of arrival across different microphone channels or frequency bands. However, compared to magnitude-based features, phase information is more sensitive to various kinds of distortions such as variations in microphone characteristics, reverberation, and so on. For traditional magnitude-based features, it is widely known that adding noise or reverberation, often called Multistyle TRaining (MTR), improves robustness. In a similar spirit, we propose an algorithm which introduces spectral distortion to make the deep-learning model more robust against phase distortion. We call these approaches Spectral-Distortion TRaining (SDTR) and Phase-Distortion TRaining (PDTR). In our experiments using a training set consisting of 22 million utterances, this approach has proved to be quite successful in reducing word error rates on test sets recorded with real microphones on Google Home.
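A minimal sketch of phase perturbation in the spirit of PDTR: keep each frame's magnitude spectrum and jitter only its phase before resynthesis. The 0.2-radian noise level is an arbitrary illustrative choice, not the paper's setting:

```python
import numpy as np

rng = np.random.default_rng(0)
frame = np.hanning(256) * rng.standard_normal(256)   # one windowed frame

spec = np.fft.rfft(frame)
magnitude, phase = np.abs(spec), np.angle(spec)

# Add small random phase noise, keeping the DC and Nyquist bins untouched
# so the resynthesized frame remains a valid real signal.
noise = rng.normal(0.0, 0.2, phase.shape)
noise[0] = noise[-1] = 0.0
perturbed = magnitude * np.exp(1j * (phase + noise))
augmented = np.fft.irfft(perturbed, n=256)

# The magnitude spectrum is numerically unchanged; only the phase moved.
print(np.allclose(np.abs(np.fft.rfft(augmented)), magnitude))
```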
State-of-the-art Speech Recognition With Sequence-to-Sequence Models
Chung-Cheng Chiu
Patrick Nguyen
Katya Gonina
Navdeep Jaitly
Jan Chorowski
ICASSP (2018) (to appear)
Attention-based encoder-decoder architectures such as Listen, Attend, and Spell (LAS) subsume the acoustic, pronunciation and language model components of a traditional automatic speech recognition (ASR) system into a single neural network. In our previous work, we have shown that such architectures are comparable to state-of-the-art ASR systems on dictation tasks, but it was not clear if such architectures would be practical for more challenging tasks such as voice search. In this work, we explore a variety of structural and optimization improvements to our LAS model which significantly improve performance. On the structural side, we show that word piece models can be used instead of graphemes. We introduce a multi-head attention architecture, which offers improvements over the commonly used single-head attention. On the optimization side, we explore techniques such as synchronous training, scheduled sampling, label smoothing, and minimum word error rate optimization, which are all shown to improve accuracy. We present results with a unidirectional LSTM encoder for streaming recognition. On a 12,500-hour voice search task, we find that the proposed changes improve the WER of the LAS system from 9.2% to 5.6%, while the best conventional system achieves 6.7% WER. We also test both models on a dictation dataset, where our model provides 4.1% WER while the conventional system provides 5.0% WER.
In this paper, we present an algorithm called Reliable Mask Selection-Phase Difference Channel Weighting (RMS-PDCW), which selects the target source masked by a noise source using the angle of arrival (AoA) information calculated from the phase difference information. The RMS-PDCW algorithm selects masks to apply using the information about the localized sound source and the onset detection of speech. We demonstrate that this algorithm shows a relative 5.3 percent improvement over the baseline acoustic model, which was multistyle-trained using 22 million utterances, on a simulated test set consisting of real-world and interfering-speaker noise with reverberation time distributed between 0 ms and 900 ms and SNR distributed from 0 dB up to clean.
Toward Domain-Invariant Speech Recognition via Large Scale Training
Mohamed (Mo) Elfeky
SLT, IEEE (2018)
Current state-of-the-art automatic speech recognition systems are trained to work in specific ‘domains’, defined based on factors like application, sampling rate and codec. When such recognizers are used in conditions that do not match the training domain, performance significantly drops. In this paper, we explore the idea of building a single domain-invariant model that works well for varied use-cases. We do this by combining large scale training data from multiple application domains. Our final system is trained using 162,000 hours of speech. Additionally, each utterance is artificially distorted during training to simulate effects like background noise, codec distortion, and sampling rates. Our results show that, even at such a scale, a model thus trained works almost as well as those fine-tuned to specific subsets: A single model can be trained to be robust to multiple application domains, and other variations like codecs and noise. Such models also generalize better to unseen conditions and allow for rapid adaptation to new domains – we show that by using as little as 10 hours of data for adapting a domain-invariant model to a new domain, we can match performance of a domain-specific model trained from scratch using roughly 70 times as much data. We also highlight some of the limitations of such models and areas that need addressing in future work.
From audio to semantics: Approaches to end-to-end spoken language understanding
Galen Chuang
Delia Qu
Spoken Language Technology Workshop (SLT), 2018 IEEE
Conventional spoken language understanding systems consist of two main components: an automatic speech recognition module that converts audio to text, and a natural language understanding module that transforms the resulting text (or top-N hypotheses) into a set of intents and arguments. These modules are typically optimized independently. In this paper, we formulate audio-to-semantic understanding as a sequence-to-sequence problem. We propose and compare various encoder-decoder based approaches that optimize both modules jointly, in an end-to-end manner. We evaluate these methods on a real-world task. Our results show that having an intermediate text representation while jointly optimizing the full system improves accuracy of prediction.
Domain Adaptation Using Factorized Hidden Layer for Robust Automatic Speech Recognition
Interspeech (2018), pp. 892-896
Domain robustness is a challenging problem for automatic speech recognition (ASR). In this paper, we consider speech data collected for different applications as separate domains and investigate the robustness of acoustic models trained on multi-domain data on unseen domains. Specifically, we use Factorized Hidden Layer (FHL) as a compact low-rank representation to adapt a multi-domain ASR system to unseen domains. Experimental results on two unseen domains show that FHL is a more effective adaptation method compared to selectively fine-tuning part of the network, without dramatically increasing the model parameters. Furthermore, we found that using singular value decomposition to initialize the low-rank bases of an FHL model leads to a faster convergence and improved performance.
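The FHL idea of a compact low-rank, domain-dependent correction to a shared weight matrix can be sketched as follows. The shapes, the diagonal parameterization, and the random initialization are illustrative assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, rank = 64, 32, 4

W = rng.standard_normal((n_out, n_in)) * 0.1   # shared multi-domain weights
U = rng.standard_normal((n_out, rank)) * 0.1   # low-rank bases (the paper
V = rng.standard_normal((rank, n_in)) * 0.1    # initializes such bases via SVD)

def fhl_forward(x, d):
    """Hidden layer with a low-rank, domain-dependent weight correction."""
    W_d = W + U @ np.diag(d) @ V               # W_d = W + U diag(d) V
    return np.tanh(W_d @ x)

x = rng.standard_normal(n_in)
# With the domain coefficients d at zero, the shared model is recovered;
# adapting to a new domain only requires learning the `rank` entries of d.
print(np.allclose(fhl_forward(x, np.zeros(rank)), np.tanh(W @ x)))
```

This shows why the method adapts "without dramatically increasing the model parameters": each new domain adds only `rank` coefficients rather than a full weight matrix.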
This article introduces and evaluates Sampled Connectionist Temporal Classification (CTC), which connects the CTC criterion to the cross entropy (CE) objective through sampling. Instead of computing the logarithm of the sum of the alignment path likelihoods, at each training step sampled CTC only computes the CE loss between the sampled alignment path and the model posteriors. It is shown that the sampled CTC objective is an unbiased estimator of an upper bound on the CTC loss; thus minimization of sampled CTC is equivalent to minimization of the upper bound of the CTC objective. The definition of the sampled CTC objective has the advantage that it scales computationally to massive datasets on accelerated computation machines. Sampled CTC is compared with CTC on two large-scale speech recognition tasks, and it is shown that sampled CTC can achieve WER performance similar to the best CTC baseline in about one fourth of the training time of the CTC baseline.
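The core idea, replacing the sum over all alignment paths with the cross entropy against one sampled path, can be sketched as follows. The uniform position sampler below is an illustrative stand-in for the paper's sampling scheme, and it ignores CTC's repeated-label merging rule:

```python
import numpy as np

BLANK = 0

def sample_alignment(labels, num_frames, rng):
    """Place the labels at random (sorted) frame positions; blanks elsewhere."""
    positions = np.sort(rng.choice(num_frames, size=len(labels), replace=False))
    path = np.full(num_frames, BLANK)
    path[positions] = labels
    return path

def sampled_ctc_loss(log_probs, labels, rng):
    """Cross entropy between one sampled alignment path and the posteriors."""
    path = sample_alignment(labels, log_probs.shape[0], rng)
    return -float(np.sum(log_probs[np.arange(len(path)), path]))

rng = np.random.default_rng(0)
T, V = 8, 5                                   # frames, vocabulary (0 = blank)
logits = rng.standard_normal((T, V))
log_probs = logits - np.log(np.sum(np.exp(logits), axis=1, keepdims=True))
print(sampled_ctc_loss(log_probs, np.array([2, 3, 1]), rng))
```

The computational appeal is visible in the sketch: one sampled path costs a single gather and sum per step, instead of the forward-backward recursion over all paths.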
Raw Multichannel Processing Using Deep Neural Networks
Kean Chin
Chanwoo Kim
New Era for Robust Speech Recognition: Exploiting Deep Learning, Springer (2017)
Multichannel ASR systems commonly separate speech enhancement, including localization, beamforming and postfiltering, from acoustic modeling. In this chapter, we perform multi-channel enhancement jointly with acoustic modeling in a deep neural network framework. Inspired by beamforming, which leverages differences in the fine time structure of the signal at different microphones to filter energy arriving from different directions, we explore modeling the raw time-domain waveform directly. We introduce a neural network architecture which performs multichannel filtering in the first layer of the network and show that this network learns to be robust to varying target speaker direction of arrival, performing as well as a model that is given oracle knowledge of the true target speaker direction. Next, we show how performance can be improved by factoring the first layer to separate the multichannel spatial filtering operation from a single channel filterbank which computes a frequency decomposition. We also introduce an adaptive variant, which updates the spatial filter coefficients at each time frame based on the previous inputs. Finally we demonstrate that these approaches can be implemented more efficiently in the frequency domain. Overall, we find that such multichannel neural networks give a relative word error rate improvement of more than 5% compared to a traditional beamforming-based multichannel ASR system and more than 10% compared to a single channel waveform model.
In this paper, we describe how to efficiently implement an acoustic room simulator to generate large-scale simulated data for training deep neural networks. Even though the Google Room Simulator in [1] was shown to be quite effective in reducing Word Error Rates (WERs) for far-field applications by generating simulated far-field training sets, it requires a very large number of Fast Fourier Transforms (FFTs) of large size. The Room Simulator in [1] used approximately 80 percent of Central Processing Unit (CPU) usage in our CPU + Graphics Processing Unit (GPU) training architecture [2]. In this work, we implement efficient OverLap Addition (OLA) based filtering using the open-source FFTW3 library. Further, we investigate the effects of Room Impulse Response (RIR) lengths. Experimentally, we conclude that we can cut the tail portions of RIRs whose power is less than 20 dB below the maximum power without sacrificing speech recognition accuracy. However, we observe that cutting the RIR tail beyond this threshold harms speech recognition accuracy for rerecorded test sets. Using these approaches, we were able to reduce CPU usage for the room simulator portion down to 9.69 percent in the CPU/GPU training architecture. Profiling results show that we obtain a 22.4 times speed-up on a single machine and a 37.3 times speed-up on Google's distributed training infrastructure.
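The overlap-add filtering at the heart of the speed-up can be sketched with NumPy FFTs (FFTW3 and the production pipeline are not used here; this only shows the block-wise frequency-domain convolution, verified against direct convolution):

```python
import numpy as np

def ola_filter(signal, rir, block=256):
    """FIR filtering with a room impulse response via FFT overlap-add."""
    n_fft = 1
    while n_fft < block + len(rir) - 1:        # room for linear convolution
        n_fft *= 2
    rir_spec = np.fft.rfft(rir, n_fft)
    out = np.zeros(len(signal) + len(rir) - 1)
    for start in range(0, len(signal), block):
        seg = signal[start:start + block]
        # Zero-padded FFT product = linear convolution of this block.
        conv = np.fft.irfft(np.fft.rfft(seg, n_fft) * rir_spec, n_fft)
        stop = min(start + n_fft, len(out))
        out[start:stop] += conv[: stop - start]   # overlap-add the tails
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
rir = rng.standard_normal(64) * np.exp(-np.arange(64) / 16.0)  # decaying RIR
print(np.allclose(ola_filter(x, rir), np.convolve(x, rir)))
```

Truncating the low-energy RIR tail, as the paper does, shortens `rir` and therefore shrinks the FFT size each block requires.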
Acoustic Modeling for Google Home
Joe Caroselli
Kean Chin
Chanwoo Kim
Mitchel Weintraub
Erik McDermott
INTERSPEECH 2017 (2017)
This paper describes the technical and system building advances made to the Google Home multichannel speech recognition system, which was launched in November 2016. Technical advances include an adaptive dereverberation frontend, the use of neural network models that do multichannel processing jointly with acoustic modeling, and grid LSTMs to model frequency variations. On the system level, improvements include adapting the model using Google Home specific data. We present results on a variety of multichannel sets. The combination of technical and system advances results in a reduction of WER of over 18% relative compared to the current production system.
Multichannel Signal Processing with Deep Neural Networks for Automatic Speech Recognition
Kean Chin
Chanwoo Kim
IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25 (2017), pp. 965-979
Multichannel ASR systems commonly separate speech enhancement, including localization, beamforming and postfiltering, from acoustic modeling. In this paper, we perform multichannel enhancement jointly with acoustic modeling in a deep neural network framework. Inspired by beamforming, which leverages differences in the fine time structure of the signal at different microphones to filter energy arriving from different directions, we explore modeling the raw time-domain waveform directly. We introduce a neural network architecture which performs multichannel filtering in the first layer of the network and show that this network learns to be robust to varying target speaker direction of arrival, performing as well as a model that is given oracle knowledge of the true target speaker direction. Next, we show how performance can be improved by factoring the first layer to separate the multichannel spatial filtering operation from a single channel filterbank which computes a frequency decomposition. We also introduce an adaptive variant, which updates the spatial filter coefficients at each time frame based on the previous inputs. Finally, we demonstrate that these approaches can be implemented more efficiently in the frequency domain. Overall, we find that such multichannel neural networks give a relative word error rate improvement of more than 5% compared to a traditional beamforming-based multichannel ASR system and more than 10% compared to a single channel waveform model.
Generation of large-scale simulated utterances in virtual rooms to train deep-neural networks for far-field speech recognition in Google Home
Chanwoo Kim
Kean Chin
Thad Hughes
Interspeech 2017 (2017), pp. 379-383
We describe the structure and application of an acoustic room simulator to generate large-scale simulated data for training deep neural networks for far-field speech recognition. The system simulates millions of different room dimensions, a wide distribution of reverberation times and signal-to-noise ratios, and a range of microphone and sound source locations. We start with a relatively clean training set as the source and artificially create simulated data by randomly sampling a noise configuration for every new training example. As a result, the acoustic model is trained using examples that are virtually never repeated. We evaluate performance of this approach based on room simulation using a factored complex Fast Fourier Transform (CFFT) acoustic model introduced in our earlier work, which uses CFFT layers and LSTM AMs for joint multichannel processing and acoustic modeling. Results show that the simulator-driven approach is quite effective in obtaining large improvements not only in simulated test conditions, but also in real/rerecorded conditions. This room simulation system has been employed in training acoustic models including the ones for the recently released Google Home.
This article discusses strategies for end-to-end training of state-of-the-art acoustic models for Large Vocabulary Continuous Speech Recognition (LVCSR), with the goal of leveraging TensorFlow components so as to make efficient use of large-scale training sets, large model sizes, and high-speed computation units such as Graphics Processing Units (GPUs). Benchmarks are presented that evaluate the efficiency of different approaches to batching of training data, unrolling of recurrent acoustic models, and device placement of TensorFlow variables and operations. An overall training architecture developed in light of those findings is then described. The approach makes it possible to take advantage of both data parallelism and high-speed computation on GPU for state-of-the-art sequence training of acoustic models. The effectiveness of the design is evaluated for different training schemes and model sizes, on a 20,000-hour Voice Search task.
Factored Spatial and Spectral Multichannel Raw Waveform CLDNNs
International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE (2016)
State-of-the-art automatic speech recognition (ASR) systems typically rely on pre-processed features. This paper studies the time-frequency duality in ASR feature extraction methods and proposes extending the standard acoustic model with a complex-valued linear projection layer to learn and optimize features that minimize standard cost functions such as cross entropy. The proposed Complex Linear Projection (CLP) features achieve superior performance compared to pre-processed log-mel features.
Joint multichannel enhancement and acoustic modeling using neural networks has shown promise over the past few years. However, one shortcoming of previous work [1,2,3] is that the filters learned during training are fixed for decoding, potentially limiting the ability of these models to adapt to previously unseen or changing conditions. In this paper we explore a neural network adaptive beamforming (NAB) technique to address this issue. Specifically, we use LSTM layers to predict time domain beamforming filter coefficients at each input frame. These filters are convolved with the framed time domain input signal and summed across channels, essentially performing FIR filter-and-sum beamforming using the dynamically adapted filter. The beamformer output is passed into a waveform CLDNN acoustic model [4] which is trained jointly with the filter prediction LSTM layers. We find that the proposed NAB model achieves a 12.7% relative improvement in WER over a single channel model [4] and reaches similar performance to a "factored" model architecture which utilizes several fixed spatial filters [3] on a 2,000-hour Voice Search task, with a 17.9% decrease in computational cost.
Large Vocabulary Automatic Speech Recognition for Children
Melissa Carroll
Noah Coccaro
Qi-Ming Jiang
Interspeech (2015)
Recently, Google launched YouTube Kids, a mobile application for children that uses a speech recognizer built specifically for recognizing children's speech. In this paper we present techniques we explored to build such a system. We describe the use of a neural network classifier to identify matched acoustic training data, and filtering of data for language modeling to reduce the chance of producing offensive results. We also compare long short-term memory (LSTM) recurrent networks to convolutional, LSTM, deep neural networks (CLDNNs). We found that a CLDNN acoustic model outperforms an LSTM across a variety of different conditions, but does not model child speech relatively better than adult speech. Overall, these findings allow us to build a successful, state-of-the-art large vocabulary speech recognizer for both children and adults.
In this paper, we present a new dereverberation algorithm called Temporal Masking and Thresholding (TMT) to enhance the temporal spectra of spectral features for robust speech recognition in reverberant environments. This algorithm is motivated by the precedence effect and temporal masking of human auditory perception. This work is an improvement of our previous dereverberation work called Suppression of Slowly-varying components and the Falling edge of the power envelope (SSF). The TMT algorithm uses a different mathematical model to characterize temporal masking and thresholding compared to the model that had been used to characterize the SSF algorithm. Specifically, the nonlinear highpass filtering used in the SSF algorithm has been replaced by a masking mechanism based on a combination of peak detection and dynamic thresholding. Speech recognition results show that the TMT algorithm provides superior recognition accuracy compared to other algorithms such as LTLSS, VTS, or SSF in reverberant environments.
Context Dependent State Tying for Speech Recognition using Deep Neural Network Acoustic Models
Proceedings of the International Conference on Acoustics,Speech and Signal Processing (2014)
GMM-Free DNN Training
Proceedings of the International Conference on Acoustics,Speech and Signal Processing (2014)
Asynchronous Stochastic Optimization for Sequence Training of Deep Neural Networks
Erik McDermott
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), IEEE, Firenze, Italy (2014)
This paper explores asynchronous stochastic optimization for sequence training of deep neural networks. Sequence training requires more computation than frame-level training using pre-computed frame data. This leads to several complications for stochastic optimization, arising from significant asynchrony in model updates under massive parallelization, and limited data shuffling due to utterance-chunked processing. We analyze the impact of these two issues on the efficiency and performance of sequence training. In particular, we suggest a framework to formalize the reasoning about the asynchrony and present experimental results on both small and large scale Voice Search tasks to validate the effectiveness and efficiency of asynchronous stochastic optimization.
Asynchronous, Online, GMM-free Training of a Context Dependent Acoustic Model for Speech Recognition
Proceedings of the European Conference on Speech Communication and Technology (2014) (to appear)
Rapid Adaptation for Mobile Speech Applications
Proceedings of the International Conference on Acoustics, Speech and Signal Processing (2013)
In large vocabulary continuous speech recognition, decision trees are widely used to cluster triphone states. In addition to the commonly used phonetically based questions, others have proposed additional questions such as phone position within the word or syllable. This paper examines using the word or syllable context itself as a feature in the decision tree, providing an elegant way of introducing word- or syllable-specific models into the system. Positive results are reported on two state-of-the-art systems, voicemail transcription and a search-by-voice task, across a variety of acoustic model and training set sizes.
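The idea of extending a phonetic question set with word-context questions could be illustrated as follows; the balance-based split score is a toy stand-in for the likelihood-gain criterion real state-tying trees use, and the feature names are hypothetical:

```python
def best_question(states, questions):
    """Pick the question whose yes/no split of the states is most
    balanced (a toy proxy for the usual likelihood-gain criterion)."""
    def balance(q):
        yes = sum(1 for s in states if q(s))
        return -abs(2 * yes - len(states))
    return max(questions, key=balance)

# Phonetic-context questions plus a word-identity question (hypothetical set).
questions = [
    lambda s: s["left"] in {"p", "t", "k"},   # is the left phone a stop?
    lambda s: s["right"] in {"a", "e", "i"},  # is the right phone a vowel?
    lambda s: s["word"] == "the",             # word-context question
]
```

When the word identity separates the states better than any phonetic question, the tree effectively grows word-specific models, which is the mechanism the abstract describes.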
ISCA Student panel presentation slides
Restoring Punctuation and Capitalization in Transcribed Speech
Agustín Gravano
Martin Jansche
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2009), pp. 4741-4744
Adding punctuation and capitalization greatly improves the readability of automatic speech transcripts. We discuss an approach for performing both tasks in a single pass using a purely text-based n-gram language model. We study the effect on performance of varying the n-gram order (from n = 3 to n = 6) and the amount of training data (from 58 million to 55 billion tokens). Our results show that using larger training data sets consistently improves performance, while increasing the n-gram order does not help nearly as much.
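A toy single-pass sketch of this approach, assuming a hypothetical `lm_score` log-probability lookup in place of a real n-gram model; it restores punctuation greedily and omits capitalization for brevity:

```python
def restore(words, lm_score, events=("", ",", ".", "?")):
    """Single-pass sketch: after each word, insert the punctuation event
    (possibly none) that the language model scores highest. lm_score is
    a hypothetical stand-in for an n-gram log-probability lookup."""
    out = []
    for w in words:
        out.append(w)
        best = max(events, key=lambda e: lm_score(tuple(out), e))
        if best:  # the empty event "" means no punctuation here
            out.append(best)
    return out
```

Treating punctuation marks (and, in the paper, case variants) as ordinary tokens lets a single text-based n-gram model handle both tasks in one decoding pass.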
An Audio Indexing System for Election Video Material
Christopher Alberti
Ari Bezman
Anastassia Drofa
Ted Power
Arnaud Sahuguet
Maria Shugrina
Proceedings of ICASSP (2009), pp. 4873-4876
In the 2008 presidential election race in the United States, the prospective candidates made extensive use of YouTube to post video material. We developed a scalable system that transcribes this material and makes the content searchable (by indexing the meta-data and transcripts of the videos) and allows the user to navigate through the video material based on content. The system is available as an iGoogle gadget as well as a Labs product. Given the large exposure, special emphasis was put on the scalability and reliability of the system. This paper describes the design and implementation of this system.
Confidence Scores for Acoustic Model Adaptation
C. Gollan
Proceedings of the International Conference on Acoustics, Speech and Signal Processing (2008)
MAP adaptation of stochastic grammars
Fast vocabulary-independent audio search using path-based graph indexing
Meta-data Conditional Language Modeling
Proceedings of the International Conference on Acoustics, Speech and Signal Processing (2004)
Improved name recognition with meta-data dependent name networks
S. Maskey
Proceedings of the International Conference on Acoustics, Speech and Signal Processing (2004)
Language model adaptation with MAP estimation and the perceptron algorithm
Supervised and unsupervised PCFG adaptation to novel domains
Unsupervised Language Model Adaptation
Proceedings of the International Conference on Acoustics, Speech and Signal Processing (2003)
Combining Maximum Likelihood and Maximum A Posteriori Estimation for Detailed Acoustic Modeling of Context Dependency
Proceedings of the International conference on Spoken Language Processing (2002), pp. 2593-2596
SCANMail: a voicemail interface that makes speech browsable, readable and searchable
Steve Whittaker
Julia Hirschberg
Brian Amento
Litza A. Stark
Philip L. Isenhour
Larry Stead
Gary Zamchick
Aaron Rosenberg
CHI (2002), pp. 275-282
Caller Identification for the SCANMail Voicemail Browser
A. Rosenberg
J. Hirschberg
S. Parthasarathy
P. Isenhour
L. Stead
Proceedings of the European Conference on Speech Communication and Technology (2001)
Audio Browsing and Search in the Voicemail Domain
SCANMail: Browsing and Searching Speech Data by Content
J. Hirschberg
D. Hindle
P. Isenhour
A. Rosenberg
L. Stark
L. Stead
S. Whittaker
G. Zamchick
Proceedings of the European Conference on Speech Communication and Technology (2001)
SCANMail: Audio Navigation in the Voicemail Domain
J. Hirschberg
A. Rosenberg
S. Whittaker
D. Hindle
P. Isenhour
M. Jones
L. Stark
G. Zamchick
Proceedings of the workshop on Human Language Technology (2001)
Using Maximum Likelihood Linear Regression for Segment Clustering and Speaker Identification
Proceedings of the International conference on Spoken Language Processing (2000), pp. 536-539
Joint Lexicon, Acoustic Unit Inventory and Model Design
AT&T at TREC-8
Amit Singhal
Steven P. Abney
Donald Hindle
TREC (1999)
Using Automatically-Derived Acoustic Sub-word Units in Large Vocabulary Speech Recognition
M. Ostendorf
Proceedings of the International conference on Spoken Language Processing (1998)
Joint Acoustic Unit Design and Lexicon Generation
M. Ostendorf
Proceedings of the ESCA Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition (1998), pp. 7-12
Design of a Speech Recognition System based on Non-Uniform Segmental Units
M. Ostendorf
Y. Sagisaka
K.K. Paliwal
Proceedings of the International Conference on Acoustics, Speech and Signal Processing, IEEE (1996), pp. 443-446
Modeling Systematic Variations in Pronunciation via a Language-Dependent Hidden Speaking Mode
M. Ostendorf
B. Byrne
M. Finke
A. Gunawardana
K. Ross
S. Roweis
E. Shriberg
D. Talkin
A. Waibel
B. Wheatley
T. Zeppenfeld
Proceedings of the International conference on Spoken Language Processing (1996)
Unsupervised Learning of Non-Uniform Segmental Units for Acoustic Modeling in Speech Recognition
M. Ostendorf
Y. Sagisaka
K.K. Paliwal
Proceedings of the IEEE workshop on Automatic Speech Recognition, IEEE (1995), pp. 141-142
Simultaneous Design of Feature Extractor and Pattern Classifier using the Minimum Classification Error Training Algorithm
K.K. Paliwal
Y. Sagisaka
Proceedings of the IEEE workshop on Neural Networks for Signal Processing, IEEE (1995), pp. 67-76
Minimum Classification Error Training for Feature Extraction and Pattern Classification in Speech Recognition
Optimization of time-frequency masking filters using the minimum error classification criterion
Kiyoaki Aikawa
Proceedings of the International Conference on Acoustics, Speech and Signal Processing, IEEE (1994), pp. 485-488