Zelin Wu
Research Areas
Authored Publications
Preview abstract
We propose a new two-pass E2E speech recognition model that improves ASR performance by training on a combination of paired data and unpaired text data. Previously, the joint acoustic and text decoder (JATD) has shown promising results through the use of text data during model training and the recently introduced deliberation architecture has reduced recognition errors by leveraging first-pass decoding results. Our method, dubbed Deliberation-JATD, combines the spelling correcting abilities of deliberation with JATD’s use of unpaired text data to further improve performance. The proposed model produces substantial gains across multiple test sets, especially those focused on rare words, where it reduces word error rate (WER) by between 12% and 22.5% relative. This is done without increasing model size or requiring multi-stage training, making Deliberation-JATD an efficient candidate for on-device applications.
View details
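For readers curious how a deliberation-style second pass attends over both the acoustic encoder and first-pass hypotheses, here is a minimal sketch, assuming hypothetical module names, dimensions, and attention layers; it is not the authors' Deliberation-JATD implementation.

```python
# Minimal sketch (not the authors' code): a deliberation-style second pass that
# attends over acoustic encoder outputs and embedded first-pass hypotheses.
# Module names, dimensions, and the dropout-free setup are illustrative assumptions.
import torch
import torch.nn as nn

class DeliberationDecoderSketch(nn.Module):
    def __init__(self, vocab_size=4096, d_model=256, n_heads=4):
        super().__init__()
        self.hyp_embed = nn.Embedding(vocab_size, d_model)   # embeds first-pass tokens
        self.tgt_embed = nn.Embedding(vocab_size, d_model)   # embeds second-pass prefix
        self.acoustic_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.hypothesis_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(2 * d_model, vocab_size)

    def forward(self, encoder_out, first_pass_tokens, target_prefix):
        # encoder_out: (B, T_audio, d_model) acoustic encoder frames
        # first_pass_tokens: (B, U1) token ids from the first-pass beam
        # target_prefix: (B, U2) token ids decoded so far in the second pass
        hyp = self.hyp_embed(first_pass_tokens)
        query = self.tgt_embed(target_prefix)
        ctx_audio, _ = self.acoustic_attn(query, encoder_out, encoder_out)
        ctx_hyp, _ = self.hypothesis_attn(query, hyp, hyp)
        return self.proj(torch.cat([ctx_audio, ctx_hyp], dim=-1))  # (B, U2, vocab)

# In JATD-style text-only training, one could drop or zero the acoustic context and
# train the same decoder as a language model on unpaired text (an assumption here).
```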
Preview abstract
Recurrent Neural Network Transducer (RNN-T) models [1] for automatic speech recognition (ASR) provide high-accuracy speech recognition. Such end-to-end (E2E) models combine the acoustic, pronunciation and language models (AM, PM, LM) of a conventional ASR system into a single neural network, dramatically reducing complexity and model size. In this paper, we propose a technique for incorporating contextual signals, such as intelligent assistant device state or dialog state, directly into RNN-T models. We explore different encoding methods and demonstrate that RNN-T models can effectively utilize such context. Our technique results in a reduction in Word Error Rate (WER) of up to 10.4% relative on a variety of contextual recognition tasks. We also demonstrate that proper regularization can be used to model context independently for improved overall quality.
View details
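As an illustration of one possible encoding method for such contextual signals (the paper's exact scheme is not reproduced here), the sketch below concatenates a learned context embedding to the RNN-T prediction-network input; all names and sizes are assumptions.

```python
# Illustrative sketch only: inject a categorical context signal (e.g., a dialog-state
# id) into an RNN-T prediction network by concatenating a learned context embedding
# to each token embedding. Names and sizes are assumptions.
import torch
import torch.nn as nn

class ContextAwarePredictionNetwork(nn.Module):
    def __init__(self, vocab_size=4096, num_contexts=8, d_token=512, d_ctx=64):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_token)
        self.context_embed = nn.Embedding(num_contexts, d_ctx)
        self.rnn = nn.LSTM(d_token + d_ctx, 640, batch_first=True)

    def forward(self, labels, context_id):
        # labels: (B, U) previously emitted tokens; context_id: (B,) categorical signal
        tok = self.token_embed(labels)                      # (B, U, d_token)
        ctx = self.context_embed(context_id)                # (B, d_ctx)
        ctx = ctx.unsqueeze(1).expand(-1, tok.size(1), -1)  # broadcast over U
        out, _ = self.rnn(torch.cat([tok, ctx], dim=-1))
        return out                                          # fed to the joint network

# The "proper regularization" mentioned in the abstract might correspond to, e.g.,
# randomly zeroing ctx during training so the model also works without context
# (an assumption, not the paper's stated method).
```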
Speech Recognition with Augmented Synthesized Speech
Pedro Moreno
Ye Jia
Yu Zhang
ASRU 2019 (to appear)
Preview abstract
Recent success of the Tacotron speech synthesis architecture and its variants in producing natural-sounding multi-speaker synthesized speech has raised the exciting possibility of replacing the expensive, manually transcribed, domain-specific human speech used to train speech recognizers. The multi-speaker speech synthesis architecture can learn latent embedding spaces of prosody, speaker and style variations derived from input acoustic representations, thereby allowing for manipulation of the synthesized speech. In this paper, we evaluate the feasibility of enhancing speech recognition performance with synthesized speech, using two corpora from different domains. We explore algorithms to provide the acoustic and lexical diversity needed for robust speech recognition. Finally, we demonstrate the feasibility of this approach as a data augmentation strategy for domain transfer.
View details
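The data-augmentation idea can be pictured as a mixing step between real transcribed utterances and TTS audio synthesized from domain text. The sketch below is illustrative only, with synthesize() standing in for a multi-speaker Tacotron-style system; it is not the paper's pipeline.

```python
# Sketch of the data-augmentation idea: interleave manually transcribed utterances
# with TTS audio synthesized from domain text, varying the synthetic speaker for
# acoustic diversity. synthesize() is a placeholder (assumption).
import random

def synthesize(text, speaker_id):
    """Placeholder for a multi-speaker TTS call; would return waveform samples."""
    raise NotImplementedError

def mixed_training_examples(real_utts, domain_texts, speaker_ids, synth_ratio=0.5):
    """Yield (audio, transcript) pairs, drawing a fraction from synthesized speech."""
    while True:
        if random.random() < synth_ratio and domain_texts:
            text = random.choice(domain_texts)
            spk = random.choice(speaker_ids)   # vary speakers for acoustic diversity
            yield synthesize(text, spk), text
        else:
            yield random.choice(real_utts)     # (audio, transcript) tuple
```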
Preview abstract
Recognizing written domain numeric utterances (e.g., "I need $1.25.") can be challenging for ASR systems, particularly when numeric sequences are not seen during training. This out-of-vocabulary (OOV) issue is addressed in conventional ASR systems by training part of the model on spoken domain utterances (e.g., "I need one dollar and twenty five cents."), for which numeric sequences are composed of in-vocabulary numbers, and then using an FST verbalizer to denormalize the result. Unfortunately, conventional ASR models are not suitable for the low-memory setting of on-device speech recognition. E2E models such as RNN-T are attractive for on-device ASR, as they fold the AM, PM and LM of a conventional model into one neural network. However, in the on-device setting the large memory footprint of an FST denormer makes spoken domain training more difficult, forcing training back into the written domain and resulting in poor model performance on numeric sequences. In this paper, we investigate techniques to improve E2E model performance on numeric data. We find that using a text-to-speech system to generate additional training data that emphasizes difficult numeric utterances, as well as using an independently trained small-footprint neural network to perform spoken-to-written domain denorming, yields improvements in several numeric classes. In the case of the longest numeric sequences, for which the OOV issue is most prevalent, we see a reduction in WER by up to a factor of 7.
View details
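To make the TTS data-generation step concrete, a toy version might sample written-domain dollar amounts, verbalize them into the spoken domain for synthesis, and keep the written form as the ASR target. The sketch below is a simplification using the third-party num2words package; the paper's actual verbalization grammar and numeric classes are not reproduced.

```python
# Toy sketch of numeric training-data generation: the spoken form is sent to TTS,
# and the written form serves as the recognition target. Uses a simplified "$X.YY"
# pattern and the num2words package (assumptions, not the paper's verbalizer).
import random
from num2words import num2words

def verbalize_amount(dollars, cents):
    spoken = f"{num2words(dollars)} dollar{'s' if dollars != 1 else ''}"
    if cents:
        spoken += f" and {num2words(cents)} cent{'s' if cents != 1 else ''}"
    return spoken

def sample_numeric_pair():
    dollars, cents = random.randint(0, 9999), random.randint(0, 99)
    written = f"${dollars}.{cents:02d}"
    spoken = verbalize_amount(dollars, cents)
    return spoken, written   # spoken text -> TTS audio; written text -> ASR target

if __name__ == "__main__":
    for _ in range(3):
        print(sample_numeric_pair())
```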
Preview abstract
End-to-end (E2E) models are a promising research direction in speech recognition, as the single all-neural E2E system offers a much simpler and more compact solution compared to a conventional model, which has separate acoustic (AM), pronunciation (PM) and language models (LM). However, it has been noted that E2E models perform poorly on tail words and proper nouns, likely because training requires joint audio-text pairs and does not take advantage of the large amount of text-only data used to train the LMs in conventional models. There have been numerous efforts to train an RNN-LM on text-only data and fuse it into the end-to-end model. In this work, we contrast this approach with training the E2E model on audio-text pairs generated from unsupervised speech data. To target the proper-noun issue specifically, we adopt a Part-of-Speech (POS) tagger to filter the unsupervised data and use only utterances containing proper nouns. We show that training with the filtered unsupervised data provides up to a 13% relative reduction in word error rate (WER), and, when used in conjunction with a cold-fusion RNN-LM, up to a 17% relative improvement.
View details
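A minimal version of the proper-noun filter could rely on an off-the-shelf POS tagger over the machine-generated transcripts, as sketched below with NLTK; the tagger and selection criteria used in the paper may differ.

```python
# Minimal sketch of a proper-noun filter over unsupervised (audio, transcript) pairs,
# using NLTK's POS tagger ('NNP'/'NNPS' tags). The paper's exact tagger and criteria
# are not specified here.
# Requires: nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
import nltk

def has_proper_noun(transcript: str) -> bool:
    tokens = nltk.word_tokenize(transcript)
    return any(tag in ("NNP", "NNPS") for _, tag in nltk.pos_tag(tokens))

def filter_unsupervised_pairs(pairs):
    """Keep only (audio, machine_transcript) pairs whose transcript has a proper noun."""
    return [(audio, text) for audio, text in pairs if has_proper_noun(text)]
```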
VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking
Prashant Sridhar
Ye Jia
ICASSP 2019 (2018)
Preview abstract
In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) a speaker recognition network that produces speaker-discriminative embeddings; (2) a spectrogram masking network that takes both the noisy spectrogram and the speaker embedding as input, and produces a mask. Our system significantly reduces the speech recognition WER on multi-speaker signals, with minimal WER degradation on single-speaker signals.
View details
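Schematically, the masking network conditions every spectrogram frame on the same speaker embedding and predicts a soft mask, as in the sketch below; layer types and sizes are assumptions, not the released VoiceFilter architecture.

```python
# Schematic sketch (not the VoiceFilter model): a masking network that takes a noisy
# magnitude spectrogram plus a fixed speaker embedding (d-vector) and predicts a soft
# mask for the target speaker. Layer choices and sizes are assumptions.
import torch
import torch.nn as nn

class SpeakerConditionedMasker(nn.Module):
    def __init__(self, n_freq=257, d_embed=256, hidden=400):
        super().__init__()
        self.blstm = nn.LSTM(n_freq + d_embed, hidden,
                             batch_first=True, bidirectional=True)
        self.mask_head = nn.Sequential(nn.Linear(2 * hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy_spec, d_vector):
        # noisy_spec: (B, T, n_freq) magnitudes; d_vector: (B, d_embed) speaker embedding
        ctx = d_vector.unsqueeze(1).expand(-1, noisy_spec.size(1), -1)  # repeat per frame
        h, _ = self.blstm(torch.cat([noisy_spec, ctx], dim=-1))
        mask = self.mask_head(h)                 # values in [0, 1]
        return mask * noisy_spec                 # enhanced spectrogram estimate
```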
Unsupervised Context Learning For Speech Recognition
Justin Scheiner
Spoken Language Technology (SLT) Workshop, IEEE (2016)
Preview abstract
It has been shown in the literature that automatic speech recognition systems can greatly benefit from contextual information [ref]. The contextual information can be used to simplify the search and improve recognition accuracy. Useful types of contextual information include the name of the application the user is in, the contents of the user's phone screen, the user's location, a certain dialog state, etc. Building a separate language model for each of these types of context is not feasible due to limited resources or a limited amount of training data.
In this paper we describe an approach for unsupervised learning of contextual information and automatic building of contextual (biasing) models. Our approach can be used to build a large number of small contextual models from a limited amount of available unsupervised training data. We describe how n-grams relevant to a particular context are automatically selected, as well as how the optimal size of the final contextual model is chosen. Our experimental results show significant accuracy improvements for several types of context.
View details
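One simple way to realize the n-gram selection step is to keep n-grams that are disproportionately frequent in a context's transcripts relative to background data, then cap the biasing model at a fixed size; the sketch below uses illustrative thresholds and scoring rather than the paper's selection procedure.

```python
# Rough sketch of context n-gram selection: score each n-gram by how much more
# frequent it is in a context's transcripts than in background data, keep high-ratio
# n-grams, and cap the biasing model size. Thresholds and sizes are illustrative only.
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def select_biasing_ngrams(context_texts, background_texts, n=2,
                          min_ratio=5.0, max_model_size=1000):
    ctx = Counter(g for t in context_texts for g in ngrams(t.split(), n))
    bg = Counter(g for t in background_texts for g in ngrams(t.split(), n))
    ctx_total, bg_total = sum(ctx.values()) or 1, sum(bg.values()) or 1
    scored = []
    for gram, count in ctx.items():
        ratio = (count / ctx_total) / ((bg[gram] + 1) / bg_total)  # add-one smoothing
        if ratio >= min_ratio:
            scored.append((ratio, gram))
    scored.sort(reverse=True)
    return [gram for _, gram in scored[:max_model_size]]
```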