Shuo-yiin Chang

Authored Publications
    Text injection for automatic speech recognition (ASR), wherein unpaired text-only data is used to supplement paired audio-text data, has shown promising improvements for word error rate. This study examines the use of text injection for auxiliary tasks, which are the non-ASR tasks often performed by an E2E model. In this work, we use joint end-to-end and internal language model training (JEIT) as our text injection algorithm to train an ASR model which performs two auxiliary tasks. The first is capitalization, which is a de-normalization task. The second is turn-taking prediction, which attempts to identify whether a user has completed their conversation turn in a digital assistant interaction. We show results demonstrating that our text injection method boosts capitalization performance for long-tail data, and improves turn-taking detection recall.
    Improving the performance of end-to-end ASR models on long utterances of minutes to hours is an ongoing problem in speech recognition. A common solution is to segment the audio in advance using a separate voice activity detector (VAD) that decides segment boundaries based purely on acoustic speech/non-speech information. VAD segmenters, however, may be sub-optimal for real-world speech where, e.g., a complete sentence that should be taken as a whole may contain hesitations in the middle ("set an alarm for... 5 o'clock"). Here, we propose replacing the VAD with an end-to-end ASR model capable of predicting segment boundaries, allowing the segmentation to be conditioned not only on deeper acoustic features but also on linguistic features from the decoded text, while requiring negligible extra compute. In experiments on real-world long-form audio (YouTube) up to 30 minutes long, we demonstrate WER gains of 5% relative to the VAD baseline on a state-of-the-art Conformer RNN-T setup.
    Automatic speech recognition (ASR) systems typically rely on an external endpointer (EP) model to identify speech boundaries. This EP model strongly affects latency, but is subject to computational constraints, which limits prediction accuracy. We propose a method to jointly train the ASR and EP tasks in a single end-to-end (E2E) multitask model, improving EP quality by optionally leveraging information from the ASR audio encoder. We introduce a "switch" connection, which trains the EP to consume either the audio frames directly or low-level latent representations from the ASR model. This allows flexibility during inference to produce a low-cost prediction or a higher quality prediction if ASR computation is ongoing. We present results on a voice search test set showing that, compared to separate single-task models, this approach reduces median endpoint latency by 130ms (33.3% reduction), and 90th percentile latency by 160ms (22.2% reduction), without regressing word-error rate. For continuous recognition, WER improves by 10.6% (relative).
    Streaming automatic speech recognition (ASR) aims to output each hypothesized word as quickly and accurately as possible. However, reducing latency while retaining accuracy is highly challenging. Existing approaches including Early and Late Penalties~\cite{li2020towards} and Constrained Alignment~\cite{sainath2020emitting} penalize emission delay by manipulating per-token or per-frame RNN-T output logits. While successful in reducing latency, these approaches lead to significant accuracy degradation. In this work, we propose a sequence-level emission regularization technique, named FastEmit, that applies emission latency regularization directly on the transducer forward-backward probabilities. We demonstrate that FastEmit is better suited to the sequence-level transducer~\cite{Graves12} training objective for streaming ASR networks. We apply FastEmit to various end-to-end (E2E) ASR networks including RNN-Transducer~\cite{Ryan19}, Transformer-Transducer~\cite{zhang2020transformer}, ConvNet-Transducer~\cite{han2020contextnet} and Conformer-Transducer~\cite{gulati2020conformer}, and achieve 150-300ms latency reduction over previous art without accuracy degradation on a Voice Search test set. On LibriSpeech, FastEmit also improves streaming ASR accuracy from 4.4%/8.9% to 3.1%/7.5% WER, while reducing 90th percentile latency from 210 ms to only 30 ms.
    On-device end-to-end (E2E) models have shown improvements over a conventional model on Search test sets in both quality, as measured by Word Error Rate (WER), and latency, measured by the time the result is finalized after the user stops speaking. However, the E2E model is trained on a small fraction of audio-text pairs compared to the 100 billion text utterances that a conventional language model (LM) is trained with. Thus E2E models perform poorly on rare words and phrases. In this paper, building upon the two-pass streaming Cascaded Encoder E2E model, we explore using a Hybrid Autoregressive Transducer (HAT) factorization to better integrate an on-device neural LM trained on text-only data. To further improve decoder latency, we also introduce a non-recurrent embedding decoder, in place of the typical LSTM decoder, into the Cascaded Encoder model. Overall, we present a streaming on-device model that incorporates an external neural LM and outperforms the conventional model in both search and rare-word quality, as well as latency, and is 318X smaller.
    Thus far, end-to-end (E2E) models have not been shown to outperform state-of-the-art conventional models with respect to both quality, i.e., word error rate (WER), and latency, i.e., the time the hypothesis is finalized after the user stops speaking. In this paper, we develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer that surpasses a conventional model in both quality and latency. On the quality side, we incorporate a large number of utterances across varied domains to increase acoustic diversity and the vocabulary seen by the model. We also train with accented English speech to make the model more robust to different pronunciations. In addition, given the increased amount of training data, we explore a varied learning rate schedule. On the latency front, we explore using the end-of-sentence decision emitted by the RNN-T model to close the microphone, and also introduce various optimizations to improve the speed of LAS rescoring. Overall, we find that RNN-T+LAS offers a better WER and latency tradeoff compared to a conventional model. For example, for the same latency, RNN-T+LAS obtains an 8% relative improvement in WER, while being more than 400 times smaller in model size.
    Latency is a crucial metric for streaming speech recognition systems. In this paper, we reduce latency by fetching responses early based on the partial recognition results and refer to it as prefetching. Specifically, prefetching works by submitting partial recognition results for subsequent processing such as obtaining assistant server responses or second-pass rescoring before the recognition result is finalized. If the partial result matches the final recognition result, the early fetched response can be delivered to the user instantly. This effectively speeds up the system by saving the execution latency that typically happens after recognition is completed. Prefetching can be triggered multiple times for a single query, but this leads to multiple rounds of downstream processing and increases the computation costs. It is hence desirable to fetch the result sooner while limiting the number of prefetches. To achieve the best trade-off between latency and computation cost, we investigated a series of prefetching decision models including decoder silence based prefetching, acoustic silence based prefetching and end-to-end prefetching. In this paper, we demonstrate that the proposed prefetching mechanism reduces latency by 200 ms for a system that consists of a streaming first pass model using recurrent neural network transducer (RNN-T) and a non-streaming second pass rescoring model using Listen, Attend and Spell (LAS) [1]. We observe that end-to-end prefetching provides the best trade-off between cost and latency, and is 100 ms faster than silence based prefetching at a fixed prefetch rate.
    End-to-end (E2E) models fold the acoustic, pronunciation and language models of a conventional speech recognition model into one neural network with a much smaller number of parameters than a conventional ASR system, thus making it suitable for on-device applications. For example, the recurrent neural network transducer (RNN-T) is a streaming E2E model that has shown promising potential for on-device ASR. For such applications, quality and latency are two critical factors. We propose to reduce the E2E model's latency by extending the RNN-T endpointer (RNN-T EP) model with additional early and late penalties. By further applying the minimum word error rate (MWER) training technique, we achieved 8.0% relative word error rate (WER) reduction and 130ms 90-percentile latency reduction on a Voice Search test set. We also experimented with a second pass Listen, Attend and Spell (LAS) rescorer for the RNN-T EP model. Although it cannot directly improve the first pass latency, the large WER reduction gives us more room to trade WER for latency. RNN-T+LAS, together with EMBR training, brings a 17.3% relative WER reduction while maintaining a similar 120ms 90-percentile latency reduction.
    In this paper, we propose "personal VAD", a system to detect the voice activity of a target speaker at the frame level. This system is useful for gating the inputs to a streaming speech recognition system, such that it only triggers for the target user, which helps reduce the computational cost and battery consumption. We achieve this by training a VAD-like neural network which is conditioned on the target speaker embedding or the speaker verification score. For every frame, personal VAD outputs the scores for three classes: non-speech, target speaker speech, and non-target speaker speech. With our optimal setup, we are able to train a 130KB model that outperforms a baseline system where individually trained standard VAD and speaker recognition networks are combined to perform the same task.
    End-to-end (E2E) models, which directly predict output character sequences given input speech, are good candidates for on-device speech recognition. E2E models, however, present numerous challenges: In order to be truly useful, such models must decode speech utterances in a streaming fashion, in real time; they must be robust to the long tail of use cases; they must be able to leverage user-specific context (e.g., contact lists); and above all, they must be extremely accurate. In this work, we describe our efforts at building an E2E speech recognizer using a recurrent neural network transducer. In experimental evaluations, we find that the proposed approach can outperform a conventional CTC-based model in terms of both latency and accuracy in a number of evaluation categories.
    In speech recognition systems, we generally differentiate between long-form speech and voice queries, where endpointers are responsible for speech detection and query endpoint detection respectively. Detection of speech is useful for segmentation and pre-filtering in long-form speech processing. On the other hand, query endpoint detection predicts when to stop listening and send audio received so far for actions. It thus determines system latency and is an essential component for interactive voice systems. For both tasks, the endpointer needs to be robust in challenging environments, including noisy conditions, reverberant environments and environments with background speech, and it has to generalize well to different domains with different speaking styles and rhythms. This work investigates building a unified endpointer by folding the separate speech detection and query endpoint detection tasks into a single neural network model through multitask learning. A categorical domain representation is further incorporated into the model to encourage learning domain specific information. The final unified model achieves around 100 ms (18% relative) latency improvement for near-field voice queries and 150 ms (21% relative) for far-field voice queries over simply pooling all the data together, and 7% relative frame error rate reduction for long-form speech compared to a standalone speech detection model. The proposed approach also shows good robustness to noisy environments and yields 180 ms latency improvement on voice queries from an unseen domain.
    The tradeoff between word error rate (WER) and latency is very important for online automatic speech recognition (ASR) applications. We want the system to endpoint and close the microphone as quickly as possible, without degrading WER. For conventional ASR systems, endpointing is a separate model from the acoustic, pronunciation and language models (AM, PM, LM), which can often cause endpointer problems, with either a higher WER or larger latency. In going with the all-neural spirit of end-to-end (E2E) models, which fold the AM, PM and LM into one neural network, in this work we look at folding the endpointer into the model. On a large vocabulary Voice Search task, we show that joint optimization of the endpointer with the E2E model results in no quality degradation and reduces latency by more than a factor of 2 compared to having a separate endpointer with the E2E model.
    Voice-activity-detection (VAD) is the task of predicting where in the utterance is speech versus background noise. It is an important first step to determine when to open the microphone (i.e., start-of-speech) and close the microphone (i.e., end-of-speech) for streaming speech recognition applications such as Voice Search. Long short-term memory neural networks (LSTMs) have been a popular architecture for sequential modeling for acoustic signals, and have been successfully used for many VAD applications. However, it has been observed that LSTMs suffer from state saturation problems when the utterance is long (i.e., for voice dictation tasks), and thus require the LSTM state to be periodically reset. In this paper, we propose an alternative architecture that does not suffer from saturation problems by modeling temporal variations through a stateless dilated convolution neural network (CNN). The proposed architecture differs from conventional CNNs in three respects: (1) dilated causal convolution, (2) gated activations and (3) residual connections. Results on a Google Voice Typing task show that the proposed architecture achieves 14% relative FA improvement at a FR of 1% over state-of-the-art LSTMs for the VAD task. We also include detailed experiments investigating the factors that distinguish the proposed architecture from conventional convolution.
    In many streaming speech recognition applications such as voice search it is important to determine quickly and accurately when the user has finished speaking their query. A conventional approach to this task is to declare end-of-query whenever a fixed interval of silence is detected by a voice activity detector (VAD) trained to classify each frame as speech or silence. However silence detection and end-of-query detection are fundamentally different tasks, and the criterion used during VAD training may not be optimal. In particular the conventional approach ignores potential acoustic cues such as filler sounds and past speaking rate which may indicate whether a given pause is temporary or query-final. In this paper we present a simple modification to make the conventional VAD training criterion more closely related to end-of-query detection. A unidirectional long short-term memory architecture allows the system to remember past acoustic events, and the training criterion incentivizes the system to learn to use any acoustic cues relevant to predicting future user intent. We show experimentally that this approach improves latency at a given accuracy for end-of-query detection for voice search.
    The task of endpointing is to determine when the user has finished speaking, which is important for interactive speech applications such as voice search and Google Home. In this paper, we propose a GLDNN-based (grid long short-term memory, deep neural network) endpointer model and show that it provides significant improvements over a state-of-the-art CLDNN (convolutional, long short-term memory, deep neural networks) model. Specifically, we replace the convolution layer with a grid LSTM layer that models both spectral and temporal variations through recurrent connections. Results show that the GLDNN achieves 39% relative improvement in false alarm rate at a fixed false reject rate of 2%, and reduces median latency by 11%. We also include detailed experiments investigating why grid LSTMs offer better performance than CLDNNs. Analysis reveals that the recurrent connection along the frequency axis is an important factor that greatly contributes to the performance of grid LSTMs, especially in the presence of background noise. Finally, we also show that multichannel input further increases robustness to background speech. Overall, we achieved 16% (100 ms) endpointer latency improvement relative to our previous best model.
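
Several of the abstracts above contrast learned endpointers with the conventional rule of declaring end-of-query once a VAD has reported a fixed interval of silence. The following is a minimal sketch of that baseline rule only, not of any model from these papers. The function name `end_of_query_frame`, the per-frame boolean input, and the choice to start counting silence only after some speech has been seen are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def end_of_query_frame(is_speech: Sequence[bool], silence_frames: int) -> Optional[int]:
    """Return the index of the frame at which end-of-query is declared.

    is_speech: hypothetical per-frame VAD decisions (True = speech).
    silence_frames: fixed number of consecutive silent frames required.
    Returns None if the threshold is never reached.
    """
    seen_speech = False  # assumption: leading silence does not count
    run = 0              # length of the current run of silent frames
    for i, speech in enumerate(is_speech):
        if speech:
            seen_speech = True
            run = 0
        elif seen_speech:
            run += 1
            if run >= silence_frames:
                return i
    return None


# A query with a two-frame hesitation in the middle.
frames = [False, True, True, False, False, True, False, False, False]
short_threshold = end_of_query_frame(frames, 2)  # fires during the hesitation
long_threshold = end_of_query_frame(frames, 3)   # waits for the final pause
```

With the shorter threshold the rule fires inside the mid-query pause, and with the longer one it waits until the end at the cost of extra latency, which is the trade-off the endpointing abstracts describe.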